Article

The Information Loss of a Stochastic Map

by James Fullwood 1,* and Arthur J. Parzygnat 2,*
1 School of Mathematical Sciences, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
2 Institut des Hautes Études Scientifiques, 35 Route de Chartres, 91440 Bures-sur-Yvette, France
* Authors to whom correspondence should be addressed.
Entropy 2021, 23(8), 1021; https://doi.org/10.3390/e23081021
Submission received: 13 July 2021 / Revised: 3 August 2021 / Accepted: 4 August 2021 / Published: 8 August 2021
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

We provide a stochastic extension of the Baez–Fritz–Leinster characterization of the Shannon information loss associated with a measure-preserving function. This recovers the conditional entropy and a closely related information-theoretic measure that we call conditional information loss. Although not functorial, these information measures are semi-functorial, a concept we introduce that is definable in any Markov category. We also introduce the notion of an entropic Bayes’ rule for information measures, and we provide a characterization of conditional entropy in terms of this rule.

1. Introduction

The information loss $K(f)$ associated with a measure-preserving function $(X,p)\xrightarrow{f}(Y,q)$ between finite probability spaces is given by the Shannon entropy difference
$$K(f) := H(p) - H(q),$$
where $H(p) := -\sum_{x\in X} p_x\log p_x$ is the Shannon entropy of p (and similarly for q). In [1], Baez, Fritz, and Leinster proved that the information loss satisfies, and is uniquely characterized up to a non-negative multiplicative factor by, the following conditions:
0.
Positivity: $K(f)\geq 0$ for all $(X,p)\xrightarrow{f}(Y,q)$. This says that the information loss associated with a deterministic process is always non-negative.
1.
Functoriality: $K(g\circ f) = K(g) + K(f)$ for every composable pair $(f,g)$ of measure-preserving maps. This says that the information loss of two successive processes is the sum of the information losses associated with each process.
2.
Convex Linearity: $K\big(\lambda f\oplus(1-\lambda)g\big) = \lambda K(f) + (1-\lambda)K(g)$ for all $\lambda\in(0,1)$. This says that the information loss associated with tossing a (possibly unfair) coin in deciding amongst two processes is the associated weighted sum of their information losses.
3.
Continuity: $K(f)$ is a continuous function of f. This says that the information loss does not change much under small perturbations (i.e., is robust with respect to errors).
As measure-preserving functions may be viewed as deterministic stochastic maps, it is natural to ask whether there exist extensions of the Baez–Fritz–Leinster (BFL) characterization of information loss to maps that are inherently random (i.e., stochastic) in nature. In particular, what information-theoretic quantity captures such an information loss in this larger category?
This question is answered in the present work. Namely, we extend the BFL characterization theorem, which is valid on deterministic maps, to the larger category of stochastic maps. In doing so, we also find a characterization of the conditional entropy. Although the resulting extension is not functorial on the larger category of stochastic maps, we formalize a weakening of functoriality that restricts to functoriality on deterministic maps. This weaker notion of functoriality is definable in any Markov category [2,3], and it provides a key axiom in our characterization.
To explain how we arrive at our characterization, let us first recall the definition of stochastic maps between finite probability spaces, of which the measure-preserving functions are a special case. A stochastic map $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ associates with every $x\in X$ a probability distribution $f_x$ on Y such that $q_y = \sum_{x\in X} f_{yx}p_x$, where $f_{yx}$ is the distribution $f_x$ evaluated at $y\in Y$. In terms of information flow, the space $(X,p)$ may be thought of as a probability distribution on the set of inputs for a communication channel described by the stochastic matrix $f_{yx}$, while $(Y,q)$ is then thought of as the induced distribution on the set of outputs of the channel.
Extending the information loss functor by assigning $H(p)-H(q)$ to any stochastic map $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ would indeed result in an assignment that satisfies conditions 1–3 listed above. However, it would no longer be positive, and the interpretation as an information loss would be gone. Furthermore, no additional information about the stochasticity of the map f would be used in determining this assignment. In order to guarantee positivity, an additional term, depending on the stochasticity of f, is needed. This term is provided by the conditional entropy of $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ and is given by the non-negative real number
$$H(f|p) := \sum_{x\in X} p_x H(f_x),$$
where $H(f_x)$ is the Shannon entropy of the distribution $f_x$ on Y (in the case that $(X,p)$ and $(Y,q)$ are probability spaces associated with the alphabets of random variables X and Y, then $H(f|p)$ coincides with the conditional entropy $H(Y|X)$ [4]). If $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ is in fact deterministic, i.e., if $f_x$ is a point-mass distribution for all $x\in X$, then $H(f_x) = 0$ for all $x\in X$, and hence $H(f|p) = 0$. As such, $H(f|p)$ is a measure of the uncertainty (or randomness) of the outputs of f averaged over the prior distribution p on the set X of its inputs. Indeed, $H(f|p)$ is maximized precisely when $f_x$ is the uniform distribution on Y for all $x\in X$.
Therefore, given a stochastic map $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$, we call
$$K(f) := H(p) - H(q) + H(f|p)$$
the conditional information loss of $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ (the same letter K is used here because it agrees with the Shannon entropy difference when f is deterministic). As $H(f|p) = 0$ whenever f is deterministic, the conditional information loss restricts to the category of measure-preserving functions as the information loss functor of Baez, Fritz, and Leinster, while also satisfying conditions 0, 2, and 3 (i.e., positivity, convex linearity, and continuity) on the larger category of stochastic maps. However, conditional information loss is not functorial in general, and while this may seem like a defect at first glance, we prove that there is no extension of the information loss functor that remains functorial on the larger category of stochastic maps if the positivity axiom is to be preserved, thus retaining an interpretation as information loss. In spite of this, conditional information loss does satisfy a weakened form of functoriality, which we briefly describe now.
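To make these definitions concrete, here is a minimal numerical sketch (not from the paper; the channel and prior are arbitrary illustrative values) computing $H(p)$, $H(q)$, $H(f|p)$, and $K(f)$ with NumPy, storing the stochastic map as a column-stochastic matrix $f_{yx}$.

```python
import numpy as np

def entropy(v):
    """Shannon entropy with the convention 0*log(0) = 0."""
    v = np.asarray(v, dtype=float)
    v = v[v > 0]
    return -np.sum(v * np.log(v))

p = np.array([0.5, 0.3, 0.2])             # prior on X = {0, 1, 2}
f = np.array([[1.0, 0.25, 0.0],           # f[y, x] = f_{yx}; each column is the distribution f_x
              [0.0, 0.75, 1.0]])

q = f @ p                                  # induced distribution on Y
H_f_given_p = sum(p[x] * entropy(f[:, x]) for x in range(len(p)))
K = entropy(p) - entropy(q) + H_f_given_p  # conditional information loss

print(q, H_f_given_p, K)                   # K >= 0, in accordance with positivity
```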
A pair $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ of composable stochastic maps is a.e. coalescable if and only if for every pair of elements $z\in Z$ and $x\in X$ for which $r_z>0$ and $p_x>0$, there exists a unique $y\in Y$ such that $f_{yx}>0$ and $g_{zy}>0$. Intuitively, this says that the information about the intermediate step can be recovered given knowledge about the input and output. In particular, if f is deterministic, then the pair $(f,g)$ is a.e. coalescable (for obvious reasons, since knowing x alone is enough to determine the intermediate value). However, there are many other situations where a pair could be a.e. coalescable and the maps need not be deterministic. With this definition in place (which we also generalize to the setting of arbitrary Markov categories), we replace functoriality with the following weaker condition.
1🟉.
Semi-functoriality: $K(g\circ f) = K(g) + K(f)$ for every a.e. coalescable pair $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ of stochastic maps. This says that the conditional information loss of two successive processes is the sum of the conditional information losses associated with each process provided that the information in the intermediate step can always be recovered.
Replacing functoriality with semi-functoriality is not enough to characterize the conditional information loss. However, it comes quite close, as only one more axiom is needed. Assuming positivity, semi-functoriality, convex linearity, and continuity, there are several equivalent axioms that may be stipulated to characterize the conditional information loss. To explain the first option, we introduce a convenient factorization of every stochastic map $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$. The bloom-shriek factorization of f is given by the decomposition $f = \pi_Y\circ ¡_f$, where $¡_f : X\rightsquigarrow X\times Y$ is the bloom of f, whose value at x is the probability measure on $X\times Y$ given by sending $(x',y)$ to $\delta_{x'x}f_{yx}$, where $\delta_{x'x}$ is the Kronecker delta. In other words, $¡_f$ records each of the probability measures $f_x$ on a copy of Y indexed by $x\in X$. A visualization of the bloom of f is given in Figure 1a. When one is given the additional data of probability measures p and q on X and Y, respectively, then Figure 1b illustrates the bloom-shriek factorization of f. From this point of view, $¡_f$ keeps track of the information encoded in both p and f, while the projection map $\pi_Y$ forgets, or loses, some of this information.
With this in mind, our final axiom to characterize the conditional information loss is
4(a).
Reduction: $K(f) = K(\pi_Y)$, where $f = \pi_Y\circ ¡_f$ is the bloom-shriek factorization of f. This says that the conditional information loss of f equals the information loss of the projection using the associated joint distribution on $X\times Y$.
Note that this axiom describes how K is determined by its action on an associated class of deterministic morphisms. These slightly modified axioms, namely, semi-functoriality, convex linearity, continuity, and reduction, characterize the conditional information loss and therefore extend Baez, Fritz, and Leinster’s characterization of information loss. A much simpler axiom that may be invoked in place of the reduction axiom which also characterizes conditional information loss is the following.
4(b).
Blooming: $K(¡_p) = 0$, where $¡_p$ is the unique map $(\bullet,1)\rightsquigarrow(X,p)$ from a one-point probability space to $(X,p)$. This says that if a process begins with no prior information, then there is no information to be lost in the process.
The conditional entropy itself can be extracted from the conditional information loss by a process known as Bayesian inversion, which we now briefly recall. Given a stochastic map $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$, there exists a stochastic map $(Y,q)\overset{\bar f}{\rightsquigarrow}(X,p)$ such that $f_{yx}p_x = \bar f_{xy}q_y$ for all $x\in X$ and $y\in Y$ (the stochastic map $\bar f$ is the almost everywhere unique conditional probability so that Bayes’ rule holds). Such a map is called a Bayesian inverse of f. The Bayesian inverse can be visualized using the bloom-shriek factorization because it itself has a bloom-shriek factorization $\bar f = \pi_X\circ ¡_{\bar f}$. This is obtained by finding the stochastic maps in the opposite direction of the arrows so that they reproduce the appropriate volumes of the water droplets.
Given this perspective on Bayesian inversion, we prove that the conditional entropy of $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ equals the conditional information loss of its Bayesian inverse $(Y,q)\overset{\bar f}{\rightsquigarrow}(X,p)$. Moreover, since the conditional information loss of $\bar f$ is just the information loss of $\pi_X$, this indicates how the conditional entropy and the conditional information loss are the ordinary information losses associated with the two projections $\pi_X$ and $\pi_Y$ in Figure 1b. This duality also provides an interesting perspective on conditional entropy and its characterization. Indeed, using Bayesian inversion, we also characterize the conditional entropy as the unique assignment F sending measure-preserving stochastic maps between finite probability spaces to real numbers satisfying conditions 0, 1🟉, 2, and 3 above, but with a new axiom that reads as follows.
4(c).
Entropic Bayes’ Rule: $F(f) + F(¡_p) = F(\bar f) + F(¡_q)$ for all $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$. This is an information-theoretic analogue of Bayes’ rule, which reads $f_{yx}p_x = \bar f_{xy}q_y$ for all $x\in X$ and $y\in Y$, or in more traditional probabilistic notation $P(y|x)P(x) = P(x|y)P(y)$.
In other words, we obtain a Bayesian characterization of the conditional entropy. This provides an entropic and information-theoretic description of Bayes’ rule from the Markov category perspective, in a way that we interpret as answering an open question of Fritz [6].

2. Categories of Stochastic Maps

In the first few sections, we define all the concepts involved in proving that the conditional information loss satisfies the properties that we will later prove characterize it. This section introduces the domain category and its convex structure.
Definition 1.
Let X and Y be finite sets. A stochastic map $f : X\rightsquigarrow Y$ associates a probability measure $f_x$ on Y to every $x\in X$. If $f : X\rightsquigarrow Y$ is such that $f_x$ is a point-mass distribution for every $x\in X$, then f is said to be deterministic.
Notation 1.
Given a stochastic map $f : X\rightsquigarrow Y$ (also written as $X\overset{f}{\rightsquigarrow}Y$), the value $f_x(y)\in[0,1]$ will be denoted by $f_{yx}$. As there exists a canonical bijection between deterministic maps of the form $X\rightsquigarrow Y$ and functions $X\to Y$, deterministic maps from X to Y will be denoted by the functional notation $X\to Y$.
Definition 2.
A stochastic map of the form $\bullet\rightsquigarrow X$ from a single-element set • to a finite set X is a single probability measure on X. Its unique value at x will be denoted by $p_x$ for all $x\in X$. The set $N_p := \{x\in X \,|\, p_x = 0\}$ will be referred to as the nullspace of p.
Definition 3.
Let FinStoch be the category of stochastic maps between finite sets. Given a finite set X, the identity map of X in FinStoch corresponds to the identity function $\mathrm{id}_X : X\to X$, and given stochastic maps $f : X\rightsquigarrow Y$ and $g : Y\rightsquigarrow Z$, the composite $g\circ f : X\rightsquigarrow Z$ is given by the Chapman–Kolmogorov equation $(g\circ f)_{zx} := \sum_{y\in Y} g_{zy}f_{yx}$.
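As a small illustration (not part of the paper), stochastic maps between finite sets can be stored as column-stochastic matrices, in which case the Chapman–Kolmogorov composite is an ordinary matrix product.

```python
import numpy as np

def compose(g, f):
    """(g . f)_{zx} = sum_y g_{zy} f_{yx} (Chapman-Kolmogorov), for column-stochastic g, f."""
    return g @ f

f = np.array([[0.5, 0.0],     # f : X ~> Y
              [0.5, 1.0]])
g = np.array([[0.2, 0.9],     # g : Y ~> Z
              [0.8, 0.1]])

gf = compose(g, f)
assert np.allclose(gf.sum(axis=0), 1.0)   # columns of a stochastic map sum to 1
```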
Definition 4.
Let X be a finite set. The copy of X is the diagonal embedding $\Delta_X : X\to X\times X$, and the discard of X is the unique map from X to the terminal object • in FinStoch, which will be denoted by $!_X : X\to\bullet$. If Y is another finite set, the swap map is the map $\gamma : X\times Y\to Y\times X$ given by $(x,y)\mapsto(y,x)$. Given morphisms $f : X\rightsquigarrow X'$ and $g : Y\rightsquigarrow Y'$ in FinStoch, the product of f and g is the stochastic map $f\times g : X\times Y\rightsquigarrow X'\times Y'$ given by $(f\times g)_{(x',y')(x,y)} := f_{x'x}\,g_{y'y}$.
The product of stochastic maps endows FinStoch with the structure of a monoidal category. Together with the copy, discard, and swap maps, FinStoch is a Markov category [2,3].
Definition 5.
Let FinPS (this stands for “finite probabilities and stochastic maps”) be the co-slice category of FinStoch under the one-point set •, i.e., the category whose objects are pairs $(X,p)$ consisting of a finite set X equipped with a probability measure p, and a morphism from $(X,p)$ to $(Y,q)$ is a stochastic map $X\overset{f}{\rightsquigarrow}Y$ such that $q_y = \sum_{x\in X} f_{yx}p_x$ for all $y\in Y$. The subcategory of deterministic maps in FinPS will then be denoted by FinPD (which stands for “finite probabilities and deterministic maps”). A pair $(f,g)$ of morphisms in FinPS is said to be a composable pair iff $g\circ f$ exists.
Note that the category FinPD was called FinProb in [1].
Remark 1.
Though it is often the case that we will denote a morphism $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ in FinPS simply by f, such notation is potentially ambiguous, as the morphism $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ is distinct from the morphism $(X,p')\overset{f}{\rightsquigarrow}(Y,q')$ whenever $p\neq p'$. As such, we will only employ the shorthand of denoting a morphism in FinPS by its underlying stochastic map whenever the source and target of the morphism are clear from the context.
Lemma 1.
The object $(\bullet,1)$ given by a single-element set equipped with its unique probability measure is a zero object (i.e., terminal and initial) in FinPS.
Definition 6.
Given an object $(X,p)$ in FinPS, the shriek and bloom of p are the unique maps to and from $(\bullet,1)$, respectively, which will be denoted $!_p : (X,p)\to(\bullet,1)$ and $¡_p : (\bullet,1)\rightsquigarrow(X,p)$ (the former is deterministic, while the latter is stochastic). The underlying stochastic maps associated with $!_p$ and $¡_p$ are $!_X : X\to\bullet$ and $p : \bullet\rightsquigarrow X$, respectively.
Example 1.
Since $(\bullet,1)$ is a zero object, given any two objects $(X,p)$ and $(Y,q)$, there exists at least one morphism $(Y,q)\rightsquigarrow(X,p)$, namely the composite $(Y,q)\xrightarrow{!_q}(\bullet,1)\overset{¡_p}{\rightsquigarrow}(X,p)$.
Definition 7.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be a morphism in FinPS. The joint distribution associated with f is the probability measure $\vartheta(f) : \bullet\rightsquigarrow X\times Y$ given by $\vartheta(f)_{(x,y)} = f_{yx}p_x$.
It is possible to take convex combinations of both objects and morphisms in FinPS , and such assignments will play a role in our characterization of conditional entropy.
Definition 8.
Let p : X be a probability measure and let { ( Y x , q x ) } x X be a collection of objects in FinPS indexed by X. The p-weighted convex sum x X p x ( Y x , q x ) is defined to be the set x X Y x equipped with the probability measure x X p x q x given by
x X p x q x z : = p x q z x i f z Y x 0 otherwise .
In addition, if ( Y x , q x ) Q x ( Y x , q x ) is a collection of morphisms in FinPS indexed by X, the p-weighted convex sum x X p x Q x : x X Y x , x X p x q x x X Y x , x X p x q x is given by
x X p x Q x z z : = p x ( Q x ) z z if ( z , z ) Y x × Y x 0 otherwise .
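A short sketch of Definition 8 in this matrix encoding (all sets and numbers are invented for illustration): the convex sum of the measures concatenates the p-weighted blocks, while the convex sum of the morphisms is the block-diagonal kernel, which remains column-stochastic and measure-preserving.

```python
import numpy as np
from scipy.linalg import block_diag

p = np.array([0.4, 0.6])                          # measure on the index set X = {0, 1}
q = [np.array([0.5, 0.5]), np.array([1.0])]       # q_x on Y_x
Q = [np.array([[0.9, 0.2],                        # Q_0 : Y_0 ~> Y'_0
               [0.1, 0.8]]),
     np.array([[0.3],                             # Q_1 : Y_1 ~> Y'_1
               [0.7]])]

r = np.concatenate([p[x] * q[x] for x in range(len(p))])   # measure on the disjoint union
h = block_diag(*Q)                                          # the convex sum of the morphisms

assert np.allclose(h.sum(axis=0), 1.0)                      # still column-stochastic
assert np.allclose(h @ r,
                   np.concatenate([p[x] * (Q[x] @ q[x]) for x in range(len(p))]))
```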

3. The Baez–Fritz–Leinster Characterization of Information Loss

In [1], Baez, Fritz, and Leinster (BFL) characterized the Shannon entropy difference associated with measure-preserving functions between finite probability spaces as the only non-vanishing, continuous, convex linear functor from FinPD to the non-negative reals (up to a multiplicative constant). It is then natural to ask whether there exist either extensions or analogues of their result by including non-deterministic morphisms from the larger category FinPS . Before delving deeper into such inquiry, we first recall in detail the characterization theorem of BFL.
Definition 9.
Let $B\mathbb{R}$ be the convex category consisting of a single object and whose set of morphisms is $\mathbb{R}$. The composition in $B\mathbb{R}$ is given by addition. Convex combinations of morphisms are given by ordinary convex combinations of numbers. The subcategory of non-negative reals will be denoted $B\mathbb{R}_{\geq 0}$.
In the rest of the paper, we will not necessarily assume that assignments from one category to another are functors. Nevertheless, we do assume they form (class) functions (see ([7], Section I.7) for more details). Furthermore, we assume that they respect or reflect sources and targets in the following sense. If C and D are two categories, all functions $F : C\to D$ are either covariant or contravariant in the sense that for any morphism $a\xrightarrow{\gamma}b$ in C, $F(\gamma)$ is a morphism from $F(a)$ to $F(b)$ or from $F(b)$ to $F(a)$, respectively. These are the only types of functions between categories we will consider in this work. As such, we therefore abuse terminology and use the term functions for such assignments throughout. If M is a commutative monoid and $BM$ denotes its one-object category, then every covariant function $C\to BM$ is also contravariant and vice versa.
We now define a notion of continuity for functions of the form $F : \mathbf{FinPS}\to B\mathbb{R}$.
Definition 10.
A sequence of morphisms $(X_n,p_n)\overset{f_n}{\rightsquigarrow}(Y_n,q_n)$ in FinPS converges to a morphism $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ if and only if the following two conditions hold.
(a) 
There exists an $N\in\mathbb{N}$ for which $X_n = X$ and $Y_n = Y$ for all $n\geq N$.
(b) 
The following limits hold: $\lim_{n\to\infty} p_n = p$ and $\lim_{n\to\infty} f_n = f$ (note that these limits necessarily imply $\lim_{n\to\infty} q_n = q$).
A function $F : \mathbf{FinPS}\to B\mathbb{R}$ is continuous if and only if $\lim_{n\to\infty} F(f_n) = F(f)$ whenever $\{f_n\}$ is a sequence in FinPS converging to f.
Remark 2.
In the subcategory FinPD, since the topology on the collection of functions from a finite set X to another finite set Y is discrete, one can equivalently assume that a sequence $f_n$ as in Definition 10, but this time with all $f_n$ deterministic, converges to $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ if and only if the following two conditions hold.
(a) 
There exists an $N\in\mathbb{N}$ for which $X_n = X$, $Y_n = Y$, and $f_n = f$ for all $n\geq N$.
(b) 
One has $\lim_{n\to\infty} p_n = p$.
In this way, our definition of convergence agrees with the definition of convergence of BFL on the subcategory FinPD [1].
Definition 11.
A function $F : \mathbf{FinPS}\to B\mathbb{R}$ is said to be convex linear if and only if for all objects $(X,p)$ in FinPS,
$$F\Big(\bigoplus_{x\in X} p_x Q_x\Big) = \sum_{x\in X} p_x F(Q_x)$$
for all collections $\big\{(Y_x,q_x)\overset{Q_x}{\rightsquigarrow}(Y'_x,q'_x)\big\}_{x\in X}$ of morphisms in FinPS.
Definition 12.
A function $F : \mathbf{FinPS}\to B\mathbb{R}$ is said to be functorial if and only if it is in fact a functor, i.e., if and only if $F(g\circ f) = F(f) + F(g)$ for every composable pair $(f,g)$ in FinPS.
Definition 13.
Let $p : \bullet\rightsquigarrow X$ be a probability measure. The Shannon entropy of p is given by
$$H(p) := -\sum_{x\in X} p_x\log(p_x).$$
When considering any entropic quantity, we will always adhere to the convention that $0\log(0) = 0$.
Definition 14.
Given a map $(X,p)\xrightarrow{f}(Y,q)$ in FinPD, the Shannon entropy difference $H(p) - H(q)$ will be referred to as the information loss of f. Information loss defines a functor $K : \mathbf{FinPD}\to B\mathbb{R}$, henceforth referred to as the information loss functor on FinPD.
Theorem 1
(Baez–Fritz–Leinster [1]). Suppose $F : \mathbf{FinPD}\to B\mathbb{R}_{\geq 0}$ is a function which satisfies the following conditions.
1. 
F is functorial.
2. 
F is convex linear.
3. 
F is continuous.
Then F is a non-negative multiple of information loss. Conversely, the information loss functor is non-negative and satisfies conditions 1–3.
In light of Theorem 1, it is natural to question whether or not there exists a functor $K : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ that restricts to FinPD as the information loss functor. It turns out that no such non-vanishing functor exists, as we prove in the following proposition.
Proposition 1.
If $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is a functor, then $F(f) = 0$ for all morphisms f in FinPS.
Proof. 
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be a morphism in FinPS. Since F is a functor,
$$F(¡_q) = F(f\circ ¡_p) = F(f) + F(¡_p) \quad\implies\quad 0\leq F(f) = F(¡_q) - F(¡_p).$$
Let $(Y,q)\overset{g}{\rightsquigarrow}(X,p)$ be any morphism in FinPS (which necessarily exists by Example 1, for instance). Then a similar calculation yields
$$0\leq F(g) = F(¡_p) - F(¡_q) = -F(f).$$
Hence, F ( f ) = 0 . □

4. Extending the Information Loss Functor

Proposition 1 shows it is not possible to extend the information loss functor to a functor on FinPS. Nevertheless, in this section, we define a non-vanishing function $K : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ that restricts to the information loss functor on FinPD, which we refer to as conditional information loss. While K is not functorial, we show that it satisfies many important properties such as continuity, convex linearity, and invariance with respect to compositions with isomorphisms. Furthermore, in Section 5 we show K is functorial on a restricted class of composable pairs of morphisms (cf. Definition 18), which are definable in any Markov category. At the end of this section we characterize conditional information loss as the unique extension of the information loss functor satisfying the reduction axiom 4(a) as stated in the introduction. In Section 8, we prove an intrinsic characterization theorem for K without reference to the deterministic subcategory FinPD inside FinPS. Appendix A provides an interpretation of the vanishing of conditional information loss in terms of correctable codes.
Definition 15.
The conditional information loss of a morphism $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ in FinPS is the real number given by
$$K(f) := H(p) - H(q) + H(f|p),$$
where
$$H(f|p) := \sum_{x\in X} p_x H(f_x)$$
is the conditional entropy of $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$.
Proposition 2.
The function $K : \mathbf{FinPS}\to B\mathbb{R}$, uniquely determined on morphisms by sending $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ to $K(f)$, satisfies the following conditions.
(i) 
$K(f)\geq 0$.
(ii) 
K restricted to FinPD agrees with the information loss functor (cf. Definition 14).
(iii) 
K is convex linear.
(iv) 
K is continuous.
(v) 
Given $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$, then $K(f) = K(\pi_Y)$, where $(X\times Y,\vartheta(f))\xrightarrow{\pi_Y}(Y,q)$ is the projection and $\vartheta(f)$ is the joint distribution (cf. Definition 7).
Lemma 2.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be a morphism in FinPS. Then
$$K(f) = \sum_{x\in X\setminus N_p}\;\sum_{y\in Y\setminus N_{f_x}} f_{yx}p_x\log\!\left(\frac{q_y}{f_{yx}p_x}\right).$$
Proof of Lemma 2.
Applying K to f yields
$$\begin{aligned}
K(f) &= -\sum_{x,y} f_{yx}p_x\log(f_{yx}) - \sum_x p_x\log(p_x) + \sum_y q_y\log(q_y)\\
&= -\sum_{x,y} f_{yx}p_x\log(f_{yx}) - \sum_x\sum_y f_{yx}p_x\log(p_x) + \sum_y\sum_x f_{yx}p_x\log(q_y)\\
&= \sum_{x\in X\setminus N_p}\;\sum_{y\in Y\setminus N_{f_x}} f_{yx}p_x\log\!\left(\frac{q_y}{f_{yx}p_x}\right). \;\;\square
\end{aligned}$$
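As a quick numerical consistency check (illustrative values only), the expression in Lemma 2 agrees with Definition 15:

```python
import numpy as np

def H(v):
    v = v[v > 0]
    return -np.sum(v * np.log(v))

p = np.array([0.6, 0.4])
f = np.array([[0.7, 0.1],     # f[y, x] = f_{yx}
              [0.3, 0.9]])
q = f @ p

K_def = H(p) - H(q) + sum(p[x] * H(f[:, x]) for x in range(2))   # Definition 15

K_lem = 0.0                                                       # Lemma 2
for x in range(2):
    for y in range(2):
        if p[x] > 0 and f[y, x] > 0:
            K_lem += f[y, x] * p[x] * np.log(q[y] / (f[y, x] * p[x]))

assert np.isclose(K_def, K_lem)
```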
Proof of Proposition 2.
(i)
The non-negativity of K follows from Lemma 2 and the inequality $q_y = \sum_{x\in X} f_{yx}p_x \geq f_{yx}p_x$.
(ii)
This follows from the fact that H ( f | p ) = 0 for all deterministic f.
(iii)
Let $p : \bullet\rightsquigarrow X$ be a probability measure, and let $(Y_x,q_x)\overset{Q_x}{\rightsquigarrow}(Y'_x,q'_x)$ be a collection of morphisms in FinPS indexed by X. Then the p-weighted convex sum $\bigoplus_{x\in X} p_x Q_x$ is a morphism in FinPS of the form $(Z,r)\overset{h}{\rightsquigarrow}(Z',r')$, where $Z := \coprod_{x\in X} Y_x$, $Z' := \coprod_{x\in X} Y'_x$, $h := \bigoplus_{x\in X} p_x Q_x$, $r := \bigoplus_{x\in X} p_x q_x$, and $r' := \bigoplus_{x\in X} p_x q'_x$. Then, using $H(r) = H(p) + \sum_{x\in X} p_x H(q_x)$, $H(r') = H(p) + \sum_{x\in X} p_x H(q'_x)$, and $H(h\,|\,r) = \sum_{x\in X} p_x H(Q_x\,|\,q_x)$, all of which follow from the block structure of Definition 8,
$$K(h) = H(r) - H(r') + H(h\,|\,r) = \sum_{x\in X} p_x\Big(H(q_x) - H(q'_x) + H(Q_x\,|\,q_x)\Big) = \sum_{x\in X} p_x K(Q_x),$$
which shows that K is convex linear.
(iv)
Let $\big(X^{(n)},p^{(n)}\big)\overset{f^{(n)}}{\rightsquigarrow}\big(Y^{(n)},q^{(n)}\big)$ be a sequence (indexed by $n\in\mathbb{N}$) of probability-preserving stochastic maps such that $X^{(n)} = X$ and $Y^{(n)} = Y$ for large enough n, and where $\lim_{n\to\infty} f^{(n)} = f$, $\lim_{n\to\infty} p^{(n)} = p$, and $\lim_{n\to\infty} q^{(n)} = q$. Then
$$\lim_{n\to\infty} K\big(f^{(n)}\big) = \lim_{n\to\infty}\sum_{x,y} f^{(n)}_{yx}p^{(n)}_x\log\!\left(\frac{\sum_{x'} f^{(n)}_{yx'}p^{(n)}_{x'}}{f^{(n)}_{yx}p^{(n)}_x}\right) = K(f),$$
where the last equality follows from the fact that the limit and sum (which is finite) can be interchanged and all expressions are continuous on $[0,1]$.
(v)
This follows from
$$H\big(\vartheta(f)\big) = -\sum_{x\in X}\sum_{y\in Y} f_{yx}p_x\log(f_{yx}p_x) = -\sum_{x\in X}\sum_{y\in Y} f_{yx}p_x\log(f_{yx}) - \sum_{x\in X}\Big(\underbrace{\textstyle\sum_{y\in Y} f_{yx}}_{=1}\Big)p_x\log(p_x) = H(f|p) + H(p)$$
and the fact that $K(\pi_Y) = H\big(\vartheta(f)\big) - H(q)$. □
Remark 3.
Since conditional entropy vanishes for deterministic morphisms, conditional information loss restricts to FinPD as the information loss functor. It is important to note that if the term $H(f|p)$ were not included in the expression for $K(f)$, then the inequality $K(f)\geq 0$ would fail in general. When f is deterministic, Baez, Fritz, and Leinster proved $H(p) - H(q)\geq 0$. However, when f is stochastic, the inequality $H(p) - H(q)\geq 0$ does not hold in general. This has to do with the fact that stochastic maps may increase entropy, whereas deterministic maps always decrease it (while this claim holds in the classical setting as stated, it no longer holds for quantum systems [8]). As such, the term $H(f|p)$ is needed to retain non-negativity as one attempts to extend BFL’s functor K on FinPD to a function on FinPS.
Item (v) of Proposition 2 says that the conditional information loss of a map $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ in FinPS is the information loss of the deterministic map $(X\times Y,\vartheta(f))\xrightarrow{\pi_Y}(Y,q)$ in FinPD, so that the conditional information loss of a morphism in FinPS may always be reduced to the information loss of a naturally associated deterministic map in FinPD having the same target. This motivates the following definition.
Definition 16.
A function $F : \mathbf{FinPS}\to B\mathbb{R}$ is reductive if and only if $F(f) = F(\pi_Y)$ for every morphism $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ in FinPS (cf. Proposition 2 item (v) for notation).
Proposition 3
(Reductive characterization of conditional information loss). Let $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ be a function satisfying the following conditions.
(i) 
F restricted to FinPD is functorial, convex linear, and continuous.
(ii) 
F is reductive.
Then F is a non-negative multiple of conditional information loss. Conversely, conditional information loss satisfies conditions (i) and (ii).
Proof. 
This follows immediately from Theorem 1 and item (v) of Proposition 2. □
In what follows, we will characterize conditional information loss without any explicit reference to the subcategory FinPD or the information loss functor of Baez, Fritz, and Leinster. To do this, we first need to develop some machinery.

5. Coalescable Morphisms and Semi-Functoriality

While conditional information loss is not functorial on FinPS, we know it acts functorially on deterministic maps. As such, it is natural to ask for which pairs of composable stochastic maps the conditional information loss acts functorially. In this section, we answer this question, and then we use our result to define a property of functions $\mathbf{FinPS}\to B\mathbb{R}$ that is a weakening of functoriality, and which we refer to as semi-functoriality. Our definitions are valid in any Markov category (cf. Appendix B).
Definition 17.
A deterministic map $Z\times X\xrightarrow{h}Y$ is said to be a mediator for the composable pair $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ in FinPS if and only if
$$(g\circ f)_{zx} = g_{z\,h(z,x)}\,f_{h(z,x)\,x}\quad\text{for all } (z,x)\in Z\times(X\setminus N_p). \tag{1}$$
If in fact Equation (1) holds for all $(z,x)\in Z\times X$, then h is said to be a strong mediator for the composable pair $X\overset{f}{\rightsquigarrow}Y\overset{g}{\rightsquigarrow}Z$ in FinStoch.
Remark 4.
Mediators do not exist for general composable pairs, as one can see by considering any composable pair $(\bullet,1)\overset{¡_p}{\rightsquigarrow}(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ such that $H(\vartheta(f))\neq H(q)$ (cf. Definitions 7 and 13).
Proposition 4.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ be a composable pair of morphisms in FinPS. Then the following statements are equivalent.
(a) 
For every $x\in X\setminus N_p$ and $z\in Z$, there exists at most one $y\in Y$ such that $g_{zy}f_{yx}\neq 0$.
(b) 
The pair $(f,g)$ admits a mediator $Z\times X\xrightarrow{h}Y$.
(c) 
There exists a function $Z\times X\xrightarrow{h}Y$ such that
$$h_{y(z,x)}\,(g\circ f)_{zx}\,p_x = g_{zy}\,f_{yx}\,p_x\quad\text{for all } (z,y,x)\in Z\times Y\times X. \tag{2}$$
Proof. 
((a)⇒(b)) For every $(z,x)\in Z\times(X\setminus N_p)$ for which such a y exists, set $h(z,x) := y$. If no such y exists or if $x\in N_p$, set $h(z,x)$ to be anything. Then h is a mediator for $(f,g)$.
((b)⇒(c)) Let h be a mediator for $(f,g)$. Since (2) holds automatically for $x\in N_p$, suppose $x\in X\setminus N_p$, in which case (2) is equivalent to $h_{y(z,x)}(g\circ f)_{zx} = g_{zy}f_{yx}$ for all $(z,y)\in Z\times Y$. This follows from Equation (1) and the fact that h is a function.
((c)⇒(a)) Let $(z,x)\in Z\times(X\setminus N_p)$ and suppose $(g\circ f)_{zx} > 0$. If h is a function as in (c), then $\sum_{y\in Y} g_{zy}f_{yx} = (g\circ f)_{zx} = g_{z\,h(z,x)}f_{h(z,x)\,x}$. But since $g_{zy}f_{yx} = 0$ for all $y\neq h(z,x)$, there is only one non-vanishing term in this sum, and it is precisely $g_{z\,h(z,x)}f_{h(z,x)\,x}$. □
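Item (a) of Proposition 4 is directly checkable on matrices; the following sketch (the helper name is ours, not the paper's) tests whether a pair is a.e. coalescable.

```python
import numpy as np

def is_ae_coalescable(f, g, p, tol=1e-12):
    """True iff for every x with p_x > 0 and every z, at most one y has g_{zy} f_{yx} != 0."""
    for x in np.nonzero(p > tol)[0]:
        for z in range(g.shape[0]):
            if np.count_nonzero(g[z, :] * f[:, x] > tol) > 1:
                return False
    return True

p = np.array([0.5, 0.5])
g = np.array([[0.3, 0.8],
              [0.7, 0.2]])

print(is_ae_coalescable(np.eye(2), g, p))             # True: f deterministic
print(is_ae_coalescable(np.full((2, 2), 0.5), g, p))  # False: the intermediate y is not recoverable
```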
Theorem 2
(Functoriality of Conditional Entropy). Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ be a composable pair of morphisms in FinPS. Then
$$H(g\circ f\,|\,p) = H(f\,|\,p) + H(g\,|\,q) \tag{3}$$
holds if and only if there exists a mediator $Z\times X\xrightarrow{h}Y$ for $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$.
We first prove two lemmas.
Lemma 3.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ be a pair of composable morphisms. Then
$$H\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\,\big|\,p\big) = H(g\,|\,q) + H(f\,|\,p).$$
In particular, $H(g\circ f\,|\,p) = H(g\,|\,q) + H(f\,|\,p)$ if and only if $H(g\circ f\,|\,p) = H\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\,\big|\,p\big)$.
Proof of Lemma 3.
On components, $\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\big)_{(z,y)x} = g_{zy}f_{yx}$. Hence,
$$\begin{aligned}
H\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\,\big|\,p\big) &= -\sum_x p_x\sum_{y,z} g_{zy}f_{yx}\log(g_{zy}f_{yx})\\
&= -\sum_{x,y,z} p_x g_{zy}f_{yx}\log(g_{zy}) - \sum_{x,y,z} p_x g_{zy}f_{yx}\log(f_{yx})\\
&= -\sum_{y,z} q_y g_{zy}\log(g_{zy}) - \sum_{x,y} p_x f_{yx}\log(f_{yx})\\
&= H(g\,|\,q) + H(f\,|\,p).
\end{aligned}$$
Note that this equality still holds if $g_{zy} = 0$ or $f_{yx} = 0$, as each step in this calculation accounted for such possibilities. □
Lemma 4.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ be a pair of composable morphisms in FinPS. Then
$$0\;\leq\; H\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\,\big|\,p\big) - H(g\circ f\,|\,p) = \sum_{x\in X\setminus N_p}\;\sum_{y\in Y\setminus N_{f_x}}\;\sum_{z\in Z\setminus N_{g_y}} p_x g_{zy}f_{yx}\log\!\left(\frac{\sum_{y'\in Y} g_{zy'}f_{y'x}}{g_{zy}f_{yx}}\right). \tag{4}$$
Note that the order of the sums matters in this expression, and also note that it is always well-defined since $g_{zy}f_{yx}\neq 0$ implies $(g\circ f)_{zx}\neq 0$.
Proof of Lemma 4.
For convenience, temporarily set $D := H\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\,\big|\,p\big) - H(g\circ f\,|\,p)$. Then
$$\begin{aligned}
D &= -\sum_{x,y,z} p_x g_{zy}f_{yx}\log(g_{zy}f_{yx}) + \sum_{x,y,z} p_x g_{zy}f_{yx}\log\big((g\circ f)_{zx}\big)\\
&= \sum_{x\in X\setminus N_p}\;\sum_{y\in Y\setminus N_{f_x}}\;\sum_{z\in Z\setminus N_{g_y}} p_x g_{zy}f_{yx}\log\!\left(\frac{(g\circ f)_{zx}}{g_{zy}f_{yx}}\right),
\end{aligned}$$
which proves the claim due to the definition of the composition of stochastic maps. □
Proof of Theorem 2.
Temporarily set $D := H\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\,\big|\,p\big) - H(g\circ f\,|\,p)$. In addition, note that the set of all $x\in X\setminus N_p$ and $z\in Z\setminus N_{(g\circ f)_x}$ can be given a more explicit description in terms of the joint distribution $s := \gamma\circ\vartheta(g\circ f) : \bullet\rightsquigarrow Z\times X$ associated with the composite $g\circ f$ and prior p, namely $s_{(z,x)} := (g\circ f)_{zx}p_x$. Then,
$$\big\{(z,x) \,:\, x\in X\setminus N_p,\; z\in Z\setminus N_{(g\circ f)_x}\big\} = (Z\times X)\setminus N_s. \tag{5}$$
(⇒) Suppose $D = 0$, which is equivalent to Equation (3) by Lemma 3. Then since each term in the sum from Lemma 4 is non-negative,
$$0 = p_x g_{zy}f_{yx}\log\!\left(\frac{\sum_{y'\in Y} g_{zy'}f_{y'x}}{g_{zy}f_{yx}}\right)\qquad\text{for all } x\in X\setminus N_p,\; y\in Y\setminus N_{f_x},\; z\in Z\setminus N_{g_y}.$$
Hence, fix such an $x\in X\setminus N_p$, $y\in Y\setminus N_{f_x}$, $z\in Z\setminus N_{g_y}$. The expression here vanishes if and only if
$$g_{zy}f_{yx} = (g\circ f)_{zx},\quad\text{i.e.,}\quad g_{zy'}f_{y'x} = 0\;\text{ for all } y'\in Y\setminus\{y\}. \tag{6}$$
Hence, for every $x\in X\setminus N_p$ and $z\in Z\setminus N_{(g\circ f)_x}$, there exists a unique $y\in Y$ such that $g_{zy}f_{yx}\neq 0$. But by (5), this means that for every $(z,x)\in(Z\times X)\setminus N_s$, there exists a unique $y\in Y$ such that $g_{zy}f_{yx}\neq 0$. This defines a function $(Z\times X)\setminus N_s\to Y\setminus N_q$ which can be extended in an s-a.e. unique manner to a function $Z\times X\xrightarrow{h}Y$.
We now show the function h is in fact a mediator for the composable pair $(f,g)$. The required equality (2) clearly holds if $x\in N_p$, since both sides vanish. Hence, suppose that $x\in X\setminus N_p$. Given $y\in Y$, $z\in Z$, the left-hand-side of (2) is given by
$$\delta_{y\,h(z,x)}\,(g\circ f)_{zx}\,p_x = \begin{cases} g_{zy}f_{yx}p_x & \text{if } z\in Z\setminus N_{(g\circ f)_x} \text{ and } y = h(z,x)\\ 0 & \text{otherwise}\end{cases}$$
by Equation (6). Similarly, if $x\in X\setminus N_p$ and $z\in N_{(g\circ f)_x}$, then $g_{zy}f_{yx} = 0$ for all $y\in Y$, because otherwise $(g\circ f)_{zx}p_x$ would be nonzero. If instead $z\in Z\setminus N_{(g\circ f)_x}$, then $g_{z\,h(z,x)}f_{h(z,x)\,x}\neq 0$ and $g_{zy}f_{yx} = 0$ for all $y\in Y\setminus\{h(z,x)\}$ by (6). Therefore, (2) holds.
(⇐) Conversely, suppose a mediator h exists and let $X\overset{k}{\rightsquigarrow}Z\times Y$ be the stochastic map given on components by $k_{(z,y)x} := h_{y(z,x)}(g\circ f)_{zx}$. Then
$$\begin{aligned}
H(g\,|\,q) + H(f\,|\,p) &= H\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\,\big|\,p\big) &&\text{by Lemma 3}\\
&= H(k\,|\,p) &&\text{by Proposition 4 item (c)}\\
&= -\sum_{x\in X\setminus N_p}\;\sum_{z\in Z\setminus N_{(g\circ f)_x}}\;\sum_{y\in Y} h_{y(z,x)}(g\circ f)_{zx}\,p_x\log\big(h_{y(z,x)}(g\circ f)_{zx}\big)\\
&= -\sum_{(z,x)\in(Z\times X)\setminus N_s}\;\sum_{y\in Y}\delta_{y\,h(z,x)}(g\circ f)_{zx}\,p_x\log\big(\delta_{y\,h(z,x)}(g\circ f)_{zx}\big)\\
&= -\sum_{(z,x)\in(Z\times X)\setminus N_s}(g\circ f)_{zx}\,p_x\log\big((g\circ f)_{zx}\big)\\
&= H(g\circ f\,|\,p),
\end{aligned}$$
as desired. □
Corollary 1
(Functoriality of Conditional Information Loss). Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ be a composable pair of morphisms in FinPS. Then $K(g\circ f) = K(f) + K(g)$ if and only if there exists a mediator $Z\times X\xrightarrow{h}Y$ for the pair $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$.
Proof. 
Since the Shannon entropy difference is always functorial, the conditional information loss is functorial on a pair of morphisms if and only if the conditional entropy is functorial on that pair. Theorem 2 then completes the proof. □
Example 2.
In the notation of Theorem 2, suppose that f is a.e. deterministic, which means $f_{yx} = \delta_{y\,f(x)}$ for all $x\in X\setminus N_p$ for some function f (abusive notation is used). In this case, the deviation from functoriality, (4), simplifies to
$$H\big((g\times\mathrm{id}_Y)\circ\Delta_Y\circ f\,\big|\,p\big) - H(g\circ f\,|\,p) = \sum_{x\in X\setminus N_p}\sum_{z\in Z} p_x\,g_{z\,f(x)}\log\!\left(\frac{g_{z\,f(x)}}{g_{z\,f(x)}}\right) = 0.$$
Therefore, if f is p-a.e. deterministic, $H(g\,|\,q) + H(f\,|\,p) = H(g\circ f\,|\,p)$. In this case, the mediator $Z\times X\xrightarrow{h}Y$ is given by $h := \;!_Z\times f$.
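Numerically (with hypothetical matrices), Theorem 2 manifests as exact additivity of conditional entropy when the first map is deterministic, and strict subadditivity for a pair admitting no mediator:

```python
import numpy as np

def H(v):
    v = v[v > 0]
    return -np.sum(v * np.log(v))

def cond_entropy(f, p):
    return sum(p[x] * H(f[:, x]) for x in range(len(p)))

p = np.array([0.5, 0.5])
g = np.array([[0.9, 0.2],
              [0.1, 0.8]])

f_det = np.eye(2)                         # deterministic first step (Example 2)
q = f_det @ p
assert np.isclose(cond_entropy(g @ f_det, p),
                  cond_entropy(f_det, p) + cond_entropy(g, q))

f_rnd = np.array([[0.5, 0.3],             # genuinely stochastic first step, no mediator
                  [0.5, 0.7]])
q = f_rnd @ p
assert cond_entropy(g @ f_rnd, p) < cond_entropy(f_rnd, p) + cond_entropy(g, q)
```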
Definition 18.
A pair $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ of composable morphisms in FinPS is called a.e. coalescable if and only if $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ admits a mediator $Z\times X\xrightarrow{h}Y$. Similarly, a pair $X\overset{f}{\rightsquigarrow}Y\overset{g}{\rightsquigarrow}Z$ of composable morphisms in FinStoch is called coalescable iff $X\overset{f}{\rightsquigarrow}Y\overset{g}{\rightsquigarrow}Z$ admits a strong mediator $Z\times X\xrightarrow{h}Y$.
Remark 5.
Example 2 showed that if $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ is p-a.e. deterministic, then the pair $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ is a.e. coalescable for any g. In particular, every pair of composable morphisms in FinPD is coalescable.
In light of Theorem 2 and Corollary 1, we make the following definition, which will serve as one of the axioms in our later characterizations of both conditional information loss and conditional entropy.
Definition 19.
A function $F : \mathbf{FinPS}\to B\mathbb{R}$ is said to be semi-functorial iff $F(g\circ f) = F(g) + F(f)$ for every a.e. coalescable pair $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ in FinPS.
Example 3.
By Theorem 2 and Corollary 1, conditional information loss and conditional entropy are both semi-functorial.
Proposition 5.
Suppose $F : \mathbf{FinPS}\to B\mathbb{R}$ is semi-functorial. Then the restriction of F to FinPD is functorial. In particular, if F is, in addition, convex linear, continuous, and reductive, then F is a non-negative multiple of conditional information loss.
Proof. 
By Example 2, every pair of composable morphisms in FinPD is a.e. coalescable. Therefore, F is functorial on FinPD . The second claim then follows from Proposition 3. □
The following lemma will be used in later sections and serves to illustrate some examples of a.e. coalescable pairs.
Lemma 5.
Let $(W,s)\overset{e}{\rightsquigarrow}(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ be a triple of composable morphisms with e deterministic and g invertible. Then each of the following pairs is a.e. coalescable:
(i) 
$(W,s)\overset{e}{\rightsquigarrow}(X,p)\overset{f}{\rightsquigarrow}(Y,q)$
(ii) 
$(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$
(iii) 
$(W,s)\overset{e}{\rightsquigarrow}(X,p)\overset{g\circ f}{\rightsquigarrow}(Z,r)$
(iv) 
$(W,s)\overset{f\circ e}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$
Proof. 
The proof that $(W,s)\overset{e}{\rightsquigarrow}(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ is a.e. coalescable was provided (in a stronger form) in Example 2. To see that $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ is a.e. coalescable, note that since g is an isomorphism we have $(g\circ f)_{zx} = g_{z\,g^{-1}(z)}\,f_{g^{-1}(z)\,x}$. Thus, $Z\times X\xrightarrow{g^{-1}\times\,!_X}Y\times\bullet\cong Y$ is a mediator for this pair, so the pair is a.e. coalescable. The last two claims follow from the proofs of the first two claims. □

6. Bayesian Inversion

In this section, we recall the concepts of a.e. equivalence and Bayesian inversion phrased in a categorical manner [2,3,9], as they will play a significant role moving forward.
Definition 20.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ and $(X,p)\overset{g}{\rightsquigarrow}(Y,q)$ be two morphisms in FinPS with the same source and target. Then f and g are said to be almost everywhere equivalent (or p-a.e. equivalent) if and only if $f_{yx} = g_{yx}$ for every $x\in X$ with $p_x\neq 0$. In such a case, the p-a.e. equivalence of f and g will be denoted $f \underset{p}{=} g$.
Theorem 3
(Bayesian Inversion [2,9,10]). Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be a morphism in FinPS. Then there exists a morphism $(Y,q)\overset{\bar f}{\rightsquigarrow}(X,p)$ such that $\bar f_{xy}q_y = f_{yx}p_x$ for all $x\in X$ and $y\in Y$. Furthermore, for any other morphism $(Y,q)\overset{\bar f'}{\rightsquigarrow}(X,p)$ satisfying this condition, $\bar f' \underset{q}{=} \bar f$.
Definition 21.
The morphism $(Y,q)\overset{\bar f}{\rightsquigarrow}(X,p)$ appearing in Theorem 3 will be referred to as a Bayesian inverse of $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$. It follows that $\bar f_{xy} = p_x f_{yx}/q_y$ for all $y\in Y$ with $q_y\neq 0$.
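A minimal sketch of Bayesian inversion for column-stochastic matrices, using the formula above; the values on the null set of q are arbitrary (here uniform, as in Remark 6 below).

```python
import numpy as np

def bayesian_inverse(f, p):
    """Given f[y, x] = f_{yx} and a prior p on X, return fbar[x, y] = p_x f_{yx} / q_y."""
    q = f @ p
    nX = p.shape[0]
    fbar = np.full((nX, q.shape[0]), 1.0 / nX)     # arbitrary choice where q_y = 0
    for y in np.nonzero(q > 0)[0]:
        fbar[:, y] = p * f[y, :] / q[y]
    return fbar

f = np.array([[0.7, 0.2],
              [0.3, 0.8]])
p = np.array([0.5, 0.5])
q = f @ p
fbar = bayesian_inverse(f, p)
assert np.allclose(fbar * q, (f * p).T)            # Bayes' rule: fbar_{xy} q_y = f_{yx} p_x
```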
Proposition 6.
Bayesian inversion satisfies the following properties.
(i) 
Suppose $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ and $(X,p)\overset{g}{\rightsquigarrow}(Y,q)$ are p-a.e. equivalent, and let $\bar f$ and $\bar g$ be Bayesian inverses of f and g, respectively. Then $\bar f \underset{q}{=} \bar g$.
(ii) 
Given two morphisms $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ and $(Y,q)\overset{g}{\rightsquigarrow}(X,p)$ in FinPS, then f is a Bayesian inverse of g if and only if g is a Bayesian inverse of f.
(iii) 
Let $(Y,q)\overset{\bar f}{\rightsquigarrow}(X,p)$ be a Bayesian inverse of $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$, and let $\gamma : X\times Y\to Y\times X$ be the swap map (as in Definition 4). Then $\vartheta(f) = \gamma\circ\vartheta(\bar f)$.
(iv) 
Let $(f,g)$ be a composable pair of morphisms in FinPS, and suppose $\bar f$ and $\bar g$ are Bayesian inverses of f and g, respectively. Then $(\bar g,\bar f)$ is a composable pair, and $\bar f\circ\bar g$ is a Bayesian inverse of $g\circ f$.
Proof. 
These are immediate consequences of the categorical definition of a Bayesian inverse (see [3,10,11] for proofs). □
Definition 22.
A contravariant function $B : \mathbf{FinPS}\to\mathbf{FinPS}$ is said to be a Bayesian inversion functor if and only if B acts as the identity on objects and $B(f)$ is a Bayesian inverse of f for all morphisms f in FinPS.
This is mildly abusive terminology since functoriality only holds in the a.e. sense, as explained in the following remark.
Remark 6.
A Bayesian inversion functor exists. Given any $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$, set $Y\overset{\bar f}{\rightsquigarrow}X$ to be given by $\bar f_{xy} = p_x f_{yx}/q_y$ for all $y\in Y$ with $q_y\neq 0$ and $\bar f_{xy} = 1/|X|$ for all $y\in Y$ with $q_y = 0$. Note that this does not define a functor. Indeed, if $(X,p)$ is a probability space with $p_{x_0} = 0$ for some $x_0\in X$, then $(\overline{\mathrm{id}_X})_{x_0}$ is the uniform measure on X instead of the Dirac delta measure concentrated on $x_0$. In other words, $\overline{\mathrm{id}_X}\neq\mathrm{id}_X$. Similar issues of measure zero occur, indicating that $\overline{g\circ f}\neq\bar f\circ\bar g$ for a composable pair of morphisms $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$. Nevertheless, Bayesian inversion is a.e. functorial in the sense that $\overline{g\circ f}\underset{r}{=}\bar f\circ\bar g$ and $\overline{\mathrm{id}_{(X,p)}}\underset{p}{=}\mathrm{id}_{(X,p)}$.
Corollary 2.
$B^2(f)\underset{p}{=}f$ for any Bayesian inversion functor B and every $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ in FinPS.
Proposition 7.
Let B be a Bayesian inversion functor on FinPS (as in Definition 22). Then B is a.e. convex linear in the sense that
$$B\Big(\bigoplus_{x\in X} p_x Q_x\Big)\;\underset{q'}{=}\;\bigoplus_{x\in X} p_x B(Q_x),$$
where $q' := \bigoplus_{x\in X} p_x q'_x$ and the other notation is as in Definition 8.
Proof. 
First note that it is immediate that B is convex linear on objects since Bayesian inversion acts as the identity on objects. Let $p : \bullet\rightsquigarrow X$ be a probability measure, let $(Y_x,q_x)\overset{Q_x}{\rightsquigarrow}(Y'_x,q'_x)$ be a collection of morphisms in FinPS indexed by X, and suppose B is a Bayesian inversion functor. Then for $(z',z)\in Y'_x\times Y_x$ with $p_x(q'_x)_{z'}\neq 0$, we have
$$B\Big(\bigoplus_{x\in X} p_x Q_x\Big)_{zz'} = \frac{p_x(q_x)_z\,\Big(\bigoplus_{x\in X} p_x Q_x\Big)_{z'z}}{p_x(q'_x)_{z'}} = \frac{(q_x)_z\,(Q_x)_{z'z}}{(q'_x)_{z'}} = B(Q_x)_{zz'} = \Big(\bigoplus_{x\in X} p_x B(Q_x)\Big)_{zz'}.$$
Thus, B is a.e. convex linear. □
Proposition 8.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ be a composable pair in FinPS, and let $\bar f$ and $\bar g$ be Bayesian inverses of f and g, respectively. Then $(f,g)$ is a.e. coalescable if and only if $(\bar g,\bar f)$ is a.e. coalescable.
Proof. 
Since Bayesian inversion is a dagger functor on a.e. equivalence classes ([3], (Remark 13.10)), it suffices to prove one direction in this claim. Hence, suppose $(f,g)$ is a.e. coalescable and let h be a mediator function realizing this. Then $h\circ\gamma$ is a mediator for $(\bar g,\bar f)$ because
$$\bar f_{xy}\,\bar g_{yz}\,r_z = \bar f_{xy}\,g_{zy}\,q_y = g_{zy}\,f_{yx}\,p_x = h_{y(z,x)}(g\circ f)_{zx}\,p_x = h_{y(z,x)}\,\overline{(g\circ f)}_{xz}\,r_z = (h\circ\gamma)_{y(x,z)}\,\overline{(g\circ f)}_{xz}\,r_z.$$
A completely string-diagrammatic proof is provided in Appendix B. □
The following proposition is a reformulation of the conditional entropy identity H ( Y | X ) + H ( X ) = H ( X | Y ) + H ( Y ) in terms of Bayesian inversion.
Proposition 9.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be a morphism in FinPS, and suppose $\bar f$ is a Bayesian inverse of f. Then
$$H(f\,|\,p) + H(p) = H(\bar f\,|\,q) + H(q). \tag{7}$$
Proof. 
This follows from the fact that both sides of (7) are equal to H ( ϑ ( f ) ) . □
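As a quick numerical sanity check of (7) on random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nY = 3, 4
p = rng.dirichlet(np.ones(nX))
f = rng.dirichlet(np.ones(nY), size=nX).T          # columns f[:, x] are distributions on Y
q = f @ p
fbar = (f * p).T / q                               # Bayesian inverse (q is strictly positive here)

def H(v):
    v = v[v > 0]
    return -np.sum(v * np.log(v))

H_f_p = sum(p[x] * H(f[:, x]) for x in range(nX))
H_fbar_q = sum(q[y] * H(fbar[:, y]) for y in range(nY))
assert np.isclose(H_f_p + H(p), H_fbar_q + H(q))   # identity (7)
```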
Proposition 9 implies Bayesian inversion takes conditional entropy to conditional information loss and vice versa, which is formally stated as follows.
Corollary 3.
Let $K : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ and $H : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ be given by conditional information loss and conditional entropy, respectively, and let $B : \mathbf{FinPS}\to\mathbf{FinPS}$ be a Bayesian inversion functor. Then, $H = K\circ B$ and $K = H\circ B$.
Remark 7.
If $(X,p)\xrightarrow{f}(Y,q)$ is a deterministic morphism in FinPS, Baez, Fritz, and Leinster point out that the information loss of f is in fact the conditional entropy of x given y [1]. Here, we see this duality as a special case of Corollary 3 applied to deterministic morphisms.

7. Bloom-Shriek Factorization

We now introduce a simple, but surprisingly useful, factorization for every morphism in FinPS , and we use it to prove some essential lemmas for our characterization theorems for conditional information loss and conditional entropy, which appear in the following sections.
Definition 23.
Given a stochastic map $X\overset{f}{\rightsquigarrow}Y$, the bloom of f is the stochastic map $X\overset{¡_f}{\rightsquigarrow}X\times Y$ given by the composite $X\xrightarrow{\Delta_X}X\times X\overset{\mathrm{id}_X\times f}{\rightsquigarrow}X\times Y$, and the shriek of f is the deterministic map $X\times Y\xrightarrow{!_f}X$ given by the projection $\pi_X$.
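A small sketch (not from the paper) of the bloom as an explicit stochastic matrix, together with the reading $f = \pi_Y\circ ¡_f$ of the bloom-shriek factorization from the introduction; rows of the bloom are indexed by pairs $(x',y)$ in lexicographic order.

```python
import numpy as np

def bloom(f):
    nY, nX = f.shape
    b = np.zeros((nX * nY, nX))
    for x in range(nX):
        for y in range(nY):
            b[x * nY + y, x] = f[y, x]             # (bloom of f)_{(x', y) x} = delta_{x' x} f_{yx}
    return b

f = np.array([[0.7, 0.2],
              [0.3, 0.8]])
p = np.array([0.4, 0.6])

joint = bloom(f) @ p                               # theta(f), the joint distribution on X x Y
pi_Y = np.tile(np.eye(f.shape[0]), f.shape[1])     # projection X x Y -> Y as a 0/1 matrix

assert np.allclose(pi_Y @ bloom(f), f)             # f = pi_Y . bloom(f)
assert np.allclose(pi_Y @ joint, f @ p)            # pushing theta(f) along pi_Y recovers q
```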
Proposition 10.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be a morphism in FinPS. Then the following statements hold.
(i) 
The composite $(X,p)\overset{¡_f}{\rightsquigarrow}(X\times Y,\vartheta(f))\xrightarrow{!_f}(X,p)$ is equal to the identity $\mathrm{id}_X$.
(ii) 
The morphism f equals the composite $(X,p)\overset{¡_f}{\rightsquigarrow}(X\times Y,\vartheta(f))\xrightarrow{!_{\bar f}\circ\gamma}(Y,q)$, where $\bar f$ denotes any Bayesian inverse of f and $\gamma : X\times Y\to Y\times X$ is the swap map.
(iii) 
The pair $(X,p)\overset{¡_f}{\rightsquigarrow}(X\times Y,\vartheta(f))\xrightarrow{!_{\bar f}\circ\gamma\,=\,\pi_Y}(Y,q)$ is coalescable.
Definition 24.
The decomposition in item (ii) of Proposition 10 will be referred to as the bloom-shriek factorization of f.
Proof of Proposition 10.
Element-wise proofs are left as exercises. Appendix B contains an abstract proof using string diagrams in Markov categories. □
The bloom of f can be expressed as a convex combination of simpler morphisms up to isomorphism. To describe this and its behavior under convex linear semi-functors, we introduce the notion of an invariant and examine some of its properties.
Definition 25.
A function $F : \mathbf{FinPS}\to B\mathbb{R}$ is said to be an invariant if and only if for every triple of composable morphisms $(W,s)\overset{e}{\rightsquigarrow}(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ such that e and g are isomorphisms, we have $F(f) = F(g\circ f\circ e)$.
Lemma 6.
If a function $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is semi-functorial, then F is an invariant.
Proof. 
Consider a composable triple $(W,s)\overset{e}{\rightsquigarrow}(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ such that e and g are isomorphisms. Then
$$F(g\circ f\circ e) = F\big(g\circ(f\circ e)\big) = F(g) + F(f\circ e) = F(g) + F(f) + F(e)$$
by Lemma 5. Secondly, since g and e are isomorphisms, and since the pairs $(g,g^{-1})$ and $(e,e^{-1})$ are coalescable, $F(\mathrm{id}) = F(g\circ g^{-1}) = F(g) + F(g^{-1})$. But since $F(\mathrm{id}) = 0$ (by semi-functoriality), this requires that $F(g) = 0$ for an isomorphism g, since $F(g)\geq 0$ and $F(g^{-1})\geq 0$. The same is true for $F(e)$. Hence, $F(g\circ f\circ e) = F(f)$. □
Lemma 7.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be a morphism in FinPS, and suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is semi-functorial and convex linear. Then the following statements hold.
(i) 
$F(f) = F(!_{\bar f}) + F(¡_f)$
(ii) 
$F(¡_f) = \sum_{x\in X} p_x F(¡_{f_x})$
(iii) 
$F(!_f) = \sum_{x\in X} p_x F(!_{f_x})$
Proof. 
For item (i), we have
$$\begin{aligned}
F(f) &= F\big((!_{\bar f}\circ\gamma)\circ ¡_f\big) &&\text{by item (ii) of Proposition 10}\\
&= F(!_{\bar f}\circ\gamma) + F(¡_f) &&\text{by item (iii) of Proposition 10}\\
&= F(!_{\bar f}) + F(¡_f) &&\text{by Lemma 6}.
\end{aligned}$$
For items (ii) and (iii), note that $¡_f$ and $!_f$ can be expressed as composites of isomorphisms and certain convex combinations, namely
[Diagram (8), displayed as an image in the original: up to pre- and post-composition with isomorphisms, $¡_f$ is the convex sum $\bigoplus_{x\in X} p_x\,¡_{f_x}$ and $!_f$ is the convex sum $\bigoplus_{x\in X} p_x\,!_{f_x}$.]
Hence,
$$F(¡_f) = F\Big(\textstyle\bigoplus_{x\in X} p_x\,¡_{f_x}\Big) = \sum_{x\in X} p_x F(¡_{f_x})\qquad\text{and}\qquad F(!_f) = F\Big(\textstyle\bigoplus_{x\in X} p_x\,!_{f_x}\Big) = \sum_{x\in X} p_x F(!_{f_x}),$$
where the first equality in each chain follows from (8) together with Lemma 6 (invariance under composition with isomorphisms), and the second since F is convex linear. □
Proposition 11.
Suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is semi-functorial and convex linear. If $f,g : (X,p)\rightsquigarrow(Y,q)$ are two morphisms in FinPS such that $f\underset{p}{=}g$, then $F(f) = F(g)$.
Proof. 
Suppose $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ and $(X,p)\overset{g}{\rightsquigarrow}(Y,q)$ are such that $f\underset{p}{=}g$, and let $\bar f$ and $\bar g$ be Bayesian inverses for f and g. Then
$$\begin{aligned}
F(f) &= F(!_{\bar f}) + F(¡_f) &&\text{by item (i) of Lemma 7}\\
&= \sum_{y\in Y} q_y F(!_{\bar f_y}) + \sum_{x\in X} p_x F(¡_{f_x}) &&\text{by items (ii) and (iii) of Lemma 7}\\
&= \sum_{y\in Y} q_y F(!_{\bar g_y}) + \sum_{x\in X} p_x F(¡_{g_x}) &&\text{since } f\underset{p}{=}g \text{ and } \bar f\underset{q}{=}\bar g\\
&= F(!_{\bar g}) + F(¡_g) &&\text{by items (ii) and (iii) of Lemma 7}\\
&= F(g) &&\text{by item (i) of Lemma 7},
\end{aligned}$$
as desired. □

8. An Intrinsic Characterization of Conditional Information Loss

Theorem 4.
Suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is a function satisfying the following conditions.
1. 
F is semi-functorial.
2. 
F is convex linear.
3. 
F is continuous.
4. 
$F(¡_p) = 0$ for every probability distribution $p : \bullet\rightsquigarrow X$.
Then F is a non-negative multiple of conditional information loss. Conversely, conditional information loss satisfies conditions 1–4.
Proof. 
Suppose F satisfies conditions 1–4, let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be an arbitrary morphism in FinPS, and let $\gamma : Y\times X\to X\times Y$ be the swap map, so that $!_{\bar f} = \pi_Y\circ\gamma$. Then
$$\begin{aligned}
F(f) &= F(!_{\bar f}) + F(¡_f) &&\text{by item (i) of Lemma 7}\\
&= F(!_{\bar f}) + \sum_{x\in X} p_x F(¡_{f_x}) &&\text{by item (ii) of Lemma 7}\\
&= F(!_{\bar f}) &&\text{by condition 4}\\
&= F(\pi_Y\circ\gamma) = F(\pi_Y) &&\text{since } F \text{ is an invariant by Lemma 6}.
\end{aligned}$$
Thus, F is reductive (see Definition 16) and Proposition 5 applies. □
Remark 8.
Under the assumption that $F : \mathbf{FinPS}\to B\mathbb{R}$ is semi-functorial and convex linear, one may show F satisfies condition 4 in Theorem 4 if and only if F is reductive (see Definition 16 and Proposition 5). While the reductive axiom specifies how the semi-functor acts on all morphisms in FinPS, condition 4 in Theorem 4 only specifies how it acts on morphisms from the initial object. This gives not just a simple mathematical criterion, but one with a simple intuitive interpretation as well. Namely, condition 4 says that if a process begins with no prior information, then there is no information to be lost in the process.
We now use Theorem 4 and Bayesian inversion to prove a statement dual to Theorem 4.
Theorem 5.
Suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is a function satisfying the following conditions.
1. 
F is semi-functorial.
2. 
F is convex linear.
3. 
F is continuous.
4. 
$F(!_p) = 0$ for every probability distribution $p : \bullet\rightsquigarrow X$.
Then F is a non-negative multiple of conditional entropy. Conversely, conditional entropy satisfies conditions 1–4.
Before giving a proof, we introduce some terminology and prove a few lemmas. We also would like to point out that condition 4 may be given an operational interpretation as follows: if a communication channel has a constant output, then it has no conditional entropy.
Definition 26.
Let $F : \mathbf{FinPS}\to B\mathbb{R}$ be a function and let B be a Bayesian inversion functor. Then $\overline F := F\circ B$ will be referred to as a Bayesian reflection of F.
Remark 9.
By Proposition 11, if $F : \mathbf{FinPS}\to B\mathbb{R}$ is a convex linear semi-functor, then a Bayesian reflection is independent of the choice of a Bayesian inversion functor, and as such, is necessarily unique.
Lemma 8.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be a morphism in FinPS, suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is a convex linear semi-functor, and let $\bar f$ be a Bayesian inverse of f. Then $\overline F(f) = F(\bar f)$.
Proof of Lemma 8.
Let B be a Bayesian inversion functor, so that $B(f)\underset{q}{=}\bar f$. Then $\overline F(f) = F(B(f)) = F(\bar f)$, where the last equality follows from Proposition 11. □
Lemma 9.
Let $B : \mathbf{FinPS}\to\mathbf{FinPS}$ be a Bayesian inversion functor and let $(X_n,p_n)\overset{f_n}{\rightsquigarrow}(Y_n,q_n)$ be a sequence of morphisms in FinPS converging to $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$. Then $\lim_{n\to\infty}B(f_n)\underset{q}{=}B(f)$.
Proof of Lemma 9.
Set $f^{(n)} := f_n$. For all $y\in Y$ with $q_y\neq 0$, we have
$$\lim_{n\to\infty}B\big(f^{(n)}\big)_{xy} = \lim_{n\to\infty}\frac{p^{(n)}_x f^{(n)}_{yx}}{q^{(n)}_y} = \frac{p_x f_{yx}}{q_y} = B(f)_{xy}. \;\;\square$$
Lemma 10.
Suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is a function satisfying conditions 1–4 of Theorem 5. Then the Bayesian reflection $\overline F$ is a non-negative multiple of conditional information loss.
Proof of Lemma 10.
We show $\overline F$ satisfies conditions 1–4 of Theorem 4. Throughout the proof, let B denote a Bayesian inversion functor, so that $\overline F = F\circ B$.
Semi-functoriality: Suppose $(X,p)\overset{f}{\rightsquigarrow}(Y,q)\overset{g}{\rightsquigarrow}(Z,r)$ is an a.e. coalescable pair of composable morphisms in FinPS. Then
$$\begin{aligned}
\overline F(g\circ f) &= F\big(\overline{g\circ f}\big) &&\text{by Lemma 8}\\
&= F(\bar f\circ\bar g) &&\text{by item (iv) of Proposition 6}\\
&= F(\bar f) + F(\bar g) &&\text{by Proposition 8}\\
&= \overline F(f) + \overline F(g) &&\text{by Lemma 8}.
\end{aligned}$$
Thus, F ¯ is semi-functorial.
Convex Linearity: Given any probability space $(X,p)$ and a family of morphisms $(Y_x,q_x)\overset{Q_x}{\rightsquigarrow}(Y'_x,q'_x)$ in FinPS indexed by X,
$$\begin{aligned}
\overline F\Big(\bigoplus_{x\in X} p_x Q_x\Big) &= F\Big(\bigoplus_{x\in X} p_x B(Q_x)\Big) &&\text{by Propositions 7 and 11}\\
&= \sum_{x\in X} p_x F\big(B(Q_x)\big) &&\text{since } F \text{ is convex linear}\\
&= \sum_{x\in X} p_x\overline F(Q_x) &&\text{by definition of }\overline F.
\end{aligned}$$
Thus, F ¯ is convex linear.
Continuity: This follows from Lemma 9 and Proposition 11.
$\overline F(¡_p) = 0$ for every probability distribution $p : \bullet\rightsquigarrow X$: This follows from Lemma 8, since $!_p$ is the unique Bayesian inverse of $¡_p$. □
Proof of Theorem 5.
Suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is a function satisfying conditions 1–4 of Theorem 5, and let B be a Bayesian inversion functor. Since F is semi-functorial and convex linear, it follows from Proposition 11 that $F = \overline F\circ B$, and by Lemma 10 it follows that $\overline F = cK$ for some constant $c\geq 0$. We then have $F = \overline F\circ B = cK\circ B = cH$. Thus, F is a non-negative multiple of conditional entropy. □

9. A Bayesian Characterization of Conditional Entropy

We now prove a reformulation of Theorem 5, where condition 4 is replaced by a condition that we view as an ‘entropic Bayes’ rule’.
Definition 27.
A function $F : \mathbf{FinPS}\to B\mathbb{R}$ satisfies an entropic Bayes’ rule if and only if
$$F(f) + F(¡_p) = F(\bar f) + F(¡_q)$$
for every morphism $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ in FinPS and any Bayesian inverse $\bar f$ of f.
Remark 10.
The entropic Bayes’ rule is an abstraction of the conditional entropy identity (7).
Theorem 6
(A Bayesian characterization of conditional entropy). Suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is a function satisfying the following conditions.
1. 
F is semi-functorial.
2. 
F is convex linear.
3. 
F is continuous.
4. 
F satisfies an entropic Bayes’ rule.
Then F is a non-negative multiple of conditional entropy. Conversely, conditional entropy satisfies conditions 1–4.
Proof. 
By Theorem 5, it suffices to show $F(!_p) = 0$ for every object $(X,p)$ in FinPS. For this, first note that $¡_1 = ¡_1\circ ¡_1$, where $(\bullet,1)$ is the point-mass distribution on a single point. Since F is semi-functorial and the pair $(¡_1,¡_1)$ is coalescable, we have $F(¡_1) = F(¡_1) + F(¡_1)$, which implies $F(¡_1) = 0$. Applying the entropic Bayes’ rule from Definition 27 to the morphism $!_p : (X,p)\to(\bullet,1)$ yields
$$F(!_p) + F(¡_p) = F(¡_p) + F(¡_1)\quad\implies\quad F(!_p) = F(¡_1) = 0,$$
as desired. □
Remark 11.
In ([6], slide 21), Fritz asked if there is a Markov category for information theory explaining the analogy between Bayes’ rule P ( A | B ) P ( B ) = P ( B | A ) P ( A ) and the conditional entropy identity H ( A | B ) + H ( B ) = H ( B | A ) + H ( A ) . In light of our work, we feel we have an adequate categorical explanation for this analogy, which we now explain.
Let $(X,p)\overset{f}{\rightsquigarrow}(Y,q)$ be an arbitrary morphism in FinPS, and suppose $F : \mathbf{FinPS}\to B\mathbb{R}_{\geq 0}$ is semi-functorial. Then the commutative diagram (cf. Definition A4)
[Diagram (9), displayed as an image in the original: the commutative square expressing $¡_f\circ ¡_p = (\gamma\circ ¡_{\bar f})\circ ¡_q$ as morphisms $(\bullet,1)\rightsquigarrow(X\times Y,\vartheta(f))$.]
is a coalescable square (where γ is the swap map), i.e., the pairs $(¡_p,¡_f)$ and $(¡_q,\gamma\circ ¡_{\bar f})$ are both coalescable. The semi-functoriality of F then implies the identity $F(¡_f) + F(¡_p) = F(¡_{\bar f}) + F(¡_q)$. Now suppose (as in the case of conditional entropy) that F satisfies the further condition that $F(f) = F(¡_f)$. Then the commutativity of (9) and this identity are equivalent to the following two respective equations:
$$\underline{\text{Bayes’ Rule}}:\; f_{yx}p_x = \bar f_{xy}q_y \qquad\qquad \underline{\text{Entropic Bayes’ Rule}}:\; F(f) + F(¡_p) = F(\bar f) + F(¡_q).$$
In the case that $F = H$, where H is the conditional entropy, we have $H(¡_r\,|\,1) = H(r)$ for every object $(Z,r)$ in FinPS (where $H(r)$ is the Shannon entropy). Thus, the entropic Bayes’ rule becomes $H(f\,|\,p) + H(p) = H(\bar f\,|\,q) + H(q)$, which is the classical identity for conditional entropy.

10. Concluding Remarks

In this paper, we have provided novel characterizations of conditional entropy and the information loss of a probability-preserving stochastic map. The constructions we introduced to prove our main results are general enough to be applicable in the recent framework of synthetic probability [3]. By weakening functoriality and finding the appropriate substitute that we call semi-functoriality, we have shown how certain aspects of quantitative information theory can be done in this categorical framework. In particular, we have illustrated how Bayes’ rule can be formulated entropically and used as an axiom in characterizing conditional entropy.
Immediate questions from our work arise, such as the extendibility of conditional entropy to other Markov categories or even quantum Markov categories [10]. Work in this direction might offer a systematic approach towards conditional entropy in quantum mechanics. It would also be interesting to see what other quantitative features of information theory can be described from such a perspective, or if new ones will emerge.

Author Contributions

Both authors contributed equally to this manuscript. Both authors have read and agreed to the published version of the manuscript.

Funding

This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (QUASIFT grant agreement 677368).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank Tobias Fritz, Azeem ul Hassan, Francois Jacopin, Jiaxin Qiao, and Alex Atsushi Takeda for discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Correctable Codes and Conditional Information Loss

In this appendix, we prove that the conditional information loss K ( f ) of a morphism ( X , p ) f ( Y , q ) in FinPS vanishes if and only if f is a disintegration, or equivalently, if and only if f is correctable (cf. Remark A1). Briefly, a disintegration is a particular kind of Bayesian inverse that we define momentarily. This provides an additional interpretation of the conditional information loss, namely as a deviation from correctability.
Definition A1.
Let $(Y,q) \xrightarrow{g} (X,p)$ be a morphism in FinPS. Then $(X,p) \xrightarrow{f} (Y,q)$ is said to be a disintegration of $g$ (or of $(g,q,p)$ for clarity) if and only if $g \circ f = \mathrm{id}_X$ holds $p$-a.e.
Lemma A1.
If f is a disintegration of g, then g is q-a.e. deterministic and f is a Bayesian inverse of g. Conversely, if g is q-a.e. deterministic, then every Bayesian inverse of g is a disintegration of g.
Proof. 
This is proved in a more abstract setting in ([10], Section 8). □
Theorem A1.
Let $(X,p) \xrightarrow{f} (Y,q)$ be a morphism in FinPS. Then $K(f) = 0$ if and only if there exists a map $(Y,q) \xrightarrow{g} (X,p)$ such that $f$ is a disintegration of $g$.
Proof. 
The theorem will be proved by showing the equivalent statement ‘ H ( f ¯ | q ) = 0 if and only if f ¯ is q-a.e. deterministic for some Bayesian inverse f ¯ of f.’ We therefore first justify this as being equivalent to the claim.
First, K ( f ) = 0 holds if and only if H ( f ¯ ) = 0 for some (and hence any) Bayesian inverse f ¯ of f by Corollary 3 and Lemma 8. Second, the statement ‘there exists a g such that f is a disintegration of g’ holds if and only if ‘there exists a g such that f is a Bayesian inverse of g and g is q-a.e. deterministic’ by Lemma A1. However, since Bayesian inverses always exist (Theorem 3), and because Bayesian inversion is symmetric (item (ii) in Proposition 6), this latter statement is equivalent to ‘there exists a q-a.e. deterministic Bayesian inverse g of f.’
Hence, suppose f has a q-a.e. deterministic Bayesian inverse g. Then
$$H(g|q) = \sum_{y \in Y} q_y\, H(g_y) = \sum_{y \in Y \setminus N_q} q_y\, H(g_y) = 0,$$
since the entropy of g y vanishes because it is { 0 , 1 } -valued for all y Y \ N q .
Conversely, suppose $H(\bar{f}) = 0$ for some Bayesian inverse $\bar{f}$ of $f$. Then $H(\bar{f}) = \sum_{y \in Y \setminus N_q} q_y H(\bar{f}_y)$ is a sum of non-negative numbers that vanishes. Hence, $H(\bar{f}_y) = 0$ for all $y \in Y \setminus N_q$. However, since the entropy of a probability measure on a finite set vanishes if and only if the probability measure is $\{0,1\}$-valued, $\bar{f}_y$ is $\{0,1\}$-valued for all $y \in Y \setminus N_q$. Hence, $\bar{f}$ is $q$-a.e. deterministic. □
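To make Theorem A1 tangible, the following sketch computes $H(\bar{f}|q)$ for two hypothetical channels: a deterministic bijection, whose Bayesian inverse is again deterministic, and a noisy channel, whose Bayesian inverse is genuinely random. Since the conditional information loss $K(f)$ vanishes precisely when the conditional entropy of a Bayesian inverse does (as in the proof above), the first printed value is $0$ while the second is strictly positive. The channels and helper functions are illustrative choices and not part of the paper's formalism.

```python
import numpy as np

def shannon(r):
    r = r[r > 0]
    return float(-(r * np.log2(r)).sum())

def cond_entropy(f, p):
    return float(sum(p[x] * shannon(f[:, x]) for x in range(len(p)) if p[x] > 0))

def bayes_inverse(f, p):
    """Return a Bayesian inverse f̄ of f (with respect to p) and the pushforward q."""
    q = f @ p
    f_bar = (f * p).T / np.where(q > 0, q, 1.0)  # columns indexed by y; arbitrary off the support of q
    return f_bar, q

p = np.array([0.5, 0.5])
f_det = np.array([[0., 1.],      # a deterministic bijection: f admits a disintegration
                  [1., 0.]])
f_noisy = np.array([[0.7, 0.4],  # a noisy channel: its Bayesian inverse is genuinely random
                    [0.3, 0.6]])

for f in (f_det, f_noisy):
    f_bar, q = bayes_inverse(f, p)
    print(cond_entropy(f_bar, q))  # 0.0 for f_det, strictly positive for f_noisy
```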
Remark A1
(Vanishing of Conditional Information Loss in Terms of Correctable Codes). The vanishing of the conditional information loss is closely related to the correctability of classical codes. Our references for correctable codes include [12,13], though our particular emphasis in terms of possibilistic maps instead of stochastic maps appears to be new. The correctability of classical codes does not require the datum of a stochastic map, but rather that of a possibilistic map.
A possibilistic map (also called a full relation) from a finite set $X$ to a finite set $Y$ is an assignment $f$ sending $x \in X$ to a nonempty subset $f_x \subseteq Y$. Such a map can also be viewed as a transition kernel $X \xrightarrow{f} Y$ such that $f_{yx} \in \{0,1\}$ for all $x \in X$ and $y \in Y$, and such that for each $x \in X$ there exists a $y \in Y$ with $f_{yx} = 1$. A classical code is a tuple $(A, X, Y, E, N)$ consisting of finite sets $A, X, Y$, an inclusion $A \xrightarrow{E} X$ (the encoding), and a possibilistic map $X \xrightarrow{N} Y$ (the noise). Such a classical code is correctable iff there exists a possibilistic map $Y \xrightarrow{D} A$ (the recovery map) such that $D \circ N \circ E = \mathrm{id}_A$.
Associated with every morphism $(X,p) \xrightarrow{f} (Y,q)$ in FinPS is a classical code given by
$$\bigl(A := X \setminus N_p,\ X,\ Y,\ E := \mathrm{incl},\ N\bigr), \tag{A1}$$
where $X \setminus N_p \xrightarrow{\mathrm{incl}} X$ denotes the usual inclusion and where the noise $N$ is the possibilistic map associated with $f$, defined by
$$N_{yx} := \begin{cases} 1 & \text{if } f_{yx} > 0 \\ 0 & \text{otherwise} \end{cases}$$
as a transition kernel, or equivalently by
$$N_x := \{\, y \in Y : f_{yx} > 0 \,\}$$
as a full relation.
Now, if $K(f) = 0$, then by Theorem A1 there exists a $(Y,q) \xrightarrow{g} (X,p)$ such that $f$ is a disintegration of $(g,q,p)$. Thus, $g \circ f = \mathrm{id}_X$ holds $p$-a.e. Since $A := X \setminus N_p$, the map $g$ restricts to a deterministic map $B \to A$, where $B := \bigcup_{x \in X} N_x \subseteq Y$. Being deterministic, this restriction is in particular possibilistic. Let $Y \xrightarrow{D} A$ be any extension of it to a possibilistic map. This map satisfies $D \circ N \circ E = \mathrm{id}_A$ precisely because $g \circ f = \mathrm{id}_X$ $p$-a.e. Thus, $(A, X, Y, E, N)$ is correctable.
Conversely, suppose $(A, X, Y, E, N)$ as in (A1) is correctable, with a possibilistic recovery map $Y \xrightarrow{D} A$. Then $D$ restricts to a deterministic map $B \to A$, which is, in particular, a stochastic map. Thus, set $g : Y \to X$ to be the stochastic map given by the composite $Y \xrightarrow{D} A \xrightarrow{E} X$, using any deterministic extension of this restriction back to all of $Y$. Then $f$ is a disintegration of $(g, q, p)$.
This gives a physical interpretation to the vanishing of the conditional information loss. Namely, $K(f) = 0$ if and only if the classical code $(A := X \setminus N_p,\ X,\ Y,\ E := \mathrm{incl},\ N)$ of (A1) is correctable.
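The correctability criterion $D \circ N \circ E = \mathrm{id}_A$ can also be tested mechanically by encoding possibilistic maps as boolean matrices, with composition given by the boolean matrix product. The sets and maps in the following sketch are hypothetical and chosen only so that the identity holds; they are not taken from the paper.

```python
import numpy as np

def pcompose(g, f):
    """Compose possibilistic maps written as boolean matrices: (g∘f)_{zx} = OR_y g_{zy} f_{yx}."""
    return (g.astype(int) @ f.astype(int) > 0).astype(int)

# X = {0,1,2}, Y = {0,1,2,3}; suppose p is supported on {0,1}, so A = X \ N_p = {0,1}.
# The encoding E : A -> X is the inclusion, written as a |X| x |A| boolean matrix.
E = np.array([[1, 0],
              [0, 1],
              [0, 0]])

# The noise N : X -> Y, the possibilistic map of a hypothetical channel whose
# supports N_0 = {0,1} and N_1 = {2,3} are disjoint (x = 2 is a p-null point).
N = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])

# A recovery map D : Y -> A sending {0,1} to 0 and {2,3} to 1.
D = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])

# Correctability: D ∘ N ∘ E = id_A.
assert np.array_equal(pcompose(D, pcompose(N, E)), np.eye(2, dtype=int))
```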

Appendix B. The Markov Category Setting

In this appendix, we gather some definitions and results that indicate how our formalism extends to the setting of Markov categories [2,3] in terms of string diagrams [14].
Definition A2.
A Markov category is a symmetric monoidal category $(\mathcal{M}, \otimes, I)$, with $\otimes$ the tensor product and $I$ the unit (associators and unitors are excluded from the notation), and where each object $X$ in $\mathcal{M}$ is equipped with morphisms $!_X$ Entropy 23 01021 i003: $X \to I$ and $\Delta_X$ Entropy 23 01021 i004: $X \to X \otimes X$, all satisfying the following conditions
Entropy 23 01021 i005
expressed using string diagrams. In addition, every morphism $X \xrightarrow{f} Y$ is natural with respect to Entropy 23 01021 i006 in the sense that Entropy 23 01021 i007. A state on $X$ is a morphism $I \xrightarrow{p} X$, which is drawn as Entropy 23 01021 i008.
FinStoch is a Markov category (cf. Section 2). Although the definitions and results that follow are stated for stochastic maps, many hold for arbitrary Markov categories as well.
Definition A3
(Definition 7 in body). Let ( X , p ) f ( Y , q ) be a morphism in FinPS . The joint distribution associated with f is given by the following commutative diagram/string diagram equality:
Entropy 23 01021 i009
Proposition A1
(Extending Proposition 4). The composable pair $(X,p) \xrightarrow{f} (Y,q) \xrightarrow{g} (Z,r)$ in FinPS is a.e. coalescable if and only if there exists a deterministic morphism $Z \times X \xrightarrow{h} Y$ such that
Entropy 23 01021 i010
Proof. 
Item (c) of Proposition 4 is a direct translation of this string diagram in terms of the composition of stochastic maps between finite sets. □
(A2) provides a categorical definition of mediator (cf. Definition 17).
Remark A2.
The morphism $h$ in Proposition A1 is closely related to the abstract notion of conditionals in Markov categories ([3], Definition 11.5). Indeed, given morphisms $X \xrightarrow{F} Z \times Y$ and $I \xrightarrow{p} X$ in a Markov category, an a.e. conditional of $F$ given $Z$ is a morphism $Z \times X \xrightarrow{F|_Z} Y$ such that
Entropy 23 01021 i011
In our case, $F = (g \times \mathrm{id}_Y) \circ \Delta_Y \circ f$, and $h$ is the mediator. Therefore, a mediator is a deterministic (or at least a.e. deterministic) a.e. conditional for a specific morphism constructed from a pair of composable morphisms.
Remark A3.
In string-diagram notation, Lemma 3 reads
Entropy 23 01021 i012
Example A1
(Example 2 in body). The mediator $Z \times X \xrightarrow{h} Y$ in this case may be given by
Entropy 23 01021 i013
The following string-diagrammatic calculation
Entropy 23 01021 i014
where p-a.e. determinism of f was used in the second equality, shows that (A2) holds.
Definition A4
(Definition 21 in body). A Bayesian inverse of a morphism $(X,p) \xrightarrow{f} (Y,q)$ in FinPS is a morphism $(Y,q) \xrightarrow{\bar{f}} (X,p)$ such that the following diagram commutes/string diagram equality holds:
Entropy 23 01021 i015
Alternative proof of Proposition 8. 
A more abstract proof of Proposition 8 that is valid in an arbitrary classical Markov category can be given as follows:
Entropy 23 01021 i016
Definition A5
(Definition 23 in body). Given a stochastic map $X \xrightarrow{f} Y$, the bloom $X \xrightarrow{¡_f} X \times Y$ and shriek $X \times Y \xrightarrow{!_f} X$ of $f$ are given by
Entropy 23 01021 i017
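Since the defining string diagrams above are given pictorially, the following sketch records one concrete reading of the bloom in FinStoch, under the assumption (suggested by Figure 1 and by item (i) of the proof below) that $¡_f$ sends $x$ to the point mass at $x$ tensored with $f_x$; the flattening convention for $X \times Y$ and the helper name `bloom` are our own. The sketch checks that pushing $p$ forward along $¡_f$ recovers the joint distribution $\vartheta(f)$.

```python
import numpy as np

def bloom(f):
    """¡_f : X -> X×Y as a (|X|·|Y|) x |X| column-stochastic matrix,
    flattening the pair (x, y) to the row index x*|Y| + y."""
    nY, nX = f.shape
    b = np.zeros((nX * nY, nX))
    for x in range(nX):
        b[x * nY:(x + 1) * nY, x] = f[:, x]  # the droplet at x is split along f_x, staying in the x-block
    return b

f = np.array([[0.9, 0.2],
              [0.1, 0.8]])
p = np.array([0.6, 0.4])

theta = bloom(f) @ p           # pushforward of p along ¡_f
joint = (f * p).T.reshape(-1)  # ϑ(f)_{(x,y)} = p_x f_{yx}, flattened with the same convention
assert np.allclose(theta, joint)
```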
Diagrammatic proof of Proposition 10. 
(i)
First,
Entropy 23 01021 i018
where the second equality holds by the very definition of the joint distribution ϑ ( f ) .
(ii)
Secondly,
Entropy 23 01021 i019
(iii)
Finally, set the mediator function $Y \times X \xrightarrow{h} X \times Y$ to be the swap map. Then
Entropy 23 01021 i020
which proves that the pair $(X,p) \xrightarrow{¡_f} (X \times Y, \vartheta(f)) \xrightarrow{!_{\bar{f}} \circ \gamma} (Y,q)$ is coalescable. □

References

  1. Baez, J.C.; Fritz, T.; Leinster, T. A characterization of entropy in terms of information loss. Entropy 2011, 13, 1945–1957. [Google Scholar] [CrossRef]
  2. Cho, K.; Jacobs, B. Disintegration and Bayesian inversion via string diagrams. Math. Struct. Comp. Sci. 2019, 29, 938–971. [Google Scholar] [CrossRef] [Green Version]
  3. Fritz, T. A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics. Adv. Math. 2020, 370, 107239. [Google Scholar] [CrossRef]
  4. Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar] [CrossRef]
  5. Gromov, M. Probability, Symmetry, Linearity. 2014. Available online: https://www.youtube.com/watch?v=aJAQVletzdY (accessed on 17 November 2020).
  6. Fritz, T. Probability and Statistics as a Theory of Information Flow. 2020. Available online: https://youtu.be/H4qbYPPcZU8 (accessed on 11 November 2020).
  7. Mac Lane, S. Categories for the Working Mathematician, 2nd ed.; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1998; Volume 5. [Google Scholar] [CrossRef]
  8. Parzygnat, A.J. A functorial characterization of von Neumann entropy. arXiv 2020, arXiv:2009.07125. [Google Scholar]
  9. Fong, B. Causal Theories: A Categorical Perspective on Bayesian Networks. Master’s Thesis, University of Oxford, Oxford, UK, 2012. [Google Scholar]
  10. Parzygnat, A.J. Inverses, disintegrations, and Bayesian inversion in quantum Markov categories. arXiv 2020, arXiv:2001.08375. [Google Scholar]
  11. Smithe, T.S.C. Bayesian Updates Compose Optically. arXiv 2020, arXiv:2006.01631. [Google Scholar]
  12. Chang, H.H. An Introduction to Error-Correcting Codes: From Classical to Quantum. arXiv 2006, arXiv:0602157. [Google Scholar]
  13. Knill, E.; Laflamme, R.; Ashikhmin, A.; Barnum, H.; Viola, L.; Zurek, W.H. Introduction to Quantum Error Correction. arXiv 2002, arXiv:0207170. [Google Scholar]
  14. Selinger, P. A Survey of Graphical Languages for Monoidal Categories. Lect. Notes Phys. 2010, 289–355. [Google Scholar] [CrossRef] [Green Version]
Figure 1. A visualization of bloom and the bloom-shriek factorization via water droplets as inspired by Gromov [5]. The bloom of f splits each water droplet of volume 1 (an element of X) into several water droplets whose total volume equals 1. If X carries a probability measure p, then the initial volume of the water droplet at each element is scaled by its probability, and the stochastic map splits that droplet according to this scale.
Entropy 23 01021 g001
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
