The Information Loss of a Stochastic Map

We provide a stochastic extension of the Baez–Fritz–Leinster characterization of the Shannon information loss associated with a measure-preserving function. This recovers the conditional entropy and a closely related information-theoretic measure that we call conditional information loss. Although not functorial, these information measures are semi-functorial, a concept we introduce that is definable in any Markov category. We also introduce the notion of an entropic Bayes’ rule for information measures, and we provide a characterization of conditional entropy in terms of this rule.


Introduction
The information loss K(f) associated with a measure-preserving function f : (X, p) → (Y, q) between finite probability spaces is given by the Shannon entropy difference K(f) := H(p) − H(q), where H(p) := −∑_{x∈X} p_x log p_x is the Shannon entropy of p (and similarly for q). In [1], Baez, Fritz, and Leinster proved that the information loss satisfies, and is uniquely characterized up to a non-negative multiplicative factor by, the following conditions:

0. Positivity: K(f) ≥ 0 for all f : (X, p) → (Y, q). This says that the information loss associated with a deterministic process is always non-negative.

1. Functoriality: K(g ∘ f) = K(g) + K(f) for every composable pair (f, g) of measure-preserving maps. This says that the information loss of two successive processes is the sum of the information losses associated with each process.

2. Convex linearity: K(λ f ⊕ (1 − λ) g) = λ K(f) + (1 − λ) K(g) for all λ ∈ [0, 1]. This says that the information loss of a convex combination of processes is the corresponding convex combination of the individual information losses.

3. Continuity: K(f) is a continuous function of f. This says that the information loss does not change much under small perturbations (i.e., it is robust with respect to errors).
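To make the quantity concrete, here is a minimal numerical sketch (in Python; the helper names shannon_entropy, pushforward, and information_loss are ours, not from [1]) of the information loss of a deterministic measure-preserving map:

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_x p_x log p_x, with the convention 0 log 0 = 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def pushforward(f, p, n_out):
    """q_y = sum of p_x over all x with f(x) = y, for f : {0,...,len(p)-1} -> {0,...,n_out-1}."""
    q = [0.0] * n_out
    for x, px in enumerate(p):
        q[f(x)] += px
    return q

def information_loss(f, p, n_out):
    """K(f) = H(p) - H(q), the Shannon entropy difference."""
    return shannon_entropy(p) - shannon_entropy(pushforward(f, p, n_out))

# Merging the last two outcomes loses the information that distinguished them.
p = [0.25, 0.25, 0.5]
print(information_loss(lambda x: min(x, 1), p, 2))  # H(p) - H([0.25, 0.75]) >= 0
```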
As measure-preserving functions may be viewed as deterministic stochastic maps, it is natural to ask whether there exist extensions of the Baez-Fritz-Leinster (BFL) characterization of information loss to maps that are inherently random (i.e., stochastic) in nature.
In particular, what information-theoretic quantity captures such an information loss in this larger category?
This question is answered in the present work. Namely, we extend the BFL characterization theorem, which is valid for deterministic maps, to the larger category of stochastic maps, and in doing so we also find a characterization of the conditional entropy, even though the maps need not be deterministic. A key ingredient is the notion of an a.e. coalescable pair of stochastic maps: a composable pair for which the intermediate value can almost surely be recovered as a deterministic function of the input and output. With this definition in place (which we also generalize to the setting of arbitrary Markov categories), we replace functoriality with the following weaker condition.
1′. Semi-functoriality: K(g ∘ f) = K(g) + K(f) for every a.e. coalescable pair f : (X, p) ⇝ (Y, q), g : (Y, q) ⇝ (Z, r) of stochastic maps. This says that the conditional information loss of two successive processes is the sum of the conditional information losses associated with each process, provided that the information in the intermediate step can always be recovered.
Replacing functoriality with semi-functoriality is not enough to characterize the conditional information loss. However, it comes quite close, as only one more axiom is needed. Assuming positivity, semi-functoriality, convex linearity, and continuity, there are several equivalent axioms that may be stipulated to characterize the conditional information loss. To explain the first option, we introduce a convenient factorization of every stochastic map f : (X, p) ⇝ (Y, q). The bloom-shriek factorization of f is given by the decomposition f = π_Y ∘ !_f, where π_Y : X × Y → Y is the projection and !_f : X ⇝ X × Y is the bloom of f, whose value at x is the probability measure on X × Y sending (x′, y) to δ_{x′x} f_yx, where δ_{x′x} is the Kronecker delta. In other words, !_f records each of the probability measures f_x on a copy of Y indexed by x ∈ X. A visualization of the bloom of f is given in Figure 1a. When one is given the additional data of probability measures p and q on X and Y, respectively, then Figure 1b illustrates the bloom-shriek factorization of f. From this point of view, !_f keeps track of the information encoded in both p and f, while the projection map π_Y forgets, or loses, some of this information.

Figure 1.
A visualization of bloom and the bloom-shriek factorization via water droplets as inspired by Gromov [5].
The bloom of f splits each water droplet of volume 1 (an element of X) into several water droplets whose volumes sum to 1. If X carries a probability measure p, then the initial volume of each droplet is scaled by this probability, and the stochastic map splits the droplet according to this scale.
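The bloom-shriek factorization can be checked mechanically. In the sketch below (Python with NumPy; the names bloom and proj_Y are ours), stochastic maps are encoded as column-stochastic matrices with entry f[y, x] = f_yx, so that composition is matrix multiplication:

```python
import numpy as np

def bloom(f):
    """The bloom !_f : X ~> X x Y, with entry delta_{x'x} f_yx at ((x', y), x)."""
    ny, nx = f.shape
    b = np.zeros((nx * ny, nx))
    for x in range(nx):
        for y in range(ny):
            b[x * ny + y, x] = f[y, x]  # all the mass at input x sits on pairs (x, y)
    return b

def proj_Y(nx, ny):
    """The deterministic projection pi_Y : X x Y -> Y as a stochastic matrix."""
    pi = np.zeros((ny, nx * ny))
    for x in range(nx):
        for y in range(ny):
            pi[y, x * ny + y] = 1.0
    return pi

f = np.array([[0.9, 0.2],
              [0.1, 0.8]])                      # a stochastic map X ~> Y with |X| = |Y| = 2
assert np.allclose(proj_Y(2, 2) @ bloom(f), f)  # f = pi_Y . !_f
```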
With this in mind, our final axiom to characterize the conditional information loss is:

4(a). Reduction: K(f) = K(π_Y), where π_Y is the projection appearing in the bloom-shriek factorization of f. This says that the conditional information loss of f equals the information loss of the projection using the associated joint distribution on X × Y. Note that this axiom describes how K is determined by its action on an associated class of deterministic morphisms. These slightly modified axioms, namely semi-functoriality, convex linearity, continuity, and reduction, characterize the conditional information loss and therefore extend Baez, Fritz, and Leinster's characterization of information loss. A much simpler axiom that may be invoked in place of the reduction axiom, and which also characterizes conditional information loss, is the following.

4(b). Blooming: K(!^p) = 0, where !^p is the unique map (•, 1) ⇝ (X, p) from a one-point probability space to (X, p). This says that if a process begins with no prior information, then there is no information to be lost in the process.
The conditional entropy itself can be extracted from the conditional information loss by a process known as Bayesian inversion, which we now briefly recall. Given a stochastic map f : (X, p) ⇝ (Y, q), there exists a stochastic map f̄ : (Y, q) ⇝ (X, p) such that f̄_xy q_y = f_yx p_x for all x ∈ X and y ∈ Y (the stochastic map f̄ is the almost everywhere unique conditional probability for which Bayes' rule holds). Such a map is called a Bayesian inverse of f. The Bayesian inverse can be visualized using the bloom-shriek factorization because it itself has a bloom-shriek factorization f̄ = π_X ∘ !_f̄. This is obtained by finding the stochastic maps in the opposite direction of the arrows so that they reproduce the appropriate volumes of the water droplets.
Given this perspective on Bayesian inversion, we prove that the conditional entropy of f : (X, p) ⇝ (Y, q) equals the conditional information loss of its Bayesian inverse f̄ : (Y, q) ⇝ (X, p). Moreover, since the conditional information loss of f̄ is just the information loss of π_X, this indicates how the conditional entropy and the conditional information loss are the ordinary information losses associated with the two projections π_X and π_Y in Figure 1b. This duality also provides an interesting perspective on conditional entropy and its characterization. Indeed, using Bayesian inversion, we also characterize the conditional entropy as the unique assignment F sending measure-preserving stochastic maps between finite probability spaces to real numbers satisfying conditions 0, 1′, 2, and 3 above, but with a new axiom that reads as follows.
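Continuing the matrix encoding from before, the following sketch (the helper name bayesian_inverse is ours) computes a Bayesian inverse and checks Bayes' rule f̄_xy q_y = f_yx p_x:

```python
import numpy as np

def bayesian_inverse(f, p):
    """Return (fbar, q) with fbar_xy = p_x f_yx / q_y; columns over the nullspace of q
    are filled arbitrarily, since a Bayesian inverse is only q-a.e. unique."""
    q = f @ p
    fbar = np.zeros((len(p), len(q)))
    for y, qy in enumerate(q):
        if qy > 0:
            fbar[:, y] = p * f[y, :] / qy
        else:
            fbar[0, y] = 1.0  # arbitrary on the null set of q
    return fbar, q

p = np.array([0.5, 0.5])
f = np.array([[0.9, 0.2],
              [0.1, 0.8]])
fbar, q = bayesian_inverse(f, p)
assert np.allclose(fbar * q, (f * p).T)  # Bayes' rule: fbar_xy q_y = f_yx p_x
```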

4(c). Entropic Bayes' Rule: F(f) + F(!^p) = F(f̄) + F(!^q). This is an information-theoretic analogue of Bayes' rule, which reads f_yx p_x = f̄_xy q_y for all x ∈ X and y ∈ Y, or in more traditional probabilistic notation P(y|x)P(x) = P(x|y)P(y). In other words, we obtain a Bayesian characterization of the conditional entropy. This provides an entropic and information-theoretic description of Bayes' rule from the Markov category perspective, in a way that we interpret as answering an open question of Fritz [6].

Categories of Stochastic Maps
In the first few sections, we define all the concepts involved in proving that the conditional information loss satisfies the properties that we will later prove characterize it. This section introduces the domain category and its convex structure.

Definition 1. Let X and Y be finite sets. A stochastic map f : X ⇝ Y associates to every x ∈ X a probability measure f_x on Y. If f : X ⇝ Y is such that f_x is a point-mass distribution for every x ∈ X, then f is said to be deterministic.

Notation 1. Given a stochastic map f : X ⇝ Y, the value f_x(y) ∈ [0, 1] will be denoted by f_yx. As there exists a canonical bijection between deterministic maps of the form X ⇝ Y and functions X → Y, deterministic maps from X to Y will be denoted by the functional notation X → Y.

Definition 2.
A stochastic map of the form • ⇝ X from a single-element set to a finite set X is a single probability measure on X. Its unique value at x will be denoted by p_x for all x ∈ X. The set N_p := {x ∈ X | p_x = 0} will be referred to as the nullspace of p.

Definition 3.
Let FinStoch be the category of stochastic maps between finite sets. Given a finite set X, the identity map of X in FinStoch corresponds to the identity function id_X : X → X. Given stochastic maps f : X ⇝ Y and g : Y ⇝ Z, the composite g ∘ f : X ⇝ Z is given by the Chapman-Kolmogorov equation (g ∘ f)_zx := ∑_{y∈Y} g_zy f_yx.

Definition 4. Let X be a finite set. The copy of X is the diagonal embedding ∆_X : X → X × X, and the discard of X is the unique map from X to the terminal object • in FinStoch, which will be denoted by !_X : X → •. If Y is another finite set, the swap map is the map γ : X × Y → Y × X given by (x, y) ↦ (y, x). Given morphisms f : X ⇝ X′ and g : Y ⇝ Y′ in FinStoch, the product of f and g is the stochastic map f × g : X × Y ⇝ X′ × Y′ given by (f × g)_{(x′,y′)(x,y)} := f_{x′x} g_{y′y}. The product of stochastic maps endows FinStoch with the structure of a monoidal category. Together with the copy, discard, and swap maps, FinStoch is a Markov category [2,3].
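As a quick illustration of the Chapman-Kolmogorov composition (a sketch; representing stochastic maps as column-stochastic matrices is our convention, not notation from the text):

```python
import numpy as np

f = np.array([[0.7, 0.1],
              [0.3, 0.9]])   # f : X ~> Y, entry f[y, x] = f_yx
g = np.array([[1.0, 0.5],
              [0.0, 0.5]])   # g : Y ~> Z

gf = g @ f                    # (g . f)_zx = sum_y g_zy f_yx
assert np.allclose(gf.sum(axis=0), 1.0)  # each column is again a probability measure
```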

Definition 5.
Let FinPS (this stands for "finite probabilities and stochastic maps") be the co-slice category • ↓ FinStoch, i.e., the category whose objects are pairs (X, p) consisting of a finite set X equipped with a probability measure p, and whose morphisms from (X, p) to (Y, q) are stochastic maps f : X ⇝ Y such that q_y = ∑_{x∈X} f_yx p_x for all y ∈ Y. The subcategory of deterministic maps in FinPS will then be denoted by FinPD (which stands for "finite probabilities and deterministic maps"). A pair (f, g) of morphisms in FinPS is said to be a composable pair iff g ∘ f exists.
Note that the category FinPD was called FinProb in [1].

Remark 1.
Though it is often the case that we will denote a morphism f : (X, p) ⇝ (Y, q) in FinPS simply by f, such notation is potentially ambiguous, as the morphism f : (X, p′) ⇝ (Y, q′) is distinct from the morphism f : (X, p) ⇝ (Y, q) whenever p ≠ p′. As such, we will only employ the shorthand of denoting a morphism in FinPS by its underlying stochastic map whenever the source and target of the morphism are clear from the context.

Lemma 1. The object (•, 1) given by a single-element set equipped with the unique probability measure is a zero object (i.e., terminal and initial) in FinPS.

Definition 6. Given an object (X, p) in FinPS, the shriek and bloom of p are the unique maps to and from (•, 1), respectively, which will be denoted !_p : (X, p) → (•, 1) and !^p : (•, 1) ⇝ (X, p) (the former is deterministic, while the latter is stochastic). The underlying stochastic maps associated with !_p and !^p are !_X : X → • and p : • ⇝ X, respectively.

Example 1. Since (•, 1) is a zero object, given any two objects (X, p) and (Y, q), there exists at least one morphism (Y, q) ⇝ (X, p), namely the composite !^p ∘ !_q.

It is possible to take convex combinations of both objects and morphisms in FinPS, and such assignments will play a role in our characterization of conditional entropy.

Definition 8. Let p : • ⇝ X be a probability measure and let {(Y_x, q_x)}_{x∈X} be a collection of objects in FinPS indexed by X. The p-weighted convex sum ∐_{x∈X} p_x (Y_x, q_x) is defined to be the disjoint union ∐_{x∈X} Y_x equipped with the probability measure whose value at y ∈ Y_x is p_x (q_x)_y.
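A small sketch of Definition 8 (the function name convex_sum is ours): the p-weighted convex sum places the measure p_x q_x on the x-th summand of the disjoint union.

```python
def convex_sum(p, qs):
    """Return the measure on the disjoint union of the Y_x, encoded as a list of
    ((x, y), mass) pairs with mass = p_x * q_x(y)."""
    return [((x, y), px * qy)
            for x, px in enumerate(p)
            for y, qy in enumerate(qs[x])]

r = convex_sum([0.5, 0.5], [[1.0], [0.25, 0.75]])
assert abs(sum(m for _, m in r) - 1.0) < 1e-12  # total mass is again 1
```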

The Baez-Fritz-Leinster Characterization of Information Loss
In [1], Baez, Fritz, and Leinster (BFL) characterized the Shannon entropy difference associated with measure-preserving functions between finite probability spaces as the only non-vanishing, continuous, convex linear functor from FinPD to the non-negative reals (up to a multiplicative constant). It is then natural to ask whether there exist extensions or analogues of their result that include the non-deterministic morphisms of the larger category FinPS. Before delving deeper into this question, we first recall the characterization theorem of BFL in detail.

Definition 9.
Let BR be the convex category consisting of a single object and whose set of morphisms is R. The composition in BR is given by addition. Convex combinations of morphisms are given by ordinary convex combinations of numbers. The subcategory of non-negative reals will be denoted BR ≥0 .
In the rest of the paper, we will not necessarily assume that assignments from one category to another are functors. Nevertheless, we do assume they form (class) functions (see ([7], Section I.7) for more details). Furthermore, we assume that they respect or reflect sources and targets in the following sense. If C and D are two categories, all functions F : C → D are either covariant or contravariant in the sense that for any morphism a : x → y in C, F(a) is a morphism F(x) → F(y) or F(y) → F(x), respectively. These are the only types of functions between categories we will consider in this work. As such, we abuse terminology and use the term functions for such assignments throughout. If M is a commutative monoid and BM denotes its one-object category, then every covariant function C → BM is also contravariant and vice versa.
We now define a notion of continuity for functions of the form F : FinPS → BR.

Definition 10. A sequence of morphisms f_n : (X_n, p_n) ⇝ (Y_n, q_n) in FinPS (indexed by n ∈ ℕ) converges to a morphism f : (X, p) ⇝ (Y, q) if and only if the following two conditions hold.

(a) There exists an N ∈ ℕ for which X_n = X and Y_n = Y for all n ≥ N.
(b) The following limits hold: lim_{n→∞} p_n = p and lim_{n→∞} f_n = f (note that these limits necessarily imply lim_{n→∞} q_n = q).

A function F : FinPS → BR is then said to be continuous if and only if lim_{n→∞} F(f_n) = F(f) whenever f_n converges to f.

Remark 2.
In the subcategory FinPD, since the topology on the collection of functions from a finite set X to another finite set Y is discrete, one can equivalently assume that a sequence f_n as in Definition 10, but now with all f_n deterministic, converges to f : (X, p) → (Y, q) if and only if the following two conditions hold.

(a) There exists an N ∈ ℕ for which X_n = X, Y_n = Y, and f_n = f for all n ≥ N.
(b) lim_{n→∞} p_n = p.
In this way, our definition of convergence agrees with the definition of convergence of BFL on the subcategory FinPD [1].

Definition 11. A function F : FinPS → BR is said to be convex linear if and only if F(∐_{x∈X} p_x f_x) = ∑_{x∈X} p_x F(f_x) for all objects (X, p) in FinPS and all X-indexed collections {f_x}_{x∈X} of morphisms in FinPS.

Definition 13. Let p : • ⇝ X be a probability measure. The Shannon entropy of p is given by H(p) := −∑_{x∈X} p_x log p_x. When considering any entropic quantity, we will always adhere to the convention that 0 log(0) = 0.
Given a morphism f : (X, p) → (Y, q) in FinPD, the quantity K(f) := H(p) − H(q) will be referred to as the information loss of f. Information loss defines a functor K : FinPD → BR, henceforth referred to as the information loss functor on FinPD.
Theorem 1 (Baez-Fritz-Leinster [1]). Suppose a function F : FinPD → BR≥0 satisfies 1. functoriality, 2. convex linearity, and 3. continuity. Then F is a non-negative multiple of information loss. Conversely, the information loss functor is non-negative and satisfies conditions 1-3.
In light of Theorem 1, it is natural to question whether or not there exists a functor K : FinPS → BR ≥0 that restricts to FinPD as the information loss functor. It turns out that no such non-vanishing functor exists, as we prove in the following proposition.
Proposition 1. Every functor F : FinPS → BR≥0 vanishes identically. In particular, no non-vanishing functor FinPS → BR≥0 restricts to the information loss functor on FinPD.

Proof. For every object (X, p), the composite !_p ∘ !^p is the unique endomorphism of (•, 1), namely the identity, so that F(!_p) + F(!^p) = F(id) = 0, and hence F(!_p) = F(!^p) = 0 by non-negativity. Now let f : (X, p) ⇝ (Y, q) be an arbitrary morphism. Since (•, 1) is initial, f ∘ !^p = !^q, so that F(f) = F(f ∘ !^p) = F(!^q) = 0. Let g : (Y, q) ⇝ (X, p) be any morphism in FinPS (which necessarily exists by Example 1, for instance). Then a similar calculation yields F(g) = 0. Hence, F(f) = 0 for every morphism f in FinPS.

Extending the Information Loss Functor
Proposition 1 shows it is not possible to extend the information loss functor to a functor on FinPS. Nevertheless, in this section, we define a non-vanishing function K : FinPS → BR ≥0 that restricts to the information loss functor on FinPD, which we refer to as conditional information loss. While K is not functorial, we show that it satisfies many important properties such as continuity, convex linearity, and invariance with respect to compositions with isomorphisms. Furthermore, in Section 5 we show K is functorial on a restricted class of composable pairs of morphisms (cf. Definition 18), which are definable in any Markov category. At the end of this section we characterize conditional information loss as the unique extension of the information loss functor satisfying the reduction axiom 4(a) as stated in the introduction. In Section 8, we prove an intrinsic characterization theorem for K without reference to the deterministic subcategory FinPD inside FinPS. Appendix A provides an interpretation of the vanishing of conditional information loss in terms of correctable codes.
Definition 15. The conditional information loss of a morphism f : (X, p) ⇝ (Y, q) in FinPS is the real number given by K(f) := H(p) − H(q) + H(f|p), where H(f|p) := ∑_{x∈X} p_x H(f_x) is the conditional entropy of f given p.

Lemma 2. For every morphism f : (X, p) ⇝ (Y, q) in FinPS,

K(f) = ∑_{x∈X} ∑_{y∈Y} p_x f_yx log( q_y / (p_x f_yx) ).

Proof of Lemma 2. Applying K to f and expanding each of the three entropy terms yields the claimed expression.

Proof of Proposition 2. (i) The non-negativity of K follows from Lemma 2 and the inequality q_y = ∑_{x′∈X} f_yx′ p_x′ ≥ f_yx p_x. (ii) Let {Q_x : (Y_x, q_x) ⇝ (Y′_x, q′_x)}_{x∈X} be a collection of morphisms in FinPS indexed by X. Then the p-weighted convex sum ∐_{x∈X} p_x Q_x is a morphism in FinPS of the form h : (Z, r) ⇝ (Z′, r′), where Z := ∐_{x∈X} Y_x, Z′ := ∐_{x∈X} Y′_x, h := ∐_{x∈X} p_x Q_x, r := ⊕_{x∈X} p_x q_x, and r′ := ⊕_{x∈X} p_x q′_x. A direct computation then gives K(h) = ∑_{x∈X} p_x K(Q_x), which shows that K is convex linear.
(iii) Let f^(n) : (X^(n), p^(n)) ⇝ (Y^(n), q^(n)) be a sequence (indexed by n ∈ ℕ) of probability-preserving stochastic maps such that X^(n) = X and Y^(n) = Y for large enough n, and where lim_{n→∞} f^(n) = f, lim_{n→∞} p^(n) = p, and lim_{n→∞} q^(n) = q. Then lim_{n→∞} K(f^(n)) = K(f), where the last equality follows from the fact that the limit and the (finite) sums can be interchanged and all expressions involved are continuous on [0, 1].
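The following sketch (Python with NumPy; cond_info_loss and the other helper names are ours) computes K(f) = H(p) − H(q) + H(f|p) directly and illustrates item (i):

```python
import numpy as np

def entropy(v):
    v = np.asarray(v, dtype=float)
    v = v[v > 0]                      # convention: 0 log 0 = 0
    return -(v * np.log(v)).sum()

def cond_entropy(f, p):
    """H(f|p) = sum_x p_x H(f_x)."""
    return sum(px * entropy(f[:, x]) for x, px in enumerate(p) if px > 0)

def cond_info_loss(f, p):
    """K(f) = H(p) - H(q) + H(f|p), with q the pushforward of p along f."""
    q = f @ np.asarray(p, dtype=float)
    return entropy(p) - entropy(q) + cond_entropy(f, p)

p = [0.5, 0.5]
f = np.array([[0.9, 0.2],
              [0.1, 0.8]])
assert cond_info_loss(f, p) >= 0  # non-negativity, item (i)
```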

Remark 3.
Since conditional entropy vanishes on deterministic morphisms, conditional information loss restricts to FinPD as the information loss functor. It is important to note that if the term H(f|p) were not included in the expression for K(f), then the inequality K(f) ≥ 0 would fail in general. When f is deterministic, Baez, Fritz, and Leinster proved H(p) − H(q) ≥ 0. However, when f is stochastic, the inequality H(p) − H(q) ≥ 0 does not hold in general. This has to do with the fact that stochastic maps may increase entropy, whereas deterministic maps always decrease it (while this claim holds in the classical setting as stated, it no longer holds for quantum systems [8]). As such, the term H(f|p) is needed to retain non-negativity as one attempts to extend BFL's functor K on FinPD to a function on FinPS.
Item (v) of Proposition 2 says that the conditional information loss of a morphism f : (X, p) ⇝ (Y, q) in FinPS equals the information loss of the deterministic morphism π_Y : (X × Y, ϑ(f)) → (Y, q) in FinPD, where ϑ(f) denotes the joint distribution ϑ(f)_(x,y) := f_yx p_x, so that the conditional information loss of a morphism in FinPS may always be reduced to the information loss of a deterministic map in FinPD naturally associated with it and having the same target. This motivates the following definition.

Definition 16. A function F : FinPS → BR is said to be reductive if and only if F(f) = F(π_Y) for every morphism f : (X, p) ⇝ (Y, q) in FinPS, where π_Y is the projection from the bloom-shriek factorization of f.

Proposition 3. Suppose F : FinPS → BR is a function such that (i) the restriction of F to FinPD satisfies conditions 1-3 of Theorem 1 together with non-negativity, and (ii) F is reductive. Then F is a non-negative multiple of conditional information loss. Conversely, conditional information loss satisfies conditions (i) and (ii).
Proof. This follows immediately from Theorem 1 and item (v) of Proposition 2.
In what follows, we will characterize conditional information loss without any explicit reference to the subcategory FinPD or the information loss functor of Baez, Fritz, and Leinster. To do this, we first need to develop some machinery.

Coalescable Morphisms and Semi-Functoriality
While conditional information loss is not functorial on FinPS, we know it acts functorially on deterministic maps. As such, it is natural to ask on which pairs of composable stochastic maps the conditional information loss acts functorially. In this section, we answer this question, and we then use our result to define a property of functions FinPS → BR that is a weakening of functoriality, which we refer to as semi-functoriality. Our definitions are valid in any Markov category (cf. Appendix B).

Definition 17. A deterministic map h : Z × X → Y is said to be a mediator for the composable pair f : (X, p) ⇝ (Y, q), g : (Y, q) ⇝ (Z, r) if and only if

g_{z h(z,x)} f_{h(z,x) x} = (g ∘ f)_{zx} for all (z, x) ∈ Z × (X \ N_p).  (1)

If in fact Equation (1) holds for all (z, x) ∈ Z × X, then h is said to be a strong mediator for the composable pair X ⇝ Y ⇝ Z.

Definition 18. The composable pair (f, g) is said to be a.e. coalescable if and only if it admits a mediator, and coalescable if and only if it admits a strong mediator.

The following conditions on a composable pair (f, g) are equivalent:

(a) For every x ∈ X \ N_p and z ∈ Z, there exists at most one y ∈ Y such that g_zy f_yx ≠ 0.
(b) There exists a mediator h for (f, g).
(c) There exists a deterministic map h : Z × X → Y such that

h_{y(z,x)} (g ∘ f)_{zx} p_x = g_{zy} f_{yx} p_x for all (x, y, z) ∈ X × Y × Z.  (2)
Proof. ((a)⇒(b)) For every (z, x) ∈ Z × (X \ N_p) for which such a y exists, set h(z, x) := y. If no such y exists, or if x ∈ N_p, set h(z, x) to be anything. Then h is a mediator for (f, g). ((b)⇒(c)) Let h be a mediator for (f, g). Since (2) holds automatically for x ∈ N_p, suppose x ∈ X \ N_p, in which case (2) is equivalent to h_{y(z,x)} (g ∘ f)_{zx} = g_{zy} f_{yx} for all (z, y) ∈ Z × Y. This follows from Equation (1) and the fact that h is a function. ((c)⇒(a)) Let (z, x) ∈ Z × (X \ N_p) and suppose (g ∘ f)_{zx} > 0. If h is the mediator, then ∑_{y∈Y} g_zy f_yx = (g ∘ f)_{zx} = g_{z h(z,x)} f_{h(z,x) x}. But since g_zy f_yx = 0 for all y ≠ h(z, x), there is only one non-vanishing term in this sum, and it is precisely g_{z h(z,x)} f_{h(z,x) x}.
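Criterion (a) is easy to test numerically. The sketch below (the name is_ae_coalescable is ours) checks it for a pair of composable stochastic matrices:

```python
import numpy as np

def is_ae_coalescable(f, g, p, tol=1e-12):
    """Criterion (a): for every x outside the nullspace of p and every z,
    at most one y satisfies g_zy f_yx != 0."""
    nz, ny = g.shape
    for x, px in enumerate(p):
        if px <= tol:
            continue
        for z in range(nz):
            if sum(1 for y in range(ny) if g[z, y] * f[y, x] > tol) > 1:
                return False
    return True

p = np.array([0.5, 0.5])
f = np.eye(2)                    # deterministic first map: always coalescable
g = np.array([[0.3, 0.6],
              [0.7, 0.4]])
h = np.full((2, 2), 0.5)         # maximally noisy first map
assert is_ae_coalescable(f, g, p)
assert not is_ae_coalescable(h, g, p)
```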
Theorem 2. Let f : (X, p) ⇝ (Y, q), g : (Y, q) ⇝ (Z, r) be a composable pair of morphisms in FinPS. Then H(g|q) + H(f|p) = H(g ∘ f|p) if and only if the pair (f, g) is a.e. coalescable.

We first prove two lemmas.
Lemma 3. Let f : (X, p) ⇝ (Y, q), g : (Y, q) ⇝ (Z, r) be a pair of composable morphisms. Then H(g|q) + H(f|p) = H(g ∘ f|p) if and only if

∑_{x∈X} ∑_{z∈Z} ∑_{y∈Y} p_x g_zy f_yx log( g_zy f_yx / (g ∘ f)_zx ) = 0.  (3)

Note that this equality still holds if g_zy = 0 or f_yx = 0, as each step in this calculation accounts for such possibilities.

Lemma 4. With the same notation,

H(g|q) + H(f|p) − H(g ∘ f|p) = −∑_{x∈X} ∑_{z∈Z} ∑_{y∈Y} p_x g_zy f_yx log( g_zy f_yx / ∑_{y′∈Y} g_zy′ f_y′x ).  (4)

Note that the order of the sums matters in this expression, and also note that it is always well-defined since g_zy f_yx ≠ 0 implies (g ∘ f)_zx ≠ 0.

Proof of Lemma 4. For convenience, temporarily set σ_zx := ∑_{y′∈Y} g_zy′ f_y′x = (g ∘ f)_zx. Expanding H(g|q), H(f|p), and H(g ∘ f|p) using q_y = ∑_{x∈X} f_yx p_x and collecting the logarithms yields the right-hand side of (4), which proves the claim due to the definition of the composition of stochastic maps.

Proof of Theorem 2. Temporarily set ℵ := H(g|q) + H(f|p) − H(g ∘ f|p).
In addition, note that the set of all pairs (z, x) with x ∈ X \ N_p and z ∈ Z \ N_{(g∘f)_x} can be given a more explicit description in terms of the joint distribution s := γ ∘ ϑ(g ∘ f) : • ⇝ Z × X associated with the composite g ∘ f and the prior p, namely s_(z,x) := (g ∘ f)_zx p_x. Indeed,

{(z, x) ∈ Z × X : x ∈ X \ N_p and z ∈ Z \ N_{(g∘f)_x}} = (Z × X) \ N_s.  (5)

(⇒) Suppose ℵ = 0, which is equivalent to Equation (3) by Lemma 3. Then, since each term in the sum from Lemma 4 is non-negative, each term must vanish. Hence, fix x ∈ X \ N_p, y ∈ Y \ N_{f_x}, and z ∈ Z \ N_{g_y}. The corresponding term vanishes if and only if g_zy f_yx = (g ∘ f)_zx. Hence, for every x ∈ X \ N_p and z ∈ Z \ N_{(g∘f)_x}, there exists a unique y ∈ Y such that g_zy f_yx ≠ 0. But by (5), this means that for every (z, x) ∈ (Z × X) \ N_s, there exists a unique y ∈ Y such that g_zy f_yx ≠ 0. This defines a function (Z × X) \ N_s → Y \ N_q, which can be extended in an s-a.e. unique manner to a function h : Z × X → Y satisfying

g_{z h(z,x)} f_{h(z,x) x} ≠ 0 and g_zy f_yx = 0 for all y ∈ Y \ {h(z, x)}, for every (z, x) ∈ (Z × X) \ N_s.  (6)

We now show the function h is in fact a mediator for the composable pair (f, g) by verifying (2). The equality clearly holds if x ∈ N_p, since both sides vanish. Hence, suppose that x ∈ X \ N_p. Given y ∈ Y and z ∈ Z \ N_{(g∘f)_x}, the left-hand side of (2) is h_{y(z,x)} (g ∘ f)_zx p_x, which equals g_zy f_yx p_x by Equation (6). Similarly, if x ∈ X \ N_p and z ∈ N_{(g∘f)_x}, then g_zy f_yx = 0 for all y ∈ Y, because otherwise (g ∘ f)_zx p_x would be nonzero, so both sides of (2) vanish. Therefore, (2) holds.
(⇐) Conversely, suppose a mediator h exists, and let k : X ⇝ Z × Y be the stochastic map given on components by k_{(z,y)x} := h_{y(z,x)} (g ∘ f)_zx. A direct computation using (2) then shows that ℵ = 0.

Corollary 1. The conditional information loss is functorial on a composable pair (f, g) of morphisms in FinPS if and only if (f, g) is a.e. coalescable.

Proof. Since the Shannon entropy difference is always functorial, the conditional information loss is functorial on a pair of morphisms if and only if the conditional entropy is functorial on that pair. Theorem 2 then completes the proof.

Example 2. In the notation of Theorem 2, suppose that f is p-a.e. deterministic, which means f_yx = δ_{y f(x)} for all x ∈ X \ N_p for some function f (abusing notation). In this case, the deviation from functoriality, (4), simplifies to zero, since g_zy f_yx ≠ 0 forces y = f(x) and hence g_zy f_yx = (g ∘ f)_zx. Therefore, if f is p-a.e. deterministic, H(g|q) + H(f|p) = H(g ∘ f|p). In this case, a mediator is given by h(z, x) := f(x), so that (f, g) is a.e. coalescable for any g. In particular, every pair of composable morphisms in FinPD is a.e. coalescable.
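A numerical check of Example 2 (a sketch; helper names ours): when the first map is deterministic, conditional entropy is additive on the pair.

```python
import numpy as np

def entropy(v):
    v = np.asarray(v, dtype=float)
    v = v[v > 0]
    return -(v * np.log(v)).sum()

def cond_entropy(f, p):
    return sum(px * entropy(f[:, x]) for x, px in enumerate(p) if px > 0)

p = np.array([0.3, 0.7])
f = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])       # deterministic map X -> Y (as a stochastic matrix)
g = np.array([[0.2, 0.5, 0.1],
              [0.8, 0.5, 0.9]])  # arbitrary stochastic map Y ~> Z
q = f @ p
lhs = cond_entropy(g, q) + cond_entropy(f, p)
assert np.isclose(lhs, cond_entropy(g @ f, p))  # H(g|q) + H(f|p) = H(g.f|p)
```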
In light of Theorem 2 and Corollary 1, we make the following definition, which will serve as one of the axioms in our later characterizations of both conditional information loss and conditional entropy.

Definition 19. A function F : FinPS → BR is said to be semi-functorial if and only if F(g ∘ f) = F(g) + F(f) for every a.e. coalescable pair (f, g) of composable morphisms in FinPS.

Proposition 5. Suppose F : FinPS → BR is semi-functorial. Then the restriction of F to FinPD is functorial. In particular, if F is, in addition, convex linear, continuous, and reductive, then F is a non-negative multiple of conditional information loss.
Proof. By Example 2, every pair of composable morphisms in FinPD is a.e. coalescable. Therefore, F is functorial on FinPD. The second claim then follows from Proposition 3.
The following lemma will be used in later sections and serves to illustrate some examples of a.e. coalescable pairs.

Lemma 5. Let e : (W, o) → (X, p), f : (X, p) ⇝ (Y, q), and g : (Y, q) ⇝ (Z, r) be a triple of composable morphisms with e deterministic and g invertible. Then each of the following pairs is a.e. coalescable: (e, f), (f, g), (e, g ∘ f), and (f ∘ e, g). The last two claims follow from the proofs of the first two claims.

Bayesian Inversion
In this section, we recall the concepts of a.e. equivalence and Bayesian inversion phrased in a categorical manner [2,3,9], as they will play a significant role moving forward.
Definition 20. Let f : (X, p) ⇝ (Y, q) and g : (X, p) ⇝ (Y, q) be two morphisms in FinPS with the same source and target. Then f and g are said to be almost everywhere equivalent (or p-a.e. equivalent) if and only if f_yx = g_yx for every x ∈ X with p_x ≠ 0. In such a case, the p-a.e. equivalence of f and g will be denoted by f =_p g.

Theorem 3 (Bayesian Inversion [2,9,10]). Let f : (X, p) ⇝ (Y, q) be a morphism in FinPS. Then there exists a morphism f̄ : (Y, q) ⇝ (X, p) such that f̄_xy q_y = f_yx p_x for all x ∈ X and y ∈ Y. Furthermore, for any other morphism f̄′ : (Y, q) ⇝ (X, p) satisfying this condition, f̄′ =_q f̄.
Definition 21. The morphism f̄ : (Y, q) ⇝ (X, p) appearing in Theorem 3 will be referred to as a Bayesian inverse of f : (X, p) ⇝ (Y, q). It follows that f̄_xy = p_x f_yx / q_y for all y ∈ Y with q_y ≠ 0.

Bayesian inverses enjoy the following properties.

(ii) Given two morphisms f : (X, p) ⇝ (Y, q) and g : (Y, q) ⇝ (X, p) in FinPS, f is a Bayesian inverse of g if and only if g is a Bayesian inverse of f.

(iii) Let f̄ : (Y, q) ⇝ (X, p) be a Bayesian inverse of f : (X, p) ⇝ (Y, q), and let γ : X × Y → Y × X be the swap map (as in Definition 4). Then ϑ(f̄) = γ ∘ ϑ(f).

(iv) Let (f, g) be a composable pair of morphisms in FinPS, and suppose f̄ and ḡ are Bayesian inverses of f and g, respectively. Then (ḡ, f̄) is a composable pair, and f̄ ∘ ḡ is a Bayesian inverse of g ∘ f.
Proof. These are immediate consequences of the categorical definition of a Bayesian inverse (see [3,10,11] for proofs).

A choice of Bayesian inverse for each morphism of FinPS assembles into a contravariant assignment B : FinPS → FinPS, which we refer to as a Bayesian inversion functor. This is mildly abusive terminology since functoriality only holds in the a.e. sense, as explained in the following remark.
For items (ii) and (iii), note that !_f and !_f̄ can be expressed as composites of isomorphisms and certain convex combinations. Applying items (ii) and (iii) of Lemma 7 to these decompositions, and then item (i) of Lemma 7, yields the desired equality with F(g).

An Intrinsic Characterization of Conditional Information Loss
Theorem 4. Suppose F : FinPS → BR≥0 satisfies the following conditions: 1. semi-functoriality; 2. convex linearity; 3. continuity; 4. blooming, i.e., F(!^p) = 0 for every object (X, p). Then F is a non-negative multiple of conditional information loss. Conversely, conditional information loss satisfies conditions 1-4.

Proof. Suppose F satisfies conditions 1-4, and let f : (X, p) ⇝ (Y, q) be an arbitrary morphism in FinPS with associated joint distribution ϑ(f) on X × Y. The pair (!^p, !_f) is a.e. coalescable and !_f ∘ !^p = !^{ϑ(f)}, so semi-functoriality and blooming give F(!_f) = F(!^{ϑ(f)}) − F(!^p) = 0. The pair (!_f, π_Y) is likewise a.e. coalescable, so F(f) = F(π_Y ∘ !_f) = F(π_Y) + F(!_f) = F(π_Y). Thus, F is reductive (see Definition 16) and Proposition 5 applies. This gives not just a simple mathematical criterion, but one with a simple intuitive interpretation as well. Namely, condition 4 says that if a process begins with no prior information, then there is no information to be lost in the process.
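A direct check of the blooming axiom for conditional information loss (a sketch; the encoding is ours): for the bloom !^p, the entropy H(p) created by the splitting exactly cancels the conditional entropy term.

```python
import numpy as np

def entropy(v):
    v = np.asarray(v, dtype=float)
    v = v[v > 0]
    return -(v * np.log(v)).sum()

p = np.array([0.1, 0.2, 0.7])
bloom_p = p.reshape(-1, 1)   # !^p : (•, 1) ~> (X, p) as a single-column stochastic matrix
# K(!^p) = H(1) - H(p) + H(!^p | 1) = 0 - H(p) + H(p)
K = entropy([1.0]) - entropy(p) + 1.0 * entropy(bloom_p[:, 0])
assert np.isclose(K, 0.0)    # blooming: K(!^p) = 0
```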
We now use Theorem 4 and Bayesian inversion to prove a statement dual to Theorem 4.

Theorem 5. Suppose F : FinPS → BR≥0 satisfies the following conditions: 1. semi-functoriality; 2. convex linearity; 3. continuity; 4. F(!_p) = 0 for every object (X, p), where !_p : (X, p) → (•, 1) is the shriek of p. Then F is a non-negative multiple of conditional entropy. Conversely, conditional entropy satisfies conditions 1-4.
Before giving a proof, we introduce some terminology and prove a few lemmas. We also would like to point out that condition 4 may be given an operational interpretation as follows: if a communication channel has a constant output, then it has no conditional entropy.
Definition 26. Let F : FinPS → BR be a function and let B be a Bayesian inversion functor. Then F̄ := F ∘ B will be referred to as a Bayesian reflection of F.

Remark 9.
By Proposition 11, if F : FinPS → BR is a convex linear semi-functor, then a Bayesian reflection of F is independent of the choice of Bayesian inversion functor, and as such, is necessarily unique.

Proof of Theorem 5. We show that the Bayesian reflection F̄ satisfies the hypotheses of Theorem 4. Semi-functoriality: given an a.e. coalescable pair (f, g), F̄(g ∘ f) = F(f̄ ∘ ḡ) = F(f̄) + F(ḡ) by Proposition 8, which equals F̄(f) + F̄(g) by Lemma 8.
Convex linearity: given any probability space (X, p) and a family of morphisms {f_x : (Y_x, q_x) ⇝ (Y′_x, q′_x)}_{x∈X}, the Bayesian inverse of the p-weighted convex sum ∐_{x∈X} p_x f_x is, up to a.e. equivalence, the p-weighted convex sum of the Bayesian inverses f̄_x, so that F̄ is convex linear. Blooming: the Bayesian inverse of a bloom !^p is the shriek !_p, so F̄(!^p) = F(!_p) = 0 by condition 4. Continuity follows from the continuity of F together with the explicit formula for Bayesian inverses. Hence F̄ satisfies the hypotheses of Theorem 4, so F̄ is a non-negative multiple of conditional information loss, and taking Bayesian reflections once more shows that F is the corresponding non-negative multiple of the Bayesian reflection of conditional information loss, which is conditional entropy. Thus, F is a non-negative multiple of conditional entropy.

A Bayesian Characterization of Conditional Entropy
We now prove a reformulation of Theorem 5, where condition 4 is replaced by a condition that we view as an 'entropic Bayes' rule'.
Definition 27. A function F : FinPS → BR satisfies an entropic Bayes' rule if and only if F(f) + F(!^p) = F(f̄) + F(!^q) for every morphism f : (X, p) ⇝ (Y, q) in FinPS and any Bayesian inverse f̄ of f.

Remark 10.
The entropic Bayes' rule is an abstraction of the conditional entropy identity (7), H(f|p) + H(p) = H(f̄|q) + H(q).

Theorem 6. Suppose F : FinPS → BR≥0 satisfies the following conditions: 1. semi-functoriality; 2. convex linearity; 3. continuity; 4. the entropic Bayes' rule. Then F is a non-negative multiple of conditional entropy. Conversely, conditional entropy satisfies conditions 1-4.
Proof. By Theorem 5, it suffices to show F(!_p) = 0 for every object (X, p) in FinPS. For this, first note that the Bayesian inverse of !_p : (X, p) → (•, 1) is the bloom !^p : (•, 1) ⇝ (X, p), and that the bloom of the unique probability measure on • is id_(•,1). Applying the entropic Bayes' rule from Definition 27 to the morphism !_p yields F(!_p) + F(!^p) = F(!^p) + F(id_(•,1)) = F(!^p), since F(id_(•,1)) = 0 by semi-functoriality. Hence F(!_p) = 0, as desired.
Remark 11. In ( [6], slide 21), Fritz asked if there is a Markov category for information theory explaining the analogy between Bayes' rule P(A|B)P(B) = P(B|A)P(A) and the conditional entropy identity H(A|B) + H(B) = H(B|A) + H(A). In light of our work, we feel we have an adequate categorical explanation for this analogy, which we now explain.
Let f : (X, p) ⇝ (Y, q) be an arbitrary morphism in FinPS, and suppose F : FinPS → BR≥0 is semi-functorial. Then the commutative diagram (cf. Definition A4) expressing ϑ(f̄) = γ ∘ ϑ(f) is a coalescable square (where γ is the swap map), i.e., the composites !_f ∘ !^p and (γ ∘ !_f̄) ∘ !^q arise from coalescable pairs. The semi-functoriality of F then implies the identity F(!_f) + F(!^p) = F(γ ∘ !_f̄) + F(!^q). Now suppose, as in the case of conditional entropy, that F satisfies the further condition that F(f) = F(!_f) for every morphism f. Then the commutativity of (9) and this identity are equivalent to the following two respective equations:

Bayes' rule: f_yx p_x = f̄_xy q_y
Entropic Bayes' rule: F(f) + F(!^p) = F(f̄) + F(!^q)

In the case that F = H, where H is the conditional entropy, we have H(!^r|1) = H(r) for every object (Z, r) in FinPS (where H(r) is the Shannon entropy). Thus, the entropic Bayes' rule becomes H(f|p) + H(p) = H(f̄|q) + H(q), which is the classical identity for conditional entropy.
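A numerical sanity check of the entropic Bayes' rule for F = H (a sketch; helper names ours), i.e., of the identity H(f|p) + H(p) = H(f̄|q) + H(q):

```python
import numpy as np

def entropy(v):
    v = np.asarray(v, dtype=float)
    v = v[v > 0]
    return -(v * np.log(v)).sum()

def cond_entropy(f, p):
    return sum(px * entropy(f[:, x]) for x, px in enumerate(p) if px > 0)

p = np.array([0.25, 0.75])
f = np.array([[0.6, 0.3],
              [0.4, 0.7]])
q = f @ p
fbar = (f * p).T / q                    # Bayesian inverse (q has full support here)
lhs = cond_entropy(f, p) + entropy(p)   # H(f|p) + H(p)
rhs = cond_entropy(fbar, q) + entropy(q)
assert np.isclose(lhs, rhs)             # the entropic Bayes' rule for F = H
```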

Concluding Remarks
In this paper, we have provided novel characterizations of conditional entropy and the information loss of a probability-preserving stochastic map. The constructions we introduced to prove our main results are general enough to be applicable in the recent framework of synthetic probability [3]. By weakening functoriality and finding the appropriate substitute that we call semi-functoriality, we have shown how certain aspects of quantitative information theory can be done in this categorical framework. In particular, we have illustrated how Bayes' rule can be formulated entropically and used as an axiom in characterizing conditional entropy.
Immediate questions from our work arise, such as the extendibility of conditional entropy to other Markov categories or even quantum Markov categories [10]. Work in this direction might offer a systematic approach towards conditional entropy in quantum mechanics. It would also be interesting to see what other quantitative features of information theory can be described from such a perspective, or whether new ones will emerge.

Conversely, suppose (A, X, Y, E, N) as in (A1) is correctable, with a possibilistic recovery map D : Y ⇝ A. Then N restricts to a deterministic map N : B → A, which is, in particular, a stochastic map. Thus, set g : Y ⇝ X to be the stochastic map given by the composite of N followed by the inclusion E : A → X. Then f is a disintegration of (g, q, p).
This gives a physical interpretation to the vanishing of conditional information loss. Namely, K(f) = 0 if and only if (A := X \ N_p, X, Y, E := incl, N := f) is correctable.