Dirichlet Polynomials and Entropy

A Dirichlet polynomial d in one variable y is a function of the form d(y)=anny+⋯+a22y+a11y+a00y for some n,a0,…,an∈N. We will show how to think of a Dirichlet polynomial as a set-theoretic bundle, and thus as an empirical distribution. We can then consider the Shannon entropy H(d) of the corresponding probability distribution, and we define its length (or, classically, its perplexity) by L(d)=2H(d). On the other hand, we will define a rig homomorphism h:Dir→Rect from the rig of Dirichlet polynomials to the so-called rectangle rig, whose underlying set is R⩾0×R⩾0 and whose additive structure involves the weighted geometric mean; we write h(d)=(A(d),W(d)), and call the two components area and width (respectively). The main result of this paper is the following: the rectangle-area formula A(d)=L(d)W(d) holds for any Dirichlet polynomial d. In other words, the entropy of an empirical distribution can be calculated entirely in terms of the homomorphism h applied to its corresponding Dirichlet polynomial. We also show that similar results hold for the cross entropy.


Introduction
The purpose of this paper is simply to provide another categorical treatment of entropy of probability distributions, which turns out to be computed in terms of a rig homomorphism; our treatment also generalises to cross entropy (and thus to Kullback-Leibler divergence). What is particularly interesting about the treatment outlined here is that we can somewhat "visualise" the notion of entropy in terms of sizes of coding schemes (cf. Section 6). Not only that, but classical entropy is only homomorphic in the product of distributions, whereas the notion that we describe here is homomorphic in both the product and the sum.
A brief outline of this paper is as follows: Section 2: We recall the definitions of Dirichlet polynomials and set-theoretic bundles, along with their rig structures, from [1] (one important thing to note is the following: Dirichlet polynomials are well-studied objects in the setting of complex analysis, but we cannot apply tools from this area to our setting, because we have only natural number coefficients and non-negative exponents); we then study the equivalence between these two notions. Section 3: We explain how empirical probability distributions correspond to set-theoretic bundles (and thus to Dirichlet polynomials). Section 4: We define the rig homomorphism h : Dir → Rect that we wish to study, whose codomain is a rig encoding the weighted geometric mean; we prove some useful computational results and give some explicit examples. Section 5: We define the entropy H(d) of a Dirichlet polynomial using the classical notion of Shannon entropy; we give some explicit examples; we prove the main result of this paper (Theorem 1), relating entropy to the rig homomorphism defined in the previous section. Section 6: We try to provide some intuition for the image h(d) of a Dirichlet polynomial under the rig homomorphism, in terms of coding schemes. Section 7: We generalise Theorem 1 to the case of cross-entropy, or Kullback-Leibler divergence.

Prerequisites
We assume that the reader is familiar with the fundamentals of category theory, such as functors, natural transformations, and (co)products, as covered, for example, in [2] (Chapter 1).

Dirichlet Polynomials and Bundles
This section is simply a brief summary of content from [1], repeated here for the convenience of the reader. Definition 1. A Dirichlet polynomial d in one variable y is a function of the form d(y) = a n n y + . . . + a 2 2 y + a 1 1 y + a 0 0 y for some n, a 0 , . . . , a n ∈ N.
The set of Dirichlet polynomials is clearly closed under addition, and further under multiplication (using the distributive law along with the fact that m y · n y = (m · n) y ). In fact, it has the structure of a rig: a "ring without negatives" (or, to be pedantic, a monoid object in commutative monoids). We denote this rig by Dir, where the additive unit is 0, and the multiplicative unit is 1 y .
Note that we can embed N as a sub-rig of Dir, by a → a · 1 y ; we often use this fact and simply write a ∈ Dir.
Following [1], we can think of Dirichlet polynomials as functors FSet op → FSet, where FSet is the category of finite sets. Indeed, given a natural number n ∈ N, the exponential n y can be thought of as the Yoneda embedding of the set with n elements, i.e., where n = {1, . . . , n}. (For typographical convenience, we sometimes use the notation n and n interchangeably. In particular, we write e.g., d(0) instead of d(0)). Then the addition of exponentials corresponds to the coproduct of the corresponding representable functors (and so multiplication by a natural number a n corresponds to the a n -fold coproduct of the representable functor with itself). This means that evaluating a Dirichlet polynomial at some natural number n corresponds to evaluating the corresponding functor on the finite set n.
Note that 0 y is not the initial object 0, since i.e., 0 y = 0.

Definition 2.
A morphism ϕ : d → e of Dirichlet polynomials is a natural transformation of (contravariant) functors. Denote by Dir the category of Dirichlet polynomials (thought of as functors FSet op → FSet), and by Dir (d, e) the set of all morphisms d → e.
When we think of Dirichlet polynomials as functors FSet op → FSet, addition is given by the coproduct (disjoint union of sets), and multiplication by the product (Cartesian product of sets). This means that working with Dirichlet polynomials in Dir really is like working with polynomials, in the sense that addition and multiplication are exactly "as expected". Example 2. The only slightly confusing aspect of multiplication in Dir is how 0 y behaves (since 0 y = 0): if d(y) is a Dirichlet polynomial, then d(y) · 0 y = d(0) · 0 y , as follows from the aforementioned fact that 0 n is zero for n = 0, and 1 for n = 0.
There is a more geometric interpretation of objects of Dir as set-theoretic bundles, i.e., (isomorphism classes of) functions E → B, where E, B ∈ FSet, given as follows: to the Dirichlet polynomial d : For example, to the polynomial d(y) = 4 y + 4 · 1 y , we associate the bundle Note that bundles also form a rig, where the sum is given by disjoint union of sets, and the product is given by the Cartesian product of sets, both on the base and the total space. Further, the equivalence between Dirichlet polynomials and bundles respects the rig structures (cf. [1], Theorem 4.6). Because of this, we often switch freely between thinking of Dirichlet polynomials as functors d : FSet op → FSet, as bundles π d : d(1) → d(0), and simply as functions of the form ∑ n j=0 a j · j y .

Example 3.
We can draw the bundle corresponding to (2 y + 1) · (2 y + 1) as follows: Proof. This follows from the fact that n 0 = 1 and n 1 = n, for all n ∈ N.
Using the fact that the sum of bundles is given by the disjoint union of sets, we can use this above definition to write any Dirichlet polynomial d as Proof. This is, again, simply the fact that n 1 = n for all n ∈ N. Lemma 2. A morphism ϕ : d → e of Dirichlet polynomials is exactly a morphism of the corresponding bundles, i.e., functions ϕ 0 : d(0) → e(0) and ϕ 1 :

commutes.
Proof. This statement forms a specific part of ([1], Theorem 4.6), but the proof is simple enough that we give a direct version here. Writing d(y) := ∑ i∈d(0) d[i] y and e(y) := ∑ i∈e(0) e[i] y , we see that (the first isomorphism is by the universal property of the coproduct; the second isomorphisms is the Yoneda lemma). However, an element of this set is exactly a bundle morphism: , and ϕ 0 is given by the the choice of j for each i.

Definition 4.
Given Dirichlet polynomials d, e ∈ Dir such that d(0) = e(0), we denote by Given the correspondence between Dirichlet polynomials and bundles, we might rightly ask why we should prefer to work with the former over the latter. For one possible answer to this, see Section 6.

Bundles as Empirical Distributions
The interpretation of Dirichlet polynomials as bundles helps us to understand how they relate to probability theory. Imagine flipping a coin eight times and observing five heads and three tails; we refer to "heads" and "tails" as outcomes, and each of the eight flips as draws; every draw has an associated outcome.
Consider some bundle π d : d(1) → d(0). We can think of d(0) as the set of outcomes, and d(1) as the set of draws; the fibre π −1 d (x) over an outcome x ∈ d(0) corresponds to all the draws that lead to the outcome x, and so we obtain a probability distribution on Conversely, any rational distribution (i.e., a distribution such that all probabilities are rational numbers ) on a finite set arises in this way: take the finite set as the set of outcomes; take the least common multiple of the denominators of all the probabilities as the cardinality of d(1); and then take P(X = x) · |d(1)| many elements of d(1) to be in the fibre of x ∈ d(0).

Example 4. Consider the set S
Define the sets d(0) = 4 and d(1) = 30, and define the function π d : d(1) → d(0) by Then the empirical probability distribution on the bundle π : d(1) → d(0) agrees exactly with the given distribution on S. As a Dirichlet polynomial, this bundle is given (up to relabelling the outcomes) by d(y) := 15 y + 6 y + 5 y + 4 y .
Note that any multiple m y · d(y) of d (for m 1) will correspond to the same probability distribution as d itself, but to a different empirical distribution, since it will have m times as many draws.
Under this interpretation of Dirichlet polynomials as empirical distributions, multiplication d · e corresponds to taking the product distribution.
Remark 1. For any d ∈ Dir, and any n ∈ N, we can give |d(n)| a combinatorial interpretation: it is the number of ways of choosing n indistinguishable (in the sense that they have the same outcome) draws, i.e., the number of length-n lists of elements of d[i] for some i ∈ d(0).
To see this, note that d(n) = Dir (n y , d) (by Yoneda), and so d(n) is in bijection with the set of bundle morphisms (ϕ 0 , ϕ 1 ) : (n → 1) → (d(1) → d(0)), which are given exactly by choosing n (possibly repeated) elements of d(1) that all lie in the same fibre (namely the fibre above the point specified by ϕ 0 (1)).

Remark 2.
Although we deal only with finite sets and rational probability distributions here, it seems likely that one could follow the methods of [3] and consider colimits of these to obtain analogous results for arbitrary probability distributions on discrete measurable spaces.

Area and Width
Definition 5. Define the rig Rect as follows. The underlying set is R 0 × R 0 . The multiplicative structure has unit (1, 1), and is given by component-wise multiplication: The additive structure has unit (0, 0), and is given by real-number addition in the first component, and by weighted geometric mean in the second component: Given an element (A, W) in Rect, we call A its area and W its width.
The fact that Rect is indeed a rig follows from the fact that its multiplication distributes over its addition: Proposition 1. There exists a unique rig morphism h : Dir → Rect for which h : n y → (n, n).
Proof. Since every Dirichlet polynomial is just a sum of exponentials, a rig homomorphism is fully determined by its action on exponentials, since it must respect addition. So we just need to show that h does indeed extend to a rig homomorphism, but this follows from the fact that m y · n y = (m · n) y .

Definition 6. Given a Dirichlet polynomial d, we define its area A(d) and its width W(d) to be given by the components of h(d) = (A(d), W(d)).
With this definition, along with Proposition 1, we see that A(n y ) = W(n y ) = n. Lemma 3. Let d ∈ Dir and a ∈ N. Then 1.

Proof. Recall that addition (and thus scalar multiplication) in
Rect involves the weighted geometric mean in the second component. Then which proves (i). For (ii) and (iii), since h is a rig homomorphism (and thus respects addition), it suffices to consider the case where d is an exponential, say d(y) = n y . However, then h(a · d) = a · h(d) = a · (n, n) = an, n n n n . . . n n a times 1/an = an, n an 1/an = (an, n) i.e., A(a · d) = aA(d) and W(a · d) = W(d), as claimed.

Corollary 2. Let d ∈ Dir. Then
Proof. Write d(y) = a n · n y + . . . + a 1 · 1 y + a 0 · 0 y . Using Lemma 3, along with Definition 5, we see that h(d) = (a n n, n) + (a n−1 (n − 1), n − 1) + . . . + (a 1 , 1) By Lemma 1, the first component (i.e., A(d)) is equal to |d(1)|; by the same lemma, we can also rewrite the second component (i.e., W(d)) as so it simply remains to justify why this is equal to is exactly the data of an endomorphism of each fibre of π d : d(1) → d(0); since there are a i fibres of size i, endomorphisms of these fibres are in bijection with the a i -fold product of i i , which is equal to i a i i , whence the claim. Corollary 3. Let d ∈ Dir. Then the width W(d) is an algebraic number, i.e., the image of h : Dir → Rect lies in the sub-rig whose underlying set is N × Q 0 , where Q is the algebraic closure of Q.
Proof. By Corollary 2, both W(d) A(d) and A(d) are equal to the cardinality of some sets, and thus integer.

Example 5.
Reassuringly, if we start with a "rectangle", then the area and width are exactly what we might expect. More concretely: consider d(y) = a · n y for some a, n ∈ N; then, by Corollary 2, and, by direct calculation, W(d) = n.
Comparing this to the picture of a · n y , we can explain why we chose the terminology "width" and "area": Indeed, the area is exactly the number of dots in the (upper) rectangle, and the width is its width.
However, this picture now leads us to consider the question of whether or not there is a good meaning we can give to the "length" of this rectangle (which, here, should be equal to a). Indeed, this has been our motivation all along; we will return to this question in Example 9. Example 6. The fact that d(1) is "rectangular" in Example 5 makes the terminology look like a numerical coincidence, but we can try to hone our intuition of what this really "means" by considering another example.
Let's consider d(y) = 4 y + 4, which has area A(d) = d(1) = 8. We can calculate its width by using the fact that h(4 y ) = (4, 4) h(4) = h(1 y ) + h(1 y ) + h(1 y ) + h(1 y ) = (4, 1) whence h(d) = (4, 4) + (4, 1) = 8, (4 4 1 4 ) and so W(d) = 2. How, then, does the rectangle with area 8 and width 2 relate to our Dirichlet polynomial d(y) = 4 y + 4? That is, what is the process that takes us from d to 4 · 2 y ? Looking at the pictures of the bundles, we see that the width tells us how our bundle would look if we had the same set (d(1)) of draws, but with different outcomes, now all equally likely: Note that, in order to have equally sized fibres, we needed to have 4 outcomes, not 5 (since 8/2 = 4). We make this idea more precise (as well as explain why the rectangle is of size 4 × 2 instead of 2 × 4) in Section 6.

Example 7.
We have just seen that d(y) = 4 y + 4 has W(d) = 2 and A(d) = 8, but now let's look at an example where the numbers don't divide so neatly.
Let d(y) = 4 y + 3. Of course, now we can't draw a nice rectangle representing the evenly distributed bundle as we did in Example 6 for 4 y + 4, since we would have to have an outcome set of size 7/2.21 ≈ 3.17 elements, with fibres all of size 2.21, but this should come as no surprise, since 7 is prime. One might be tempted to solve this problem using groupoid cardinality (cf. [4]), but there are some technical issues here.

Length
Remark 3. We write log to mean log 2 .
We then define its length (also called the perplexity) L(d) by Readers might recognise H(d) as being the Shannon entropy of the corresponding probability distribution (cf. [5]). The convention for naming the sides of a rectangle is from [6] (Figure 1). length width width In terms of distributions, this corresponds to the fact that the unique probability distribution on a single outcome has entropy equal to 0 (and so the same is true for any empirical distribution on a single outcome).

Example 9.
Continuing on from Example 5, we can calculate the entropy of a uniform distribution on a many outcomes d(y) = a · n y as H(a · n y ) = − ∑ i∈a n an log n an = − log 1 a = log a whence L(a · n y ) = 2 log a = a, exactly as desired.
As for d(y) = 4 y + 3, recall that its area is A(d) = 7 and its width is 2 7 √ 2. Now, its entropy is and so its length is Note that, in the above example, even though both L(d) and W(d) have non-integer values, the formula implied by our choice of nomenclature still holds: the area A(d) is equal to the length L(d) times the width W(d). This leads us to our main theorem. It says that the Shannon entropy, which is only homomorphic in products of distributions, can be computed in terms of the width and area, which together are homomorphic in both sums and products of distributions. We will explain this in more detail in Section 6.
Theorem 1. For all d ∈ Dir, we have the rectangle-area formula Proof. In the following, we omit absolute value signs, writing e.g., d(1) instead of |d(1).
First, write Now we can rewrite the length as The numerator is then

Interpreting Area, Length, and Width
We have mentioned many times that Dirichlet polynomials are equivalent to settheoretic bundles, so the natural question to ask is "why, then, should we work with the former instead of the latter?". One answer to this is question is the fact that entropy does not respect bundle morphisms: we cannot functorially assign a morphism between entropies to morphisms, since we are working with arbitrary morphisms of bundles. (If, however, we restrict to only morphisms given by pushforward, then [7] tells us (via Faddeev's theorem) that the only possible functorial definition of entropy is given by the relative entropy, i.e., the difference of the entropies of the source and the target). This makes it seem rather bad to work with a category (such as that of bundles) instead of simply a rig (such as that of Dirichlet polynomials). Of course, this isn't an entirely satisfactory answer, since we do care about the notion of morphisms for Dirichlet polynomials (for example, Corollary 2 tells us that the width can be expressed in terms of the number of certain morphisms). In light of Theorem 1, however, we might consider the following possibility: both area and length can be expressed in terms of d(0), d(1), and d[i] (for i ∈ d(0)), and we could define the width by W(d) := A(d)/L(d).
A better answer to this question might be the following: the rig homomorphism h : Dir → Rect is incredibly simple, since it just maps n y to (n, n); from this computationally simple homomorphism, however, we can recover entropy (as log(A(d)/W(d))), without making any reference to the classical equation that defines it ("negative the sum of probabilities of the log of the probabilities"), but instead relying on the fact that Rect encodes the weighted geometric mean. That is, H(d) is only homomorphic in the product

Cross Entropy
Everything above can be viewed as a specific example of the analogous cross notions. That is, given two Dirichlet polynomials, we can define their cross area, cross width, etc. as follows. By definition, X(d, d) = X(d) for X ∈ {A, W, H, L}. That is, just as cross entropy is a generalisation of entropy, the notions of cross width, etc., generalise the notions of width, etc.

Remark 4.
Note that we can recover the notion of relative entropy (also known as Kullback-Leibler divergence) D KL (p q), as studied in [9], from cross entropy: H(d, e) = H(d) + D KL (p q) (which can also be seen to justify the fact that H(d, d) = H(d)).

Remark 5.
Although we have some idea of how to understand these cross notions (e.g., cross area can be understood as the number of "actual" draws, when we think of d as being a potentially inaccurate model for e), the choice of definitions in Definition 8 was chosen simply so that 1.
we recover the "uncrossed" notions when we take d = e, and 2.