Channel-Supermodular Entropies: Order Theory and an Application to Query Anonymization

This work introduces channel-supermodular entropies, a subset of quasi-concave entropies. Channel-supermodularity is a property shared by some of the most commonly used entropies in the literature, including the Arimoto–Rényi conditional entropies (which include Shannon and min-entropy as special cases), k-tries entropies, and guessing entropy. Based on channel-supermodularity, we define new preorders for channels that strictly include degradedness and inclusion (or Shannon ordering), and we show that these preorders provide a sufficient condition for the more-capable and capacity orderings, not only for Shannon entropy but also with respect to analogous concepts for other entropy measures. The theory developed is then applied in the context of query anonymization. We introduce a greedy algorithm based on channel-supermodularity for query anonymization and prove its optimality, in terms of information leakage, for all symmetric channel-supermodular entropies.


Introduction
The idea of preorders over channels goes back a long way in the history of information theory. For instance, in [1], Shannon introduced the "inclusion" preorder to compare the capacities of discrete memoryless channels. Several authors, such as El Gamal [2], Korner and Marton [3], and many more, made further significant contributions to the study of channel preorders.
Such preorders are of practical importance in information theory. For example, the "more capable" preorder [3] is used in calculating the capacity region of broadcast channels [2] and in deciding whether one system is more secure than another [4,5]. As discussed in the book by Cohen, Kempermann, and Zbaganu [6], the applications of preorders over stochastic matrices go beyond the field of information theory, for instance, to statistics, economics, and population sciences.
In this work, which is an extension of the results in our previous work [7], we introduce a new preorder over channels. To illustrate the key idea, consider the following channel: Now build a new channel from it as follows: take the first two columns and, for each row, rearrange the pair of entries so that the larger element moves to the first column and the smaller element to the second column. This yields:
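For concreteness, the column rearrangement just described can be sketched in Python (the helper name and the example channel are ours, for illustration only):

```python
def join_meet_columns(K, i, j):
    """Return a copy of the channel matrix K in which, for every row, the
    larger of the entries in columns i and j is moved to column i and the
    smaller to column j."""
    K2 = [row[:] for row in K]
    for row in K2:
        row[i], row[j] = max(row[i], row[j]), min(row[i], row[j])
    return K2

K = [[0.5, 0.2, 0.3],
     [0.1, 0.6, 0.3]]
print(join_meet_columns(K, 0, 1))  # → [[0.5, 0.2, 0.3], [0.6, 0.1, 0.3]]
```

Note that each row keeps the same multiset of entries, so the result is still a row-stochastic matrix, i.e., still a channel.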

Notational Conventions
Throughout the paper, X, Y, Z, . . . represent discrete random variables with (nonempty, finite) alphabets X , Y, Z, . . . . We assume that the elements of each alphabet are ordered, denoting by x 1 , . . . , x |X | the elements of X , by y 1 , . . . , y |Y | the elements of Y, and so on. Given x i ∈ X , we write p(x i ) or p i to mean Pr{X = x i }, and use p to refer to the (categorical) distribution (as a vector). We may specify the r.v. with a subscript, for example, writing p X (x), if it is not clear from the context.
We denote by ∆ n ⊂ R n the (n − 1)-dimensional probability simplex. Given a probability distribution p over {x 1 , . . . , x n }, we overload the notation and use p to refer to its probability vector (p 1 , . . . , p n ) ∈ ∆ n . We write (p [1] , p [2] , . . . , p [n] ) for the nonincreasing rearrangement of p = (p 1 , . . . , p n ), that is, p [1] denotes the largest element of p, p [2] the second largest, and so forth. Given a vector r = (r 1 , . . . , r n ) ∈ R n , we denote by ‖r‖ α its α-norm (∑ i r i α ) 1/α (note that this is a slight abuse of nomenclature, since it is not a norm when α < 1). Given a function F over ∆ n and a random variable X with distribution p = (p 1 , . . . , p n ), we use F(X), F(p 1 , . . . , p n ) and F(p) interchangeably. A channel K is a row-stochastic matrix with rows indexed by X and columns indexed by Y. The value K(x, y) is equal to p(y|x) = Pr{Y = y | X = x}, that is, the conditional probability that y is produced by the channel K when x is the input value. The notation K : X → Y means that the channel K has X and Y as input and output alphabets, respectively.
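A small Python sketch of the two notational devices above (function names are ours):

```python
def nonincreasing(p):
    """The nonincreasing rearrangement (p_[1], p_[2], ...) of p."""
    return sorted(p, reverse=True)

def alpha_norm(r, alpha):
    """(sum_i r_i**alpha)**(1/alpha); only a true norm when alpha >= 1."""
    return sum(x ** alpha for x in r) ** (1.0 / alpha)

p = (0.2, 0.5, 0.3)
print(nonincreasing(p))   # → [0.5, 0.3, 0.2]
print(alpha_norm(p, 2))   # the 2-norm of p
```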
Channels are represented in a table format, or simply in the matrix notation, for example: with the understanding that the ith row corresponds to x i , and the jth column to y j .

Core-Concave Entropies
The main result of this work is the monotonicity of conditional entropy with regard to the Join-Meet operator, which will be introduced in Section 3. This holds not only for Shannon entropy, but also for a number of different entropy measures, including the Arimoto-Rényi entropies (which include Shannon and min-entropy as limit cases) [43] and the guessing entropy [44].
Most of the results in this paper concern the aforementioned entropies, which are instances of what we call channel-supermodular entropies. To define a channel-supermodular entropy, however, we first need a generalizing framework. To this end, we introduce the core-concave entropies, based on the framework introduced in [28]. Besides the entropies mentioned above, they also include the Tsallis [45] and Sharma-Mittal [46] entropies.

Definition 1.
A "core-concave" entropy H is a pair (η, F) such that:
• F : ∆ n → R is a concave and continuous function;
• η is a strictly increasing continuous real-valued function, defined on the image of the function F.
Given a core-concave H = (η, F), we define H(X) = η(F(p X )). The set of core-concave entropies will be denoted by H.
Besides the value H(X), which we refer to as the unconditional form, we also define a conditional form for the core-concave entropies, with respect to two random variables X, Y. As mentioned above, core-concave entropies encompass the most common entropies in the literature. Some of these are summarized in Table 1, together with their conditional forms. Notice that H is used to denote an arbitrary core-concave entropy, while H 1 refers to Shannon entropy.

Table 1. Commonly used core-concave entropies: for each, the components η(r) and F(p) and the corresponding conditional form H(X|Y); its rows include, among others, the Hayashi-Rényi [47] (H α ) and Tsallis [45] (H (α,α) ) entropies.

Based on Definitions 1 and 2, we define a notion of mutual information and channel capacity for each H ∈ H.

Definition 3. Let H ∈ H. The H-mutual information is defined as

I H (X; Y) = H(X) − H(X|Y).

When X and Y are respectively the input and output of a channel K, the H-channel capacity of K is defined as

C H (K) = sup p X I H (X; Y).

Notice that, in general, H-mutual information is not symmetric. Core-concave entropies satisfy the data-processing inequality.
Theorem 1 ([28] Proposition 2(b)). Let X, Y, Z be random variables such that X → Y → Z (i.e., X and Z are conditionally independent given Y). Then, for all H ∈ H, I H (X; Z) ≤ I H (X; Y).

Preorders over Channels
Let K 1 : X → Y, K 2 : X → Z be channels which share an input X and produce outputs Y, Z.
A channel K 2 is degraded from K 1 [8], written as K 1 ≥ d K 2 , if there exists a channel R : Y → Z such that K 2 = K 1 R.

In [1], Shannon introduced a preorder which includes the one above. A channel K 1 includes K 2 , written as K 1 ≥ sh K 2 , if there exists a family of tuples {(g i , T i , R i )} i of channels T i , R i and non-negative real numbers g i summing to 1 such that

K 2 = ∑ i g i T i K 1 R i .    (1)

As noted by Shannon, any channel can be expressed as a convex combination of deterministic channels. Thus, whenever K 1 ≥ sh K 2 , it is possible to choose {(g i , T i , R i )} i such that T i , R i are deterministic channels and (1) holds. The two preorders defined above do not depend on a particular entropy but only on the structure of the channels. The next preorders depend on the choice of a core-concave entropy H and generalize preorders introduced in [3].
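A minimal numeric sketch of degradedness (in Python; the matrices are ours, chosen for illustration): post-processing K 1 by any channel R yields, by construction, a channel K 2 degraded from K 1.

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def is_channel(M, tol=1e-9):
    """Row-stochastic check: nonnegative entries, rows summing to 1."""
    return (all(x >= -tol for row in M for x in row)
            and all(abs(sum(row) - 1) < tol for row in M))

K1 = [[0.8, 0.2],   # an illustrative binary channel
      [0.3, 0.7]]
R  = [[0.9, 0.1],   # an arbitrary post-processing channel
      [0.2, 0.8]]
K2 = matmul(K1, R)  # K2 is degraded from K1 by construction
print(K2)           # entrywise ≈ [[0.76, 0.24], [0.41, 0.59]]
```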
A channel K 1 is H-less noisy than K 2 , denoted as K 1 ≥ H ln K 2 , if for all random variables U with finite support such that U → X → (Y, Z), I H (U; Y) ≥ I H (U; Z). K 1 is H-more capable than K 2 , written as K 1 ≥ H mc K 2 , if for all distributions of the input X, I H (X; Y) ≥ I H (X; Z). If A ⊂ H is a subset of core-concave entropies, K 1 ≥ A ln K 2 is defined as: K 1 ≥ H ln K 2 for all H ∈ A. This is similar for ≥ A mc and ≥ A c .

Relationships between Preorders
We now explore some relationships between the preorders defined above.

Proposition 1.
For any H ∈ H:

K 1 ≥ d K 2 ⟹ K 1 ≥ H ln K 2 ⟹ K 1 ≥ H mc K 2 ⟹ K 1 ≥ H c K 2 .

Proof. The first implication follows from Theorem 1, the second implication follows by choosing U = X in the definition of ≥ H ln , and the third is straightforward.

While the converses of the above implications are false, the following is true: The following important theorem can be traced back to Blackwell's result on comparison of experiments [21] and relates to more recent results in [20,48].

Channel-Supermodular Entropies
Note that any specific core-concave entropy H induces an H-more capable preorder over channels. However, this preorder might not be preserved for a different choice of conditional entropy. Theorem 3 characterizes a channel preorder that is "consistent" for all core-concave entropies. As strong as this result is, it leaves open the question of whether there exists a preorder that is consistent for a class of entropies of interest. This question is motivated by the fact that the class of core-concave entropies includes far more entropies than the conventionally used ones, so it may include some eccentric ones that can be excluded in exchange for a stronger result. Moreover, the degradedness relation between channels is very restrictive: there are many channels that cannot be compared with respect to degradedness, yet are consistently ordered with respect to all typically used entropies.
With these motivations in mind, we introduce channel-supermodular entropies. Channel-supermodularity is a property satisfied by a significant portion of commonly used entropies, and it is a helpful tool in optimization problems (as shown in Section 6.1). The characterization of channel-supermodular entropies is linked to supermodular functions over the real lattice. These functions and some of their basic properties are introduced next. For details about supermodular functions, please refer to [49] and [50] (Chapter 6.D).
Consider the set R n ≥0 of all n-dimensional vectors with no negative entries (i.e., the non-negative orthant of R n ). Let ⪯ represent the element-wise inequality, that is, given r = (r 1 , . . . , r n ) and s = (s 1 , . . . , s n ), r ⪯ s iff r i ≤ s i for all i.
Recall that (R n ≥0 , ⪯) is a lattice: the join r ∨ s and the meet r ∧ s of two vectors are their element-wise maximum and minimum, respectively. Next, we introduce some fundamental definitions for this work:

Definition 4. A function φ : R n ≥0 → R is supermodular if, for all r, s ∈ R n ≥0 ,

φ(r ∨ s) + φ(r ∧ s) ≥ φ(r) + φ(s).

Definition 5.
Let H = (η, F) be a core-concave entropy. Define the function G F : R n ≥0 → R as

G F (r) = ‖r‖ 1 F(r/‖r‖ 1 )

if r is not the null vector, and G F (0, . . . , 0) = 0. (Notice that, as F is continuous over a compact set, lim r→(0,...,0) G F (r) = 0.)

Definition 6. An entropy H = (η, F) ∈ H is said to be channel-supermodular if G F is supermodular. The set of channel-supermodular entropies is denoted by S ⊂ H.
The motivation for defining channel-supermodularity in terms of G F might seem arbitrary, but it is justified by its relationship with conditional entropies, given by

∑ y p(y) F(p X|y ) = ∑ y G F (J y ),    (2)

where J y denotes the column of the joint matrix of (X, Y) corresponding to output y. Together with (2), the supermodularity of G F is a powerful tool for deriving results regarding conditional entropy and mutual information for entropies in S.
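The identity relating G F to conditional entropies can be checked numerically; the sketch below (our own, using Shannon's F as the example and a small illustrative joint matrix) evaluates both sides on the columns of a joint matrix:

```python
from math import log2

def F_shannon(p):
    """Shannon's F: the entropy of a distribution (here eta is the identity)."""
    return -sum(x * log2(x) for x in p if x > 0)

def G_F(r, F=F_shannon):
    """G_F(r) = ||r||_1 * F(r / ||r||_1), with G_F(0) = 0."""
    s = sum(r)
    return 0.0 if s == 0 else s * F([x / s for x in r])

# Joint matrix J[x][y] = p(x, y); its columns J_y play the role of r above.
J = [[0.2, 0.1],
     [0.3, 0.4]]
cols = [list(c) for c in zip(*J)]
p_y = [sum(c) for c in cols]
lhs = sum(G_F(c) for c in cols)                                   # sum_y G_F(J_y)
rhs = sum(py * F_shannon([x / py for x in c]) for py, c in zip(p_y, cols))
print(abs(lhs - rhs) < 1e-12)  # → True: both sides agree
```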

Examples of Channel-Supermodular and Non-Channel-Supermodular Entropies
In the next sections, we will study the implications of channel-supermodularity. The inequality in Definition 4 implies interesting behaviors regarding H-mutual information, as will be seen in Section 4. This property has immediate consequences for channel ordering and channel design, as will be explored in Sections 5 and 6.
One appealing aspect of channel-supermodularity is that some of the most commonly used entropies in the literature belong to S, including Shannon and min-entropy, and more generally the Arimoto-Rényi entropies. In this section, we prove that these (and other entropies) indeed belong to S, and provide examples of entropies that do not. First, we state a useful characterization of supermodular functions, which is an immediate consequence of Corollary 2.6.1 in [49].
Let φ : R n ≥0 → R and let e 1 , . . . , e n denote the canonical basis of R n . The function φ is supermodular if and only if, for all r ∈ R n ≥0 , all δ 1 , δ 2 ≥ 0 and all i, j with i ≠ j,

φ(r + δ 1 e i + δ 2 e j ) + φ(r) ≥ φ(r + δ 1 e i ) + φ(r + δ 2 e j ).    (3)

Moreover, if φ has second partial derivatives, φ is supermodular if and only if, for all r ∈ R n ≥0 and all i, j with i ≠ j,

∂ 2 φ / ∂r i ∂r j ≥ 0.    (4)

The property characterized by Equations (3) and (4) is known in the economics literature as increasing differences [49]. This name reflects the fact that the effect of an increase in one coordinate on the value of φ is monotonically increasing with regard to the other coordinates. This is readily noticeable if we rearrange the terms of (3):

φ(r + δ 1 e i + δ 2 e j ) − φ(r + δ 2 e j ) ≥ φ(r + δ 1 e i ) − φ(r).

That is, the change of φ prompted by an increase of δ 1 in coordinate i is greater the greater the value of coordinate j. Equation (3) is thus just the statement that, on the lattice (R n ≥0 , ⪯), increasing differences and supermodularity are equivalent concepts ([49] Corollary 2.6.1).
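As an illustration of this characterization, the following sketch (our own) numerically checks the increasing-differences inequality for the G F induced by min-entropy, namely G F (r) = −max i r i ; the check is non-exhaustive, over random points:

```python
import itertools
import random

def G_min_entropy(r):
    # For min-entropy, F(p) = -max_i p_i, hence G_F(r) = -max_i r_i.
    return -max(r)

def increasing_differences(phi, r, i, j, d1, d2, tol=1e-12):
    """Inequality (3), difference form: raising coordinate i gains at least
    as much after coordinate j has already been raised."""
    def bump(v, k, d):
        w = list(v); w[k] += d; return w
    lhs = phi(bump(bump(r, i, d1), j, d2)) - phi(bump(r, j, d2))
    rhs = phi(bump(r, i, d1)) - phi(r)
    return lhs >= rhs - tol

random.seed(0)
ok = all(
    increasing_differences(G_min_entropy,
                           [random.random() for _ in range(4)],
                           i, j, random.random(), random.random())
    for _ in range(1000)
    for i, j in itertools.permutations(range(4), 2)
)
print(ok)  # → True (a numeric, non-exhaustive check)
```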
Using this result, as well as appealing directly to Definition 4, we now prove channel-supermodularity for a number of commonly used entropies. Throughout, given r ∈ R n ≥0 , we denote by r i its ith coordinate.

Proposition 2.

1. Shannon entropy is channel-supermodular.

2. The Arimoto-Rényi entropies are channel-supermodular.

3. For any k, the k-tries entropy is channel-supermodular. In particular, min-entropy is channel-supermodular.

4. Guessing entropy is channel-supermodular.

Proof. See Appendix A.
Items 3 and 4 of Proposition 2 are of particular interest to security applications, as guessing entropy and k-tries entropies have found interesting applications in the field of quantitative information flow.
Guessing entropy is especially useful in scenarios modelling brute-force attacks, as it captures the expected number of attempts an adversary needs to obtain the value of a secret when trying candidate values one by one. The k-tries entropy, on the other hand, reflects the probability of guessing a value correctly when k guesses are allowed (see e.g., [19] (Section III.C)). It is defined as

H k-tries (X) = − log ( p [1] + · · · + p [k] ),

and can be readily seen to be core-concave by taking η(x) = − log(−x) and F(p) = −( p [1] + · · · + p [k] ). Notice that min-entropy is equal to H k-tries when k = 1. The results in Proposition 2 justify our interest in channel-supermodularity, as any property derived for entropies in S will also hold for this set of commonly used entropy measures. However, not all entropies are channel-supermodular.
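The k-tries entropy is straightforward to compute; a short Python sketch (function name ours, logarithm taken base 2):

```python
from math import log2

def k_tries_entropy(p, k):
    """H_k-tries(X) = -log2 of the sum of the k largest probabilities, i.e.,
    minus the log of an adversary's chance of guessing X within k tries."""
    return -log2(sum(sorted(p, reverse=True)[:k]))

p = (0.5, 0.25, 0.125, 0.125)
print(k_tries_entropy(p, 1))  # → 1.0 (min-entropy: -log2 0.5)
print(k_tries_entropy(p, 2))  # -log2 0.75
```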
This includes another interesting entropy family useful in security, the partition entropies [19]. Let P be a partition of the set {1, . . . , n}. The partition entropy with regard to P is given by

H P (X) = − log max A∈P ∑ i∈A p i .

It is easy to see that H P is core-concave, by taking η(x) = − log(−x) and F(p) = − max A∈P ∑ i∈A p i . The partition entropy H P is useful for capturing the uncertainty of an adversary who is interested in knowing only to which subset the realization of X belongs. This is an appropriate model for adversaries who are interested in obtaining some specific partial knowledge about sensitive information (e.g., obtaining the home town or the date of birth of a user).
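A quick Python sketch of the partition entropy (function name and the toy partition are ours):

```python
from math import log2

def partition_entropy(p, partition):
    """H_P(X) = -log2 max_{A in P} sum_{i in A} p_i: the uncertainty of an
    adversary who only wants to learn which block contains X."""
    return -log2(max(sum(p[i] for i in block) for block in partition))

# Example: four values split into two blocks (e.g., two home towns).
p = (0.4, 0.1, 0.3, 0.2)
P = [{0, 1}, {2, 3}]
print(partition_entropy(p, P))  # → 1.0, since both blocks have mass 0.5
```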

Proposition 3.

1. Hayashi-Rényi, Tsallis and Sharma-Mittal entropies (with conditional forms as in Table 1) are not channel-supermodular for any α > 1 whenever the input set is of size greater than 2. Moreover, they are also not channel-supermodular for any α ∈ (0, 1) for some size of the input set.

2. The partition entropy is not, in general, channel-supermodular.

Proof. See Appendix A.
Notice that, for some choices of P, H P is channel-supermodular. In particular, if P = {{i} | i ∈ {1, . . . , n}}, H P coincides with min-entropy.
A generalization of partition entropies is given by the weighted partition entropies, in which each block A ∈ P is assigned a weight w A , where w = {w A } A∈P is a set of weights. Being a generalization of partition entropies, weighted partition entropies are also not channel-supermodular in general.

The Join-Meet Operator and a New Structural Order
In this section we address the claim made in Section 1, proving that the Join-Meet operation is monotonic with regard to conditional entropy for all channel-supermodular entropies.
Let K : X → Y be a channel, with Y = {y 1 , . . . , y m }, and let K i be the column of K corresponding to output y i . Define, for i ≠ j, the Join-Meet operator 3 i,j as follows: 3 i,j K is the channel obtained from K by replacing column K i with the element-wise maximum K i ∨ K j and column K j with the element-wise minimum K i ∧ K j . The next result proves that the Join-Meet operator is monotonic with regard to I H if H ∈ S.
Theorem 4. For all channels K 1 and all i, j, K 1 ≥ S mc 3 i,j K 1 .
Proof. Let H = (η, F) ∈ S and define G F as in Definition 5. Let K 2 = 3 i,j K 1 , and denote by Y 1 , Y 2 the outputs of K 1 , K 2 . Notice that, for any distribution on the input, the sum ∑ y G F (J y ) computed for K 2 is at least the corresponding sum for K 1 , where the inequality follows from G F being supermodular. From Equation (2) and η being increasing, it follows that H(X|Y 1 ) ≤ H(X|Y 2 ).

In light of Theorem 4, one might wonder whether the Join-Meet operator completely defines S, that is, whether H ∈ S whenever K ≥ H mc 3 i,j K for all channels K and all i, j. In fact, an even stronger statement can be made by only considering a subset of channels.

Definition 7. Let K (k,l, 1 , 2 ) denote the channel with input alphabet {x 1 , . . . , x n } and output alphabet {y 1 , y 2 }, given by

Theorem 5. Let H ∈ H. If K ≥ H mc 3 1,2 K for every channel K of the form given in Definition 7, then H ∈ S.

Proof. We prove the contrapositive. Suppose that H = (η, F) ∉ S. Then, from (3), there are r = (r 1 , . . . , r n ) ∈ R n ≥0 , indices i, j ≤ n with i ≠ j, and δ 1 , δ 2 > 0 for which the inequality (3) fails. Let γ = 1/(2‖r‖ 1 + δ 1 + δ 2 ) and define a probability distribution over X accordingly; a direct computation then shows that, as η is strictly increasing, H(X|Y 1 ) > H(X|Y 2 ), which concludes the proof.
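As a sanity check of Theorem 4 for min-entropy (which is channel-supermodular), the following numeric sketch (our own, with an arbitrary prior and channel) uses the Arimoto conditional min-entropy:

```python
from math import log2

def join_meet(K, i, j):
    """Replace columns i, j of K by their element-wise max and min."""
    K2 = [row[:] for row in K]
    for row in K2:
        row[i], row[j] = max(row[i], row[j]), min(row[i], row[j])
    return K2

def cond_min_entropy(prior, K):
    """Arimoto conditional min-entropy: -log2 sum_y max_x p(x) K(x, y)."""
    joint = [[p * k for k in row] for p, row in zip(prior, K)]
    return -log2(sum(max(col) for col in zip(*joint)))

prior = [0.5, 0.3, 0.2]
K1 = [[0.6, 0.4, 0.0],
      [0.1, 0.8, 0.1],
      [0.3, 0.3, 0.4]]
K2 = join_meet(K1, 0, 1)
h1, h2 = cond_min_entropy(prior, K1), cond_min_entropy(prior, K2)
print(h1 <= h2 + 1e-12)  # → True: the Join-Meet channel leaks no more
```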
An immediate consequence of Theorems 4 and 5 is that the Join-Meet operator completely characterizes S.
Corollary 1. Let H ∈ H. Then, H ∈ S if, and only if, K ≥ H mc 3 i,j K for all channels K and all i, j.

A New Structural Ordering
Theorem 4 yields some immediate new results for reasoning about channel ordering since, whenever |X | > 2, the Join-Meet operator is not, in general, captured by the degradedness ordering. Consider, for instance, the following channels K 1 , K 2 (notice that K 2 = 3 1,2 K 1 ). Writing a candidate post-processing channel R in terms of two parameters p, q and computing K 1 R shows that K 2 ≠ K 1 R for any choice of p, q. We formalize this observation in the next result.

Proposition 4.

1. If |X | = 2, then, for all K : X → Y and all i, j, K ≥ d 3 i,j K.

2. If |X | > 2, this is not in general the case.

Proof. We first prove (1). As it is possible to reorder columns by degrading a channel, without loss of generality let i = 1 and j = 2. Fix K : X → Y. Let K kl = K(y l |x k ), and suppose, again without loss of generality, that K 11 ≥ K 12 and K 22 ≥ K 21 . If K 11 = K 12 or K 22 = K 21 , then 3 1,2 K is obtainable by permuting columns of K, and therefore K ≥ d 3 1,2 K. Otherwise, we have K 11 K 22 − K 12 K 21 > 0, and 3 1,2 K = KR, where R is the following channel:

For the proof of (2), it suffices to notice that, whenever |X | > 2, K (1,2,0,0) ≱ d 3 1,2 K (1,2,0,0) . The proof for general |X | is along the same lines as the argument after (5).
Next, we define the channel-supermodularity preorder ≥ s over channels, which is based on the Join-Meet operators 3 i,j . An induced preorder ≥ ds can then be defined by combining ≥ d and ≥ s as follows:

Relations between Preorders for Channel-Supermodular Entropies
Throughout this section, let K 1 : X → Y and K 2 : X → Z. First, note that Proposition 1 and Theorem 2 are still meaningful under S. The next proposition summarizes the relationship between (≥ ds ) and the other preorders.
Proof. See Appendix B.
Proposition 5 (7) can be used to decide whether K 1 ≥ H mc K 2 , for H ∈ S, using only structural properties of the channels. Consider, for example, the following channels K 1 , K 2 .

In [19], the authors claimed to have no proof that K 1 ≥ H 1 mc K 2 . By Proposition 5 (7), K 1 ≥ H 1 mc K 2 can be proven as follows:

Notice that Theorem 3 no longer holds if the class H is replaced by S: for such channels, Proposition 5 (7) implies K 1 ≥ S mc K 2 , and the first result follows. The second result then follows from Proposition 1 and Theorem 2.

Results on Channel Capacity

We can use Theorem 4 to prove similar results for the preorder ≥ shs , which is an extension of ≥ sh with ≥ s .
In the remainder of this section, we prove that K 1 ≥ shs K 2 is a sufficient condition for establishing that both the Shannon and min-capacity of K 1 are at least as large as that of K 2 .
Proof. See Appendix B.

Figure 1 summarizes the implications between orderings. As can be seen, some open questions remain, designated by dotted lines with a question mark. Note that the absence of an arrow means that the implication is known to be false.

Figure 1. The four implication graphs summarize the relations between preorders. A preorder ≥ i implies a preorder ≥ j if and only if there is a path of solid arrows from ≥ i to ≥ j . Preorders that are equivalent are grouped together, and the dotted arrows represent implications whose validity is an open question. When no path is present, the implication is known not to hold.

Channel Design
Core-concavity was originally introduced in [51] in the context of universally optimal channel design, that is, the problem of finding, given some operational constraints, a channel leaking the minimum amount of confidential information (optimality), for all entropy-based measures of leakage (universality). This section shows how core-concavity and channel-supermodularity can be used in this context.
If X and Y are the input and output of a channel and H is a core-concave entropy, then the leakage about X through Y as measured by H is defined to be the H-mutual information I H (X; Y), as in Definition 3.
The concept of leakage is relevant in security/privacy contexts where X is some confidential data, K is modelling a system (e.g., a cryptographic computation or a database query system), and Y some observable generated by the system (e.g., the computation time or the result of a statistical query). Different choices of H correspond to different attacker models, and universal channel solutions identify countermeasures to leakage that are robust with regard to all attackers in that universe.
Minimizing leakage of sensitive information is usually a desirable goal. However, when designing systems, it is often the case that some leakage is unavoidable. With that in mind, some recent works in QIF have aimed at obtaining channels that leak the least amount of information subject to some operational constraints [13,28,29]. From Definition 3, the problem can be rephrased as finding the channel which, subject to some operational constraints, maximizes H(X|Y) or, as η is increasing, maximizes ∑ y p(y)F(p X|y ). In a recent work [29], which considered a generalizing framework for these operational constraints, it was shown that this problem can be solved by convex optimization techniques for a given core-concave H. However, it was also proven that the solution to the problem is in general not universal, that is, the optimal channel given a set of constraints may vary with the choice of H.
Despite this negative result, it was shown in [29] that some classes of problems admit a universal solution. As different entropies model different attackers, these results provide a very strong security guarantee, namely, that the optimal system in these situations is the most secure possible regardless of the attacker model. In the next few sections, we show how channel-supermodularity can be a useful tool in obtaining solutions that, while not universally optimal for all core-concave entropies, are the most secure for all symmetric entropies in S.

Deterministic Channel Design Problem: A Universal Solution by Channel-Supermodularity
In many applications, such as repeated queries, it is either undesirable or impractical to consider a "probabilistic" system. This motivates the study of the channel design problem restricted to deterministic channels, which has been recently investigated in [13].
It was proven in [13] that, similarly to the general channel design problem, the deterministic version does not in general admit a universal solution. Moreover, the problem was also shown to be, in general, NP-hard. However, it was also proven in that work that a specific class of problems, the deterministic complete k-hypergraph design problem, admits a solution that is optimal for all symmetric channel-supermodular entropies. This problem can be defined as follows.

Definition 11.
Let H ∈ H, k ∈ N >0 , and let X , Y be finite sets with |Y| ≥ ⌈|X|/k⌉. The deterministic complete k-hypergraph design problem (CKDP) is to find a channel K : X → Y that maximizes H(X|Y), subject to the following constraints:

• ∀x, y, K(y|x) ∈ {0, 1}, and
• ∀y, |{x ∈ X : K(y|x) = 1}| ≤ k.

That is, the deterministic CKDP is the problem of finding the most secure deterministic channel, subject to the constraint that each output can only be generated by at most k inputs.
For the remainder of this section, let k ∈ N >0 , and fix a distribution p X such that, without loss of generality, p 1 ≥ p 2 ≥ · · · ≥ p n . The greedy solution proposed by [13] is described in Algorithm 1. The algorithm is straightforward: it associates the k most likely secrets with the first observable; it then associates the k most likely secrets among the remaining secrets with the second observable, and so on. The solution for X = {x 1 , . . . , x 8 } and k = 3 is depicted in Figure 2.
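A minimal Python sketch of this greedy construction (function name ours; the prior need not be pre-sorted, as the sketch sorts it internally):

```python
from math import ceil

def greedy_channel(p, k):
    """Deterministic channel assigning the k most likely secrets to the first
    output, the next k most likely to the second, and so on."""
    order = sorted(range(len(p)), key=lambda i: -p[i])
    cols = ceil(len(p) / k)
    K = [[0] * cols for _ in p]
    for rank, x in enumerate(order):
        K[x][rank // k] = 1   # rank 0..k-1 -> output 0, k..2k-1 -> output 1, ...
    return K

p = (0.3, 0.25, 0.2, 0.15, 0.1)
print(greedy_channel(p, 2))
# x1, x2 -> y1; x3, x4 -> y2; x5 -> y3
```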
Algorithm 1 Greedy algorithm for the k-complete hypergraph problem
Input: Input set X , prior p X and integer k ≤ |X |
Output: Matrix of optimal deterministic channel K k
1: initialize: K k as a matrix of 0s, with |X | rows and ⌈|X |/k⌉ columns
2: for i ∈ {1, . . . , |X |} do K k (i, ⌈i/k⌉) = 1
3: return K k

K k   y 1   y 2   y 3
x 1    1     0     0
x 2    1     0     0
x 3    1     0     0
x 4    0     1     0
x 5    0     1     0
x 6    0     1     0
x 7    0     0     1
x 8    0     0     1

Figure 2. The solution given by Algorithm 1 for k = 3.

Theorem 6 ([13]). Given a complete k-hypergraph channel design problem, the solution given by Algorithm 1 is optimal for any symmetric channel-supermodular entropy.
Proof. We reproduce the proof of this theorem from [13], as it provides an interesting application of channel-supermodularity. Let us consider the following joint matrix J k obtained by Algorithm 1 and the prior: We now prove that any matrix J satisfying the constraints can be transformed into J k by a sequence of steps each increasing (or keeping equal) any supermodular entropy. Each step consists of the following three sub-steps: • Select two columns c i , c j and align the non-zero coefficients in c i , c j ; • Perform ∧, ∨ operations on the aligned columns and replace c i , c j with c i ∨ c j , c i ∧ c j ; • Disalign the two columns c i ∨ c j , c i ∧ c j .
The following example illustrates one step (i.e., the three sub-steps above): replaced c 2 , c 3 with c 2 ∨ c 3 , c 2 ∧ c 3 ; • disaligned columns 2 and 3, that is, position values in c 2 ∨ c 3 , c 2 ∧ c 3 , so that each row has the same probability it had before the step.
Notice that aligning (and disaligning) is a permutation of a column; hence, these operations do not change the value of the posterior for symmetric channel-supermodular entropies, because G F (c′ i ) = G F (c i ) for any permutation c′ i of the column c i .
Next, for the remaining sub-step, where we replace c i , c j with c i ∨ c j , c i ∧ c j : by supermodularity of G F , we have G F (c i ∨ c j ) + G F (c i ∧ c j ) ≥ G F (c i ) + G F (c j ); hence, that sub-step increases (or keeps equal) the posterior entropy. Notice also that the matrix at the end of the step has in each row the same probabilities as it had before that step; hence, it is still a joint matrix that respects the complete k-hypergraph constraints for the same prior.
The selection and alignment of columns is as follows: at the initial step select c i such that c i contains the first r elements with the highest probabilities, say p 1 , . . . , p r ; if r < k, then select c j as the column containing p r+1 ; align c i , c j so that p r+1 is not on the same row as any of the p 1 , . . . , p r (and c i ∨ c j has no more than k non-zero terms). Then, c i ∨ c j will contain p 1 , . . . , p r+1 . Repeat until r = k. Then repeat the process considering the probabilities p k+1 , . . . , p n .
By repeating these steps, we will reach a matrix J with columns c 1 , . . . , c n such that each element of column c i has higher probability than all elements of column c i+1 . This is exactly the solution given by the greedy algorithm (modulo column permutations), that is, J = J k .

If H is not channel-supermodular, the greedy solution may not be optimal. Consider, for example, the Hayashi-Rényi entropies, which are not channel-supermodular. Let X = {x 1 , . . . , x 4 }, p X = (0.3, 0.3, 0.2, 0.2) and k = 3, and consider the channels K 1 , K 2 with outputs Y 1 , Y 2 below.
Then, K 1 is the greedy solution. However, for the Hayashi-Rényi entropies, the following limit holds [52]: Thus, H α (X|Y 1 ) < H α (X|Y 2 ) for large enough α, and the greedy solution is not optimal. Another example of core-concave entropies for which Algorithm 1 is not optimal is provided by the partition entropies: if B is a partition of the possible values of X, then H B is a core-concave entropy which is not channel-supermodular.

An Application to Query Anonymity
Let us consider the following anonymity mechanism problem: we want to design a mechanism in which, in order to conceal a secret query from an eavesdropper, the user sends to a server a set of k queries which includes the secret query. Then, once the user has received from the server the responses to all k queries, the user retrieves the response to the secret query. In our setting, this corresponds to each observable having a pre-image of size exactly k.
As an illustrative example, consider a Twitter user who wants to visit some other Twitter user's page but wants to keep this query secret. To solve this problem, he decides to use the following protocol: whenever he visits the desired user's page, he also sends k − 1 other queries to the pages of other Twitter users. Suppose further that he frequently visits this page, meaning that a random choice of the other queries is not a wise strategy: multiple observations would end up revealing more and more information about the query, eventually completely revealing the secret query. The problem is then: which set of k Twitter pages will leak the least information about the user's secret query?
We assume the attacker has no background information about the user, and hence we set the probability of a Twitter query for that user to the probability that a general member of the public requests that Twitter page (a good proxy for this measure can be derived from the number of followers of that Twitter page). Let n be the number of possible queries (i.e., the size of the input set). Assuming that n is divisible by k, we can use Algorithm 1 to solve this problem.
Notice that for n secrets, there are n!/((n/k)!·(k!)^(n/k)) possible ways to satisfy these anonymity constraints. For example, there are about 7 × 10^85 possible solutions when n = 100 and k = 10, and about 4 × 10^19,704 for n = 10,000 and k = 100. We will now compare the greedy algorithm in Section 6 against other possible anonymity solutions, measuring the goodness of the solutions by the min and Shannon posterior entropies. Let us consider the three anonymity solutions below:

1. the solution from the greedy algorithm (Algorithm 1) (i.e., pick the k − 1 queries closest in probability to the real query);

2. a non-optimal solution where the secrets with the highest probabilities, instead of being grouped in the first bin, are distributed in the other bins.

The difference between these solutions can be very substantial. Figure 3 shows the values when the distribution over the input set is a binomial distribution with parameter p = 0.5: in this scenario, supposing that there is a universe of 350 Twitter pages and that the user sends 19 queries selected using the greedy algorithm, the probability of an attacker guessing the secret query correctly would be over 7 times smaller than if the user had opted instead for the non-optimal solution (using 2^(−H ∞ (X|Y)) as the conversion from posterior min-entropy to probability of guessing). In fact, it is easy to define an input probability distribution such that the leakage gap between the non-optimal solution and the optimal solution given by the greedy algorithm is arbitrarily large. Note that, by Theorem 6, the greedy solution is in fact optimal for all channel-supermodular entropies. Hence, the user knows that the greedy solution is optimal against an attacker trying to guess his secret query in a fixed number of guesses, using guesswork, or using a twenty-questions-style strategy (reflecting Shannon entropy), and so on.
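The count of possible anonymity assignments mentioned above can be reproduced with a short computation (Python; the helper name is ours):

```python
from math import factorial

def num_partitions(n, k):
    """Number of ways to split n secrets into n/k unordered bins of size k:
    n! / ((n/k)! * (k!)**(n/k)); n must be divisible by k."""
    m = n // k
    return factorial(n) // (factorial(m) * factorial(k) ** m)

print(f"{num_partitions(100, 10):.1e}")   # ≈ 6.5e85, i.e., "about 7 × 10^85"
```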

Query Anonymity for Related Secrets
Consider the scenario where a user who wants to query the Twitter pages of some political commentators is at the same time interested in hiding his own political affiliation, which could be leaked by his queries. In this scenario, the solution from Algorithm 1 might be sub-optimal. To see this, suppose that the k queries in the real query's cover end up all being affiliated with the same party. This would reveal the user's political party to the attacker with certainty, even though the real intended query is still uncertain. This does not contradict the optimality of the algorithm: such an adversary is better modeled by a partition entropy, with political commentators (the queried users) grouped by party affiliation, and, as established in Proposition 3, this type of entropy is not channel-supermodular in general.
Motivated by this scenario, we now give an optimal solution for all channel-supermodular entropies (even if not symmetric) to deal with this kind of problem. Suppose there are k political parties, and l commentators aligned with each party. Suppose, further, that the user is affiliated with one of the parties, and would like to check the profiles of his party's commentators without revealing his own affiliation.
To achieve this aim, the user decides to group the political commentators into covers of size k, each cover containing exactly one commentator from each party, and then proceeds to use these covers similarly to the mechanism described in Section 6.2, by querying the entire cover (fetching the pages of all the commentators in the cover). The question is: which set of covers reveals the least amount of information about the user's political affiliation? Let $X = \{P_1, \ldots, P_k\}$ be the set of parties, $Y = \{c_1^1, \ldots, c_1^l, c_2^1, \ldots, c_2^l, \ldots, c_k^1, \ldots, c_k^l\}$ be the set of commentators, wherein $c_i^j$ is the $j$th commentator of the $i$th party, and let $Z \subset 2^Y$ be the set of covers. Let $K : X \to Y$ be the channel giving the conditional probability of a user choosing to query a commentator given the user's party inclination. We assume that the user only chooses commentators that share the same affiliation as his, that is, $K(c_i^j \mid P_m) = 0$ whenever $i \neq m$. For simplicity, we assume that each party's commentators are ordered decreasingly with regard to this probability, that is, $K(c_m^1 \mid P_m) \ge K(c_m^2 \mid P_m) \ge \cdots \ge K(c_m^l \mid P_m)$ for every $m$. We claim that the optimal mechanism, for all channel-supermodular entropies (with regard to the political parties, not the commentators), is to group the most popular commentators of each party in the first cover, the second-most popular commentators of each party in the second, and so on. That is, the covers would be: $\{\{c_1^1, c_2^1, \ldots, c_k^1\}, \{c_1^2, c_2^2, \ldots, c_k^2\}, \ldots, \{c_1^l, c_2^l, \ldots, c_k^l\}\}$. Let $R : Y \to Z$ be the channel mapping each commentator to their cover, modelling the optimal solution described above. The matrix of this channel can be seen as a vertical concatenation of identity matrices. The channel $KR : X \to Z$ is then obtained by postprocessing $K$ by $R$.
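The construction above can be sketched numerically. In the snippet below, the popularity values are randomly generated placeholders (not values from the paper); the code builds the channels K and R and checks that each column of KR is a "rank slice" of K, i.e., that $KR(z_j \mid P_m)$ is party m's jth-largest query probability.

```python
import numpy as np

k, l = 3, 4  # parties, commentators per party (illustrative sizes)
rng = np.random.default_rng(0)

# K(c_m^j | P_m): each party's distribution over its own commentators,
# sorted decreasingly (commentator j is the jth most popular).
K = np.zeros((k, k * l))             # rows: parties; cols: commentators
for m in range(k):
    probs = np.sort(rng.dirichlet(np.ones(l)))[::-1]
    for j in range(l):
        K[m, m * l + j] = probs[j]   # commentator c_m^j has index m*l + j

# R: deterministic covering; cover z_j groups the jth most popular
# commentator of every party, so R is a stack of identity blocks.
R = np.zeros((k * l, l))
for m in range(k):
    for j in range(l):
        R[m * l + j, j] = 1.0

KR = K @ R
# Column j of KR holds each party's jth-largest query probability.
for m in range(k):
    party_probs = np.sort(K[m])[::-1][:l]
    assert np.allclose(KR[m], party_probs)
```

The design choice mirrors the text: since each cover contains one commentator per party, KR stays a proper channel (rows sum to one), and the rank-based covering sorts every row of KR decreasingly at once.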
The claim of optimality can be stated more formally as follows: given any channel-supermodular H, any distribution $p_X$ over the parties, and any other deterministic covering $R' : Y \to Z$ in which there is exactly one commentator from each party per cover, the resulting channel $KR'$ will never leak less information about the party than the channel $KR$ in (7). The proof that the covering $R$ above is indeed optimal for all channel-supermodular entropies is similar to the proof of Theorem 6, but even simpler. Suppose the channel $R' : Y \to Z$ is any covering that satisfies the restriction that each cover has exactly one commentator from each party. Now, consider the channel $KR'$, and proceed as follows: first, apply the Join-Meet operation to the first column of $KR'$ paired with each of the other columns in turn, that is, to the column pairs $(1, l), (1, l-1), \ldots, (1, 2)$. Then, disregarding the first column, repeat the process for the second column with all the remaining columns, and so on. It is easy to see that the resulting channel is exactly $KR$ in (7).
As an example, for k = 3 and l = 3, applying the procedure above to the non-zero values of K yields exactly the optimal solution given by (7).

Conclusions
In this work, we introduced the notion of channel-supermodular entropies, a subset of core-concave entropies which includes guessing entropy and the Arimoto-Rényi entropies. We demonstrated that, for this class of entropies, the Join-Meet operator on channel columns decreases the H-mutual information. This property prompted us to define structural preorders of channels ($\geq_{ds}$, $\geq_{sds}$), providing novel sufficient conditions for establishing whether two channels are in the H-more-capable order, for any channel-supermodular H, or in the $(H_1, H_\infty)$-capacity ordering. Moreover, this work establishes some relationships between these new structural preorders and other existing preorders from the literature.
As an example application, we used channel-supermodularity to prove an optimality result of a greedy query anonymization algorithm.
It is our belief that the connection between supermodular functions and some commonly used entropy measures, made in Section 3.1, will prove useful for future investigations in information theory (for example, given the vast literature on supermodular functions over Euclidean space [50] (Chapter 6.D) and [49]). Further directions of work include investigating other useful properties of channel-supermodular entropies, as well as further applications of channel-supermodularity to anonymity.
Author Contributions: All authors contributed to the results of this paper. All authors have read and agreed to the published version of the manuscript.

Appendix A

(1) Suppose $r \in \mathbb{R}^n_{>0}$. Then, for all $i, j$ with $i \neq j$, we obtain (A1). Supermodularity of $G_F$ restricted to $\mathbb{R}^n_{>0}$ then follows from (4). If, however, $r \in \mathbb{R}^n_{\geq 0} \setminus \mathbb{R}^n_{>0}$, the second derivative may not exist. In such a case, supermodularity can be established by a limiting argument. Let $\varepsilon > 0$ and $r_\varepsilon = r + \varepsilon \sum_k e_k$.
Then, (4) and (A1) imply that the supermodularity inequality holds for $r_\varepsilon$. As $G_F$ is continuous, taking $\varepsilon \to 0$ on both sides of the inequality above, we obtain the corresponding inequality for $r$. Thus, $G_F$ is supermodular, following (3).
(2) For Arimoto-Rényi entropies of order $\alpha \in (0, 1)$, we have $G_F(r) = \|r\|_\alpha$ and, similarly, $G_F(r) = -\|r\|_\alpha$ for entropies of order $\alpha > 1$. Now, for all $r \in \mathbb{R}^n_{>0}$ and $i, j$ with $i \neq j$, the mixed second derivative $\frac{\partial^2}{\partial r_i \partial r_j}\|r\|_\alpha$ is negative if $\alpha > 1$ and positive if $0 < \alpha < 1$. Thus, from Equation (4), $G_F$ restricted to $\mathbb{R}^n_{>0}$ is supermodular for all $\alpha \in (0, 1) \cup (1, \infty)$. Supermodularity of $G_F$ can then be established by a limiting argument similar to the one in the proof for Shannon entropy.
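This sign analysis admits a quick numerical spot-check (not a proof): taking the core function to be $\pm\|r\|_\alpha$ with the signs used above, the lattice form of supermodularity, $G(r \vee s) + G(r \wedge s) \ge G(r) + G(s)$, can be tested on random positive vectors.

```python
import random

def alpha_norm(r, alpha):
    """The alpha-norm ||r||_a = (sum_k r_k^a)^(1/a)."""
    return sum(x ** alpha for x in r) ** (1.0 / alpha)

def G(r, alpha):
    """Core function for Arimoto-Renyi: +||r||_a for a in (0,1),
    -||r||_a for a > 1 (signs as in the proof above)."""
    n = alpha_norm(r, alpha)
    return n if alpha < 1 else -n

def join(r, s):  # coordinatewise maximum
    return [max(a, b) for a, b in zip(r, s)]

def meet(r, s):  # coordinatewise minimum
    return [min(a, b) for a, b in zip(r, s)]

random.seed(1)
for alpha in (0.3, 0.7, 1.5, 4.0):
    for _ in range(500):
        r = [random.uniform(0.01, 1) for _ in range(5)]
        s = [random.uniform(0.01, 1) for _ in range(5)]
        # Supermodularity: G(r v s) + G(r ^ s) >= G(r) + G(s).
        lhs = G(join(r, s), alpha) + G(meet(r, s), alpha)
        assert lhs >= G(r, alpha) + G(s, alpha) - 1e-12
```

Strictly positive entries are used so that the derivative-based argument applies directly; the boundary cases are covered by the limiting argument in the text.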
Let $r' = r + \delta_1 e_i$, $r'' = r + \delta_2 e_j$, and $r''' = r + \delta_1 e_i + \delta_2 e_j$. The inequality above is equivalent to (A2). If $r_i + \delta_1$ is not amongst the $k$ largest elements of $r'''$, then the right-hand side of (A2) is equal to 0, and there is nothing to prove. If $r_i + \delta_1$ is amongst the $k$ largest elements of $r'''$, then it is also amongst the $k$ largest elements of $r'$, and either: • $r_i$ is amongst the $k$ largest elements of $r$ and of $r''$, and thus both sides of (A2) are equal to $\delta_1$; • $r_i$ is amongst the $k$ largest elements of $r$ but not of $r''$, in which case the left-hand side of (A2) is equal to $\delta_1$, and the right-hand side of (A2) is less than $\delta_1$; • $r_i$ is amongst the $k$ largest elements neither of $r$ nor of $r''$. In this case, the left-hand side of (A2) is greater than or equal to the right-hand side, as the $k$th largest element of $r''$ is greater than or equal to the $k$th largest element of $r$.
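Writing $S(r)$ for the sum of the $k$ largest entries of $r$, the case analysis above amounts to the decreasing-differences inequality $S(r + \delta_1 e_i) - S(r) \ge S(r + \delta_1 e_i + \delta_2 e_j) - S(r + \delta_2 e_j)$; this is our reading of (A2), and the brute-force check below is a numerical spot-check of it, not a proof.

```python
import random

def S(r, k):
    """Sum of the k largest entries of r."""
    return sum(sorted(r, reverse=True)[:k])

def bump(r, i, d):
    """Return a copy of r with d added to coordinate i."""
    out = list(r)
    out[i] += d
    return out

random.seed(2)
for _ in range(2000):
    n = random.randint(3, 7)
    k = random.randint(1, n - 1)
    r = [random.uniform(0, 1) for _ in range(n)]
    i, j = random.sample(range(n), 2)
    d1, d2 = random.uniform(0, 1), random.uniform(0, 1)
    # Decreasing differences of S, as in the case analysis:
    lhs = S(bump(r, i, d1), k) - S(r, k)
    rhs = S(bump(bump(r, i, d1), j, d2), k) - S(bump(r, j, d2), k)
    assert lhs >= rhs - 1e-12
```

Intuitively, raising coordinate $j$ first can only make the later bump at coordinate $i$ less valuable for the sum of the $k$ largest entries, which is exactly what the three bullet cases verify.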
Let $i, j$ be such that $i \neq j$, and let $\delta_1, \delta_2 > 0$. Suppose, without loss of generality, that $r_i + \delta_1 \geq r_j + \delta_2$. We will prove that (A3) holds. Let $I_{r+\delta_1 e_i} = \{k \leq n \mid r_j < r_k < r_j + \delta_2\}$; that is, $I_{r+\delta_1 e_i}$ is the set of coordinates of $r + \delta_1 e_i$ whose value lies strictly between $r_j$ and $r_j + \delta_2$.
Let $m - 1$ be the number of entries of $r + \delta_1 e_i + \delta_2 e_j$ that are greater than or equal to $r_j + \delta_2$. Then, there are $(m + |I_{r+\delta_1 e_i}| - 1)$ coordinates of $r + \delta_1 e_i$ that are strictly greater than $r_j$.
The difference $G_F(r + \delta_1 e_i + \delta_2 e_j) - G_F(r + \delta_1 e_i)$ is then $m(r_j + \delta_2) - (m + |I_{r+\delta_1 e_i}|) r_j$, plus the sum of the entries indexed by the set $I_{r+\delta_1 e_i}$, as the integers by which they are multiplied are increased by one. As we assume $r_i + \delta_1 \geq r_j + \delta_2$, we have $i \notin I_{r+\delta_1 e_i}$, and the left-hand side of (A3) reduces to $m\delta_2 - |I_{r+\delta_1 e_i}| r_j + \sum_{k \in I_{r+\delta_1 e_i}} r_k$. We now turn to the right-hand side of (A3), the difference $G_F(r + \delta_2 e_j) - G_F(r)$. We define $I_r$ similarly to $I_{r+\delta_1 e_i}$ and divide the proof into three cases.
Case 1 ($r_i \geq r_j + \delta_2$): In this case, there are $m - 1$ entries of $r + \delta_2 e_j$ greater than or equal to $r_j + \delta_2$, and $(m + |I_r| - 1)$ entries of $r$ strictly greater than $r_j$. Moreover, $I_r = I_{r+\delta_1 e_i}$, and we obtain (A3). Case 2 ($r_j < r_i < r_j + \delta_2$): In this case, there are $m - 1$ entries of $r + \delta_2 e_j$ greater than or equal to $r_j + \delta_2$, and $(m + |I_r| - 2)$ entries of $r$ strictly greater than $r_j$. This time, $I_r = I_{r+\delta_1 e_i} \cup \{i\}$, and we have $G_F(r + \delta_2 e_j) - G_F(r) \leq -|I_{r+\delta_1 e_i}| r_j + m\delta_2 + \sum_{k \in I_{r+\delta_1 e_i}} r_k = G_F(r + \delta_1 e_i + \delta_2 e_j) - G_F(r + \delta_1 e_i)$, where the inequality comes from the assumption that $r_i < r_j + \delta_2$. Case 3 ($r_i \leq r_j$): In this case, again there are $m - 1$ entries of $r + \delta_2 e_j$ greater than or equal to $r_j + \delta_2$, and $(m + |I_r| - 2)$ entries of $r$ strictly greater than $r_j$. This time, $I_r = I_{r+\delta_1 e_i}$, and (A3) follows. Therefore, $G_F$ is supermodular.
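Taking the guessing-entropy core function to be $W(r) = \sum_m m\, r_{(m)}$, where $r_{(m)}$ is the $m$th largest entry (our reading of $G_F$ here), the increasing-differences inequality (A3) can be spot-checked numerically:

```python
import random

def W(r):
    """Guesswork-style core: sum over m of m times the m-th largest entry."""
    return sum((m + 1) * x for m, x in enumerate(sorted(r, reverse=True)))

def bump(r, i, d):
    """Return a copy of r with d added to coordinate i."""
    out = list(r)
    out[i] += d
    return out

random.seed(3)
for _ in range(2000):
    n = random.randint(2, 7)
    r = [random.uniform(0, 1) for _ in range(n)]
    i, j = random.sample(range(n), 2)
    d1, d2 = random.uniform(0, 1), random.uniform(0, 1)
    # Increasing differences, as in (A3):
    lhs = W(bump(bump(r, i, d1), j, d2)) - W(bump(r, i, d1))
    rhs = W(bump(r, j, d2)) - W(r)
    assert lhs >= rhs - 1e-12
```

Note that $W$ depends only on the multiset of entries, so ties in the sort do not affect its value, and the random continuous inputs exercise all three cases of the proof.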
Case 2 ($\alpha > 1$): Fix $n > 2$. Let $\alpha > 1$ and $\varepsilon > 0$, and pick $a > 0$ such that the required inequality holds; such an $a$ is guaranteed to exist because $\alpha > 1$. Let $r \in \mathbb{R}^n_{>0}$ be the vector with $r_1 = r_2 = 1$, $r_3 = a$, and $r_k = \varepsilon$ for all $k > 3$. Then (A4) does not hold. Therefore, $G_F$ is not supermodular.

Appendix B. Proofs of Section 5
Appendix B.1. Proof of Proposition 5

(1) follows immediately from Definition 9, and (2) from Proposition 4.2. Statement (3) is proven by the following channels: For (4), consider the channels $K_1, K_2$ as follows: These channels were introduced in [3], where it is proved that $K_1 \geq_{H_1^{\ln}} K_2$; however, $K_1 \not\geq_{ds} K_2$. To prove (6), consider the following channels, for some $r \in (0, 1)$ and $\delta, \epsilon > 0$: Notice that $K_1$ and $K_2$ are indeed channels if $\delta$ and $\epsilon$ are small enough. These channels were introduced in [3] as an example for which (for $\delta, \epsilon$ small enough) $K_1 \geq_{H_1^{\ln}} K_2$ but $K_1 \not\geq_{d} K_2$. As their input set is of size 2, Proposition 4.1 implies that $K_1 \not\geq_{ds} K_2$. Statement (7) follows directly from Definition 9, Proposition 1, and Theorem 4. In this case, $K_2$ is obtained from $K_1$ by applying the Join-Meet operation to columns 1 and 2, and thus $K_1 \geq_{ds} K_2$. Suppose, to derive a contradiction, that $K_1 \geq_{sh} K_2$. Then, there is a finite collection $\{(g_i, T_i, R_i)\}_i$ such that each $T_i, R_i$ is a deterministic channel, each $g_i > 0$, $\sum_i g_i = 1$, and $K_2 = \sum_i g_i T_i K_1 R_i$. Notice that, for all $i$, we must have $(T_i K_1 R_i)(y_2 \mid x_1) = (T_i K_1 R_i)(y_2 \mid x_2) = 0$.
Let $I = \{i \mid (T_i K_1 R_i)(y_1 \mid x_1) \leq 1/25\}$. We claim that, for all $i \in I$,
$(T_i K_1 R_i)(y_1 \mid x_3) + (T_i K_1 R_i)(y_3 \mid x_2) + (T_i K_1 R_i)(y_3 \mid x_3) \geq 4/3.$ (A6)
To see why, notice that channels of the form $T_i K_1$ are those channels whose rows are equal to rows of $K_1$ (with possible reordering and duplicates), and channels of the form $(T_i K_1) R_i$ are those obtained by permuting or summing columns of $T_i K_1$.
Suppose the first row of $T_i K_1$ is either $(1/2, 0, 1/2)$ or $(1/3, 1/3, 1/3)$. Then, in order that $i \in I$, the third column of $T_i K_1 R_i$ must be either (1) equal to the sum of the first and third columns of $T_i K_1$, or (2) equal to the sum of all columns of $T_i K_1$. In either case, we have $(T_i K_1 R_i)(y_3 \mid x_2) \geq 2/3$ and $(T_i K_1 R_i)(y_3 \mid x_3) \geq 2/3$, so (A6) holds.
If the first row of $T_i K_1$ is $(0, 1/25, 24/25)$, the proof is not as straightforward, and we divide it into two subcases. The first subcase is when the second row is either $(1/3, 1/3, 1/3)$ or $(1/2, 0, 1/2)$; then, each column of $T_i K_1$ will be mapped to the first or the third column of $T_i K_1 R_i$, and therefore $(T_i K_1 R_i)(y_1 \mid x_3) + (T_i K_1 R_i)(y_3 \mid x_3) = 1$. Then, (A6) follows from observing that $(T_i K_1 R_i)(y_3 \mid x_2) \geq 1/3$. The second subcase is when the second row is $(0, 1/25, 24/25)$. In this case, we have $(T_i K_1 R_i)(y_3 \mid x_2) \geq 24/25$ and, by considering each of the three possible choices for the third row of $T_i K_1$, it is easy to see that $(T_i K_1 R_i)(y_1 \mid x_3) + (T_i K_1 R_i)(y_3 \mid x_3) \geq 1/2$.
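Since $T_i$ only selects rows and $R_i$ only merges columns, the case analysis is finite and the claim (A6) can be verified exhaustively. The sketch below assumes $K_1$ consists exactly of the three rows appearing in the argument (their order in $K_1$ does not affect the check) and imposes the zero constraints on the $y_2$ column derived from the decomposition of $K_2$.

```python
from fractions import Fraction as F
from itertools import product

# The three rows of K1, as they appear in the case analysis.
rows = [
    (F(1, 2), F(0), F(1, 2)),
    (F(1, 3), F(1, 3), F(1, 3)),
    (F(0), F(1, 25), F(24, 25)),
]

def channel(t, rmap):
    """M = T K1 R: pick row t[x] for input x, then merge column c of
    T K1 into output column rmap[c]."""
    M = [[F(0)] * 3 for _ in range(3)]
    for x in range(3):
        for c in range(3):
            M[x][rmap[c]] += rows[t[x]][c]
    return M

checked = 0
for t in product(range(3), repeat=3):         # all row selections T
    for rmap in product(range(3), repeat=3):  # all column merges R
        M = channel(t, rmap)
        # Constraints forced by the decomposition of K2:
        if M[0][1] != 0 or M[1][1] != 0:
            continue
        if M[0][0] > F(1, 25):                # membership in the set I
            continue
        # The bound (A6):
        assert M[2][0] + M[1][2] + M[2][2] >= F(4, 3)
        checked += 1
print(checked)  # number of (T, R) pairs satisfying the constraints
```

Exact rational arithmetic (`fractions.Fraction`) avoids any floating-point slack at the tight boundary $4/3$, which is actually attained by some of the enumerated decompositions.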