Optimal Channel Design: A Game Theoretical Analysis

This paper studies the problem of optimal channel design. For a given input probability distribution and for hard and soft design constraints, the aim here is to design a (probabilistic) channel whose output leaks minimally from its input. To analyse this problem, general notions of entropy and information leakage are introduced. It can be shown that, for all notions of leakage here defined, the optimal channel design problem can be solved using convex programming with zero duality gap. Subsequently, the optimal channel design problem is studied in a game-theoretical framework: games allow for analysis of optimal strategies of both the defender and the adversary. It is shown that all channel design problems can be studied in this game-theoretical framework, and that the defender’s Bayes–Nash equilibrium strategies are equivalent to the solutions of the convex programming problem. Moreover, the adversary’s equilibrium strategies correspond to a robust inference problem.


Introduction
A channel is defined as a conditional distribution, modelling the probability of outputs that an adversary can observe given secret inputs. Important examples of channels are side-channels in computer security where an attacker, for example by observing the running time of an encryption program, can reconstruct the encryption keys.
At a high level, the problem of optimal channel design is the following: given a prior on the secret and some operational constraints, design a channel that minimises the leakage of information about the secret. In simple terms, an optimal channel can be seen as an optimal countermeasure to information leakage.
To explore this design problem, one needs to specify what constraints should be considered and how the leakage of information is quantified. In the cryptographic example above, one may want, for example, to design a channel of minimal leakage (in terms of the number of key bits that can be reconstructed by an adversary) under the constraint that the average encryption per block should take less time than some given duration. This work considers two general classes of constraints, which we refer to as "hard" and "soft". Hard constraints establish which outputs are allowed for each input. These constraints must be satisfied for each realisation of input-output pairs. Soft constraints, on the other hand, must be satisfied in the expected-value sense, as they relate to the expected utility of the channel.
Leakage of information is defined as the difference between the adversary's prior and posterior uncertainty, i.e., the uncertainty before and after observing the outputs of the channel. The leakage quantifies how much the attacker can learn about the secret input from observing the outputs.

Literature Review
The problem of information leakage outside of the communication setting has been studied in the quantitative information flow (QIF) literature [3][4][5][6], works on private information retrieval (PIR) [7], and private search queries [8,9], as well as research on privacy-utility trade-offs [10][11][12]. Particularly important from the field of QIF are advances on fundamental security guarantees of leakage measures (what security can be achieved) and robust techniques and results (how much a technique or result is valid across different notions of leakage). However, most of the theoretical effort has been focused on analysing a given system as opposed to a design problem.
Information leakage in the context of game theory has been studied in Reference [13]. That work focuses on modelling the interplay between attacker and defender with regard to the information leakage of given channels, and on reasoning about their optimal strategies. In contrast, our focus is on the design of optimal channels within operational constraints.
The authors in Reference [14] also use a zero-sum game, between a forecaster and Nature, to show that the celebrated maximum entropy principle in statistics, i.e., that in the absence of any further knowledge one should choose the distribution with the highest entropy from a family, is the dual of solving a robust Bayes decision problem. This was the inspiration for our duality connection in Section 5.2.
This work builds on and extends our two conference papers [1,2]. However, there are several differences compared to those papers. For example, we have now simplified the definition of core-concavity without loss of generality. In addition, the games in Reference [1] are different; e.g., they do not include soft constraints. Moreover, the connection of convex optimisation to a two-person game for "any" core-concave entropic leakage was not explored in either work. Finally, the relation of the dual problem, that of the adversary, to a robust information extraction problem is unique to this manuscript.

Contributions
The main contribution of this paper is to present the problem of designing optimal channels for minimum information leakage in a game-theoretical framework, for a generalised class of leakage measures. In this way, the optimal channel design can be studied from the point of view of both the defender (the channel designer) and the adversary (the inference maker, or the information extractor). The main technical contribution is Theorem 1, which shows that the convex programming solutions as in Reference [2] correspond to the defender's optimal strategies in these games. Moreover, this game-theoretical framework reveals that there is a tight duality relationship between the problem of designing a minimal leakage channel and that of choosing a "robust inference extraction" strategy. In particular, knowing only the specification of a channel given by some constraints and a prior distribution, the adversary needs to find the optimal strategy to extract the maximum amount of information about the input from the output of the channel, where the exact realisation of the channel is unknown. Hence, the strategy should be robust to any realisation of the channel within its constraints. When the game is finite, efficient solutions for both the defender's and the adversary's strategies can be found using linear programming.
Entropy 2018, 20, 675
This work also establishes a result to deal with uncertainty about the prior. By Theorem 2, it follows that, when the prior is not unique, but is known to depend on a hidden "context", the Nash equilibrium is not given by customising with respect to the context, but rather by treating the multi-prior problem as a single-prior one, where the prior is the average prior over all contexts.

Roadmap
After introducing notations and the information-theoretical background, including the important definitions of core-concavity and gain functions, the optimal channel design problem is presented in Section 3. It is then shown, in Section 4, that the problem is solved by convex programming for any entropy belonging to this generalised class.
The main contribution of the paper, i.e., the game-theoretical framework, is presented in Section 5. The games under study here are two-person sequential zero-sum games with asymmetric information. A notion of utility is introduced based on gain functions and soft constraints, and the saddle-point equilibria are defined. The main result of this section, Theorem 1, shows the correspondence between equilibria and convex optimisation from Section 4. The section concludes with a discussion of the problem from the adversary's point of view and its relation to robust inference.
In Section 6, our framework is extended to the case of uncertainty about the prior. It is first analysed as a convex optimisation problem, culminating in Theorem 2, which is followed by a discussion of the game-theoretical implication of that result.

Notational Conventions and Preliminaries
We will denote sets, random variables, and realisations with calligraphic, capital, and small letters, respectively, e.g., $\mathcal{X}$, $X$, and $x$. We will denote the cardinality of a set $\mathcal{X}$ by $|\mathcal{X}|$. For a vector $p$, we use $p_{[i]}$ to denote the $i$-th largest element of $p$, where ties are broken arbitrarily. In addition, we will use the notation $\|p\|_\alpha$ for the $\alpha$-norm of vector $p$, that is, $\|p\|_\alpha := \left(\sum_{i=1}^{n} p_i^{\alpha}\right)^{1/\alpha}$. The limit case of the $\infty$-norm is $\|p\|_\infty := p_{[1]}$. Let $X$ represent the secret as a discrete random variable that can take one of $n$ possible values from $\mathcal{X} := \{1, \ldots, n\}$ with the (categorical) distribution $p_X = (p_X(1), p_X(2), \ldots, p_X(n)) \in p(\mathcal{X})$, where $p(\mathcal{X})$ is the probability simplex in $\mathbb{R}^n$. For the rest of the paper, as is the convention, we may omit the subscript $X$ whenever it is not ambiguous and simply use $p$ to refer to $p_X$. Without loss of generality, assume that every secret has a strictly positive probability of realisation and that the $p(x)$'s are sorted in non-increasing order; that is, $p(1) \ge p(2) \ge \ldots \ge p(n) > 0$.
A system that generates an observable $Y$ from the discrete set $\mathcal{Y}$, which can probabilistically depend on a secret, can be modelled as a probabilistic discrete channel (henceforth referred to simply as a "channel") denoted by the triplet $(\mathcal{X}, p_{Y|X}, \mathcal{Y})$. Specifically, $\mathcal{X}$ and $\mathcal{Y}$ are the input and output alphabets, respectively, and $p_{Y|X}$ denotes the conditional probability distribution, also known as the transition matrix. That is, $p(y|x)$ is the probability with which the channel produces the output (the observable) $y$ given that its input (the secret) is $x$. In particular, these probabilities satisfy the following:
$$p(y|x) \ge 0 \quad \forall x \in \mathcal{X},\ \forall y \in \mathcal{Y}, \tag{1a}$$
$$\sum_{y \in \mathcal{Y}} p(y|x) = 1 \quad \forall x \in \mathcal{X}. \tag{1b}$$
In other words, the transition matrix is "row-stochastic". In the rest of the paper, we will use the terms secret and input, as well as observables and outputs, interchangeably. Central to this work is the notion of leakage of information. In order to define leakage formally, we will start by defining entropy and posterior (conditional) entropy in a general context.

Entropy
The classical choices for entropy and posterior entropy are (Gibbs-)Shannon's:
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x), \qquad H(X|Y) = -\sum_{y \in \mathcal{Y}^+} p(y) \sum_{x \in \mathcal{X}} p(x|y) \log p(x|y),$$
where $\mathcal{Y}^+$ is the set of outputs that have a strictly positive probability of realisation, that is, $\mathcal{Y}^+ = \{y \in \mathcal{Y} \mid \exists x \in \mathcal{X},\ p(y|x) > 0\}$. In addition, $p(y)$ is the (total) probability that $y$ is observed by the adversary, i.e., $p(y) = \sum_{x' \in \mathcal{X}} p(x') p(y|x')$, and $p(x|y)$ is the posterior probability of the secret $x$ given that $y$ is observed, as given by Bayes' rule: $p(x|y) = p(x, y)/p(y) = p(x)p(y|x)/p(y)$. However, as we mentioned in the introduction, there are many candidates for entropy. Some are more fitting for specific operational scenarios, such as Min-entropy and guesswork. A generalisation of Shannon and Min-entropy is the Rényi family, which itself is a special case of the Kolmogorov-Nagumo family. Rather than taking a specific entropy, we construct a general entropy from an axiomatic description.
Consider a random variable $X$ whose distribution depends on the realisation of a "context" $C$, which is a binary random variable. In particular, $p(c = 0) = \alpha$ and $p(c = 1) = 1 - \alpha$, with $0 \le \alpha \le 1$; moreover, $p_{X|c=0} = p_1$ and $p_{X|c=1} = p_2$. Compare the following two scenarios: (1) we observe the realisation of the context and (2) we cannot see the realisation of the context. Intuitively, our uncertainty about $X$ in the first scenario should be lower than that in the second. In particular, if we measure the uncertainty of a random variable with distribution $p$ by a function $F(p)$, we should have
$$\alpha F(p_1) + (1 - \alpha) F(p_2) \le F(\alpha p_1 + (1 - \alpha) p_2);$$
that is, $F$ should be a concave function. However, we note that this intuitive inequality still holds even if an increasing $\mathbb{R} \to \mathbb{R}$ function $\eta(\cdot)$ is applied to both sides; that is,
$$\eta\big(\alpha F(p_1) + (1 - \alpha) F(p_2)\big) \le \eta\big(F(\alpha p_1 + (1 - \alpha) p_2)\big).$$
The function $\eta$ can be thought of as capturing our risk attitude. This motivates the following definitions.

Definition 1.
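Let $H$ be a function from probability distributions to $\mathbb{R}$. Then we call $H$ core-concave if we can write $H(p) = \eta(F(p))$, where $\eta : \mathbb{R} \to \mathbb{R}$ is strictly increasing and $F$ is concave.
Throughout the paper, we will consider concave functions to also be continuous; specifically, their values on the boundaries are their limit values. Note that any concave function is also core-concave, by simply taking $\eta(t) = t$. However, the converse is not true. A notable example is the Rényi entropies:
$$H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_{i=1}^{n} p_i^{\alpha} = \frac{\alpha}{1-\alpha} \log \|p\|_\alpha.$$
For $\alpha > 1$, this function is neither concave nor convex (it is only pseudo-concave). However, it is core-concave. This can be shown as follows: for $\alpha > 1$, take $\eta(t) = \frac{\alpha}{1-\alpha} \log(-t)$ and $F(p) = -\|p\|_\alpha$; here $F$ is concave because the $\alpha$-norm is convex for $\alpha \ge 1$, and $\eta$ is increasing because $\frac{\alpha}{1-\alpha} < 0$. For $0 < \alpha < 1$, core-concavity can be shown by $\eta(t) = \frac{\alpha}{1-\alpha} \log(t)$ and $F(p) = \|p\|_\alpha$. As another example, consider Sharma-Mittal entropies [15], defined as
$$H_{\alpha,\beta}(p) = \frac{1}{1-\beta} \left[ \Big( \sum_{i=1}^{n} p_i^{\alpha} \Big)^{\frac{1-\beta}{1-\alpha}} - 1 \right].$$
This family generalises the Rényi ($H_{\alpha,\beta \to 1}(p)$), Shannon ($H_{\alpha \to 1,\beta \to 1}(p)$), and Havrda-Tsallis ($H_{\alpha,\beta \to \alpha}(p)$) entropies [16,17], and is also core-concave. This can be seen by taking, for instance, $F(p) = \mathrm{sgn}(1-\alpha) \sum_{i} p_i^{\alpha}$, which is concave, and $\eta(t) = \frac{1}{1-\beta} \big[ (\mathrm{sgn}(1-\alpha)\, t)^{\frac{1-\beta}{1-\alpha}} - 1 \big]$, which is increasing on the range of $F$. In this paper, we take any function that is core-concave as a candidate for entropy.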
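The core-concave decomposition of the Rényi entropy can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, natural logarithms are assumed, and for α > 1 it assumes the sign-flipped pair η(t) = α/(1−α) log(−t), F(p) = −‖p‖_α as one valid choice.

```python
import math

def alpha_norm(p, alpha):
    """alpha-norm of a vector with non-negative entries."""
    return sum(v ** alpha for v in p) ** (1.0 / alpha)

def renyi(p, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    return math.log(sum(v ** alpha for v in p)) / (1.0 - alpha)

def F(p, alpha):
    """Concave core: the norm for 0 < alpha < 1, its negative for alpha > 1."""
    norm = alpha_norm(p, alpha)
    return -norm if alpha > 1 else norm

def eta(t, alpha):
    """Increasing outer function matching F (assumed pair for alpha > 1)."""
    c = alpha / (1.0 - alpha)
    return c * math.log(-t) if alpha > 1 else c * math.log(t)

def midpoint_concave(p, q, alpha, tol=1e-12):
    """Check F((p+q)/2) >= (F(p)+F(q))/2 for one pair of distributions."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    return F(mid, alpha) >= (F(p, alpha) + F(q, alpha)) / 2.0 - tol

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
for alpha in (0.5, 2.0):
    print(alpha, renyi(p, alpha), eta(F(p, alpha), alpha), midpoint_concave(p, q, alpha))
```

A single midpoint check is only a sanity test, not a proof of concavity; it simply illustrates that the outer function η carries the non-concave part of the entropy.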

Posterior Entropy
Motivated by the equivalence of our core-concave entropies with generalised induced entropies, we define the posterior entropy to take the following form:
$$H(X|Y) := \eta\Big( \sum_{y \in \mathcal{Y}^+} p(y) F(p_{X|y}) \Big). \tag{4}$$
Note that the above definition is deliberately different from $\sum_{y \in \mathcal{Y}^+} p(y) H(p_{X|y})$. In particular, $\eta$ is outside of the expectation. Now, the (information) leakage can be defined as
$$\text{Leakage} := H(X) - H(X|Y).$$
The above structure of the posterior entropy is strongly motivated by the following result:

Proposition 1.
For any core-concave $H$, leakage is non-negative.

Proof.
Replacing from definitions, we have $H(X) - H(X|Y) = \eta(F(p_X)) - \eta\big( \sum_{y \in \mathcal{Y}^+} p(y) F(p_{X|y}) \big)$. For a core-concave $H$, $F$ is concave; hence, following Jensen's inequality, $\sum_{y \in \mathcal{Y}^+} p(y) F(p_{X|y}) \le F\big( \sum_{y \in \mathcal{Y}^+} p(y) p_{X|y} \big) = F(p_X)$. Therefore, since $\eta$ is a monotonically increasing function, we have $\eta\big( \sum_{y \in \mathcal{Y}^+} p(y) F(p_{X|y}) \big) \le \eta(F(p_X))$, i.e., leakage is non-negative.

In fact, our leakages satisfy a stronger property:

Proposition 2.
The conditional entropy defined in Equation (4) satisfies the data-processing inequality (DPI).
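As an illustration of the posterior entropy with η outside the expectation, the sketch below computes the leakage for two core-concave entropies. It is a hedged example under stated assumptions: the function names, the base-2 logarithm, and the example prior and channels are hypothetical and not from the paper. Shannon entropy uses η(t) = t, and Min-entropy uses η(t) = −log(−t) with F(p) = −max_x p(x).

```python
import math

def posteriors(prior, channel):
    """Return (p(y), p(.|y)) for every output y with p(y) > 0 (the set Y+)."""
    n, m = len(prior), len(channel[0])
    result = []
    for y in range(m):
        p_y = sum(prior[x] * channel[x][y] for x in range(n))
        if p_y > 0:
            result.append((p_y, [prior[x] * channel[x][y] / p_y for x in range(n)]))
    return result

def leakage(prior, channel, eta, F):
    """H(X) - H(X|Y) with H(X|Y) = eta(sum_y p(y) F(p_{X|y}))."""
    prior_entropy = eta(F(prior))
    posterior_entropy = eta(sum(p_y * F(post) for p_y, post in posteriors(prior, channel)))
    return prior_entropy - posterior_entropy

# Shannon entropy: eta is the identity and F is the (concave) Shannon function.
def shannon_F(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

def identity(t):
    return t

# Min-entropy: F(p) = -max(p) is concave and eta(t) = -log2(-t) is increasing for t < 0.
def min_F(p):
    return -max(p)

def min_eta(t):
    return -math.log2(-t)

prior = [0.5, 0.25, 0.25]                         # hypothetical prior
noisy = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]      # hypothetical channel
print(leakage(prior, noisy, identity, shannon_F))
print(leakage(prior, noisy, min_eta, min_F))
```

A channel whose rows are all identical reveals nothing and gives zero leakage, while the identity channel leaks the whole prior entropy; both make convenient sanity checks.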

Gain Functions and g-Leakage
An alternative foundational approach to information leakage is in terms of gain functions. As we will use gain functions in our results, we give here a primer on this approach.
A classical interpretation for Shannon entropy is in terms of guessing a secret by asking set membership questions ("is the secret in set X?"). Often in the security community, another guessing model is more appropriate, namely individual guesses: "is the secret x?"
Information-theoretically, the individual guesses scenario is modelled by Min-entropy. This guessing scenario is, however, an all-or-nothing scenario: the attacker either guesses the secret or does not, and right guesses always yield the same reward. In many real-world scenarios, however, even guessing part of the secret may be valuable, or guessing different secrets may yield different rewards. These scenarios have motivated the introduction of gain functions and g-vulnerability [18]. A gain function is a real-valued function $g$ whose arguments are an attacker's guess and the secret: $g(a, x)$ quantifies the gain of the attacker for guessing $a$ when the secret is $x$.
The g-vulnerability is defined as the attacker's expected gain for an optimal guess:
$$V_g(p) := \sup_{a \in \mathcal{A}} \sum_{x \in \mathcal{X}} p(x) g(a, x),$$
where $\mathcal{A}$ is a countable set (the attacker's guesses). From g-vulnerability, one can define posterior g-vulnerability by considering the average vulnerability over all possible outputs, i.e.,
$$V_g(p_{Y|X}) := \sum_{y \in \mathcal{Y}^+} p(y) V_g(p_{X|y}) = \sum_{y \in \mathcal{Y}^+} \sup_{a \in \mathcal{A}} \sum_{x \in \mathcal{X}} p(x) p(y|x) g(a, x).$$
Further derived notions are g-entropy and g-leakage. The g-entropy is defined as the negative log of the vulnerability: $-\log V_g(p)$. Similarly, the posterior g-entropy is defined as the negative log of the posterior vulnerability: $-\log V_g(p_{Y|X})$. The g-leakage is the difference between the g-entropy and the posterior g-entropy. An important property of gain functions, which we use in the game-theoretical analysis, is that any convex function can be defined using gain functions ([19] (Theorem 5)).
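The g-vulnerability, its posterior version, and the g-leakage can be sketched in a few lines. This is an illustrative example only: it assumes a finite guess set (the text allows a countable one), base-2 logarithms, and a hypothetical prior and channel; the "identity" gain function rewards only exact guesses.

```python
import math

def g_vulnerability(p, g, guesses):
    """Expected gain of the best single guess under distribution p."""
    return max(sum(p[x] * g(a, x) for x in range(len(p))) for a in guesses)

def posterior_g_vulnerability(prior, channel, g, guesses):
    """sum_y max_a sum_x p(x) p(y|x) g(a, x); zero-probability outputs add 0."""
    n, m = len(prior), len(channel[0])
    total = 0.0
    for y in range(m):
        joint = [prior[x] * channel[x][y] for x in range(n)]
        if sum(joint) > 0:
            total += max(sum(joint[x] * g(a, x) for x in range(n)) for a in guesses)
    return total

def g_leakage(prior, channel, g, guesses):
    """g-entropy minus posterior g-entropy (both are negative logs)."""
    g_entropy = -math.log2(g_vulnerability(prior, g, guesses))
    posterior_g_entropy = -math.log2(posterior_g_vulnerability(prior, channel, g, guesses))
    return g_entropy - posterior_g_entropy

def identity_gain(a, x):
    """All-or-nothing gain: 1 for guessing the secret exactly, 0 otherwise."""
    return 1.0 if a == x else 0.0

prior = [0.5, 0.25, 0.25]                         # hypothetical prior
channel = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]    # hypothetical channel
guesses = range(3)
print(posterior_g_vulnerability(prior, channel, identity_gain, guesses))
print(g_leakage(prior, channel, identity_gain, guesses))
```

With the identity gain function, the g-entropy coincides with Min-entropy, so this example reproduces the individual-guess scenario described above.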

Optimal Channel Design
The general setting in our paper is the following: Given a prior distribution on input (secret) variable X as p, we (the defender) would like to design a channel p Y|X within some operational constraints, such that the channel leaks minimally about the secret X through its output Y.
Let $\Omega \subseteq \mathcal{X} \times \mathcal{Y}$ define the permissible outputs (observables) for each input (secret). Specifically, if $(x, y) \notin \Omega$, then, for input $x$, the designer cannot produce output $y$. This can represent the "hard" operational constraints on the channel. Hence, the channel, along with Equation (1), should satisfy:
$$p(y|x) = 0 \quad \forall (x, y) \notin \Omega. \tag{6}$$
We will refer to Equation (6) as "hard" constraints, as they strictly forbid some input-output pairs "path-wise", that is, for each realisation of the input. As a consequence, an adversary can eliminate the forbidden inputs for an observable when making an inference. For ease of notation, for any given $\Omega$, we will denote the space of channels that satisfy Equations (1) and (6) by $\Gamma$. That is,
$$\Gamma := \Big\{ p_{Y|X} \ :\ p(y|x) \ge 0\ \ \forall x, y;\quad \sum_{y \in \mathcal{Y}} p(y|x) = 1\ \ \forall x;\quad p(y|x) = 0\ \ \forall (x, y) \notin \Omega \Big\}.$$
The design requirement for a legitimate channel that satisfies the hard constraints can now be expressly represented as $p_{Y|X} \in \Gamma$. The naming of hard constraints is to contrast with the "soft" constraints, which are expressed in terms of an expected value. In particular, there are many interesting cases where it may be "feasible" to assign the same observable to all secrets, but such a move may result in a huge deterioration in the system's quality of service (QoS). In such cases, the goal is to strike an optimal "balance" between information leakage and QoS. This is, for instance, the setting in the geo-location privacy-utility trade-off [10,11,20] and the secrecy-delay trade-off in bucketing as a defence against timing attacks [21,22].
In its most basic form, the QoS can be captured as an expected value of a "pay-off" (desirability) function. In particular, let $u : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $u(x, y)$ represents how good the realised output is for a particular input. Then the expected value of the pay-off is simply $\sum_{x,y} p(x) p(y|x) u(x, y)$, which can be a metric for the QoS of the channel. The channel design problem then becomes a "two-objective" optimisation: (a) minimising leakage and (b) maximising the QoS. The solution concept for multi-objective optimisations is that of "Pareto-efficiency" (Pareto-optimality): a solution is Pareto-efficient if no alternative can simultaneously improve all of the objectives (at least one of them strictly). One of the standard methods of converting a multi-objective optimisation (MOO) to (a series of) single-objective optimisations (SOOs) is to present all but one of the objectives as inequality constraints. Specifically, we can introduce a lower threshold $u_{\min}$ on the QoS by imposing $\sum_{x,y} p(x) p(y|x) u(x, y) \ge u_{\min}$. Then, by varying the value of $u_{\min}$ and solving the resulting SOOs, the Pareto-frontier (the set of Pareto-optimal solutions) will be found (see e.g., [23]). Hence, with this in mind, for the rest of the paper, we will be dealing with SOOs. We will refer to the constraint $\sum_{x,y} p(x) p(y|x) u(x, y) \ge u_{\min}$ as the "soft" constraint, since it is expressed in terms of the expected value, distinguishing it from the "hard" constraints represented by $\Omega$ (or equivalently, $\Gamma$), which hold for each realisation of the secret.
As we argued before, the aim is to design channels that have the lowest leakage of information about the input while satisfying a set of operational constraints, and the leakage is defined as the difference between the prior and posterior entropies. The first point to note is that the choice of the channel cannot change the prior entropy, as the prior entropy of the input is entirely governed by its prior distribution, which we assume is a "given" parameter that the defender cannot control. Therefore, the problem of minimising the leakage becomes equivalent to maximising the posterior entropy (equivocation).
Putting things together, the optimal channel design problem in its most general form becomes
$$\max_{p_{Y|X}} \ H(X|Y) \quad \text{subject to} \quad p_{Y|X} \in \Gamma, \quad \sum_{x,y} p(x) p(y|x) u(x, y) \ge u_{\min}, \tag{7}$$
where the main notations are described in Table 1.

Table 1. Main notations.
$H(X)$: (prior) entropy of the input, equal to $\eta(F(p_X))$, where $\eta$ is increasing and $F$ is concave.
$H(X|Y)$: posterior entropy of the input, equal to $\eta\big(\sum_{y} p(y) F(p_{X|y})\big)$ for the same $\eta$ and $F$.
Leakage: leakage of information about the input by observing the outputs, equal to $H(X) - H(X|Y)$.

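To make the constraints of the design problem concrete, the following sketch checks whether a candidate channel is feasible, i.e., row-stochastic, supported on Ω, and meeting the expected pay-off threshold. All names and the small instance (three secrets, two observables, the set `omega`, the pay-off `u`) are hypothetical and chosen only for illustration.

```python
def in_gamma(channel, omega, tol=1e-9):
    """Hard constraints: rows are distributions and no mass outside omega."""
    for x, row in enumerate(channel):
        if any(v < -tol for v in row):
            return False                      # violates non-negativity (1a)
        if abs(sum(row) - 1.0) > tol:
            return False                      # violates row-stochasticity (1b)
        if any(v > tol and (x, y) not in omega for y, v in enumerate(row)):
            return False                      # violates hard constraints (6)
    return True

def expected_utility(prior, channel, u):
    """Soft-constraint quantity: sum_{x,y} p(x) p(y|x) u(x, y)."""
    return sum(prior[x] * channel[x][y] * u(x, y)
               for x in range(len(prior)) for y in range(len(channel[0])))

def is_feasible(prior, channel, omega, u, u_min, tol=1e-9):
    """Feasibility for the design problem: hard and soft constraints together."""
    return in_gamma(channel, omega, tol) and expected_utility(prior, channel, u) >= u_min - tol

# Hypothetical instance: secret 0 may only produce output 0, secret 2 only output 1.
prior = [0.5, 0.25, 0.25]
omega = {(0, 0), (1, 0), (1, 1), (2, 1)}

def u(x, y):
    return 1.0 if (x, y) in {(0, 0), (2, 1)} else 0.5

channel = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(expected_utility(prior, channel, u))
print(is_feasible(prior, channel, omega, u, u_min=0.8))
```

Sweeping `u_min` over a range of values and solving the design problem for each one is how the Pareto frontier between leakage and QoS would be traced out.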
Before we get to our analysis, we present two minimalistic examples to instantiate the constraints. Note that each of these contexts of course has its idiosyncrasies, which are abstracted away for the purpose of this paper. The first toy example is motivated by geo-location privacy. Figure 1 depicts four locations $x_1$ to $x_4$, where the configuration is a representation of their relative positions. The defender is in one of these four locations and generates an observable, which can be its reported coordinates, based on which it receives a location-based service (LBS). Suppose, in particular, that $x_1$ and $x_2$ are near enough that the same observable can be reported for both of them, but $x_1$ is too far from $x_3$ and $x_4$, such that reporting the same coordinates as them is either infeasible (e.g., it will not get any network connectivity from an access point) or unacceptable (the quality of the received utility will be too poor). Moreover, $x_2$, $x_3$, and $x_4$ are close enough to produce the same observable. If we label the observables simply by the subset of the secrets that can produce them, then $\Omega$ is the set of admissible secret-observable pairs under this labelling. This $\Omega$ determines the hard constraints on the problem, e.g., we must have $p(\{x_2, x_3, x_4\} \mid x_1) = 0$ because $(x_1, \{x_2, x_3, x_4\}) \notin \Omega$.
[Figure 1: four locations $x_1$, $x_2$, $x_3$, $x_4$, shown in their relative positions.]
As another example, consider a minimalistic bucketing example depicted in Figure 2. The axis denotes time duration, and $x_1$ to $x_4$ represent the distinct execution times of four distinct (encryption or decryption) processes, i.e., Process 1 takes $x_1$ time to finish, and so on. If the result of each process is released immediately upon finishing, then the processes can be uniquely identified just by the timing "side channel". The result of a finished process can be deferred and released at a later time, to become indistinguishable from other processes that take longer to finish. This shared release time constitutes a bucket. In the figure, the arrows represent whether a secret can be deferred until the finishing time of a longer process. Specifically, suppose that the delay limitation for Process 1 does not allow it to be released as late as $x_3$ or $x_4$. Therefore, the hard constraints can be identically represented as in the previous toy example.
Figure 2. (Toy Example 2) The "secrets" are one of the four processes, each with a distinct execution time $x_1$ to $x_4$, marked on a duration axis. The arrows denote which process can be deferred to be released at a later finishing time. For instance, Process 2 can be either released instantaneously, i.e., at $x_2$, or deferred until $x_3$, or until $x_4$. In contrast, Process 1 cannot be deferred as late as $x_3$ or $x_4$.

Optimal Channel Design is Convex Programming
We now show that the problem of finding an optimal channel is a "convex optimisation" (also known as "convex programming" [24,25]). This is a useful result, because convex optimisations have desirable characteristics, e.g., many efficient algorithms for solving them exist (e.g., interior-point methods [25]). Moreover, any local optimum is guaranteed to also be a global optimum, so in particular a "descent" algorithm cannot get trapped in a local optimum that is not global. Additionally, in Proposition 4, we show that the Karush-Kuhn-Tucker (KKT) conditions fully describe the optimal channel (i.e., they represent necessary and sufficient conditions of optimality).

Proposition 3.
The optimal channel design problem in Equation (7), for any choice of the pay-off and core-concave entropy functions, is solved by convex programming.

Proof.
The function $\eta$ from Equation (7) can simply be ignored in the maximisation, since it is an increasing $\mathbb{R} \to \mathbb{R}$ function. Our optimisation variable is $p_{Y|X} \in \mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$. In particular, consider it as a $|\mathcal{X}||\mathcal{Y}| \times 1$ vector. All we need to show is that (a) the constraints of the optimisation define a convex subset of $\mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$ and (b) the objective function of the maximisation is concave in $p_{Y|X}$.
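Establishing (a) is simple: the constraint $p_{Y|X} \in \Gamma$, which is equivalent to Equations (1a,b) and (6), trivially defines a convex subset. The minimum expected utility constraint is also linear in $p_{Y|X}$, where the coefficient of $p(y|x)$ is $p(x)u(x, y)$. Hence, the constraints of the problem define a convex subset of $\mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$. In fact, they define a bounded polyhedron, as the feasible set is an intersection of finitely many half-spaces and every $p(y|x)$ lies in $[0, 1]$.
We establish part (b) by expressing the objective $\sum_{y} p(y) F(p_{X|y})$ as a composition of transformations that preserve concavity, applied to the joint distribution $p(x, y) = p(x)p(y|x)$, which is linear in $p_{Y|X}$:
• First affine transformation $f_i$: projection of $p(x, y)$ onto the sub-coordinates where $y = y_i$, that is, the transformation $(p(x_j, y_i))_{i,j} \mapsto (p(x_j, y_i))_j$. Composition with an affine mapping preserves concavity/convexity.
• Second affine function $g_1$: extension of a vector with the sum of its elements, i.e., the transformation $x \mapsto (x, \sum_j x_j)$.
• Perspective of $F$: the transformation $(x, t) \mapsto t F(x/t)$ for $t > 0$, written here as $\tilde{F}$, which is concave whenever $F$ is concave.

Now, we can write
$$\sum_{y_i \in \mathcal{Y}^+} p(y_i) F(p_{X|y_i}) = \sum_{i} \tilde{F}\big( g_1( f_i( p(x, y) ) ) \big),$$
since $g_1(f_i(p(x, y))) = \big( (p(x_j, y_i))_j,\ p(y_i) \big)$ and $p(x_j|y_i) = p(x_j, y_i)/p(y_i)$. A sum of concave functions is concave.
Hence, with $\eta$ ignored, $H$ is concave in $p(x, y)$, and therefore in $p_{Y|X}$.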
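The concavity claim can be sanity-checked numerically. The sketch below is an illustration rather than part of the proof: it assumes the Shannon choice of $F$ (in nats), a hypothetical prior, and randomly drawn channels, and it tests midpoint concavity of $\sum_y p(y) F(p_{X|y})$ in the channel.

```python
import math
import random

def objective(prior, channel):
    """sum_y p(y) F(p_{X|y}) with F Shannon, written via the joint p(x, y)."""
    n, m = len(prior), len(channel[0])
    total = 0.0
    for y in range(m):
        joint = [prior[x] * channel[x][y] for x in range(n)]
        p_y = sum(joint)
        if p_y > 0:
            # p(y) * F(p_{X|y}) = -sum_x p(x, y) log(p(x, y) / p(y))
            total += -sum(j * math.log(j / p_y) for j in joint if j > 0)
    return total

def random_channel(n, m, rng):
    """Draw a row-stochastic n-by-m matrix by normalising uniform weights."""
    rows = []
    for _ in range(n):
        weights = [rng.random() for _ in range(m)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def midpoint(c1, c2):
    """Entry-wise average of two channels (again a valid channel)."""
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]

def concavity_violations(prior, outputs=3, trials=200, seed=0, tol=1e-12):
    """Count pairs where the objective at the midpoint falls below the average."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        c1 = random_channel(len(prior), outputs, rng)
        c2 = random_channel(len(prior), outputs, rng)
        average = 0.5 * (objective(prior, c1) + objective(prior, c2))
        if objective(prior, midpoint(c1, c2)) < average - tol:
            count += 1
    return count

print(concavity_violations([0.5, 0.25, 0.25]))
```

Because the feasible set is convex, the midpoint of two feasible channels is feasible too, so the same check applies unchanged inside $\Gamma$.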
As mentioned before, a fundamental property of convex optimisations is that any local optimum is a global optimum. In what follows, we establish another important property of the optimal channel design problems: that the Karush-Kuhn-Tucker (KKT) conditions provide both necessary and sufficient conditions for optimality. For an overview of Lagrangian duality and the KKT conditions, the reader can consult the rich literature on convex programming, such as [24] (Chapter 5) and [26] (Chapter 28).

Proposition 4.
KKT conditions are necessary and sufficient for solving the optimal channel design problem described by Equation (7).

Proof.
We start by noticing that, in their most basic form, KKT conditions are expressed for cases where the functions in the objective and constraints are "continuously differentiable", whereas some of our objective functions (e.g., in the case of min-entropy or guesswork) are piecewise linear. There is, however, a simple and standard translation from piecewise-linear convex functions into continuously differentiable functions by forming the epigraph problem [24] (§5.2.5).
The proof is straightforward: all of our constraints are affine, hence the KKT conditions are necessary; this is known as the "Linearity Constraint Qualification" (LCQ). Moreover, since we showed that these problems are convex optimisations, the KKT conditions are also sufficient [24] (§5.5.3).
In the "Lagrangian" for the problem of Equation (7), denoted by $\mathcal{L}$, the multipliers $\mu, \gamma$ are associated with the equality constraints and are therefore free (no sign constraint), whereas the multipliers $\lambda, \rho$ pertain to the inequality constraints and are hence required to be non-negative (dual feasibility). The optimisation problem then becomes equivalent to solving the following KKT conditions:
1. Vanishing first-order derivatives of $\mathcal{L}$ with respect to each of the optimisation variables $p(y|x)$, that is, $\nabla \mathcal{L} = 0$ (where $\nabla$ is the gradient with respect to the (primal) variables $p(y|x)$). That is, for each $p(y|x)$: $\partial \mathcal{L} / \partial p(y|x) = 0$.
2. Primal feasibility: the constraints of Equation (7) hold.
3. Dual feasibility: the multipliers $\lambda, \rho$ of the inequality constraints are non-negative.
4. Complementary slackness: for each inequality constraint, the product of its multiplier and its slack is zero.

Game-Theoretical Interpretation
We now present a game-theoretical framework for the general optimal channel design problem. The problem solution is shown to be a Nash equilibrium in a sequential zero-sum game. The main result proved in this section is a correspondence between any defender Nash equilibrium in these games and the convex programming problems from Proposition 3. Moreover, when the game is finite, the solution can be found with linear programming and, hence, in a more efficient way than in the general case. An important property of the game interpretation is that it provides not only the optimal channel design but also the attacker's optimal attack strategy.
Consider the following two-player zero-sum game between a defender and an adversary: "Nature" chooses a realisation of a random variable $X$ from the finite set $\mathcal{X}$ according to the publicly known probability distribution $p$. The defender observes the realisation $x$ and chooses an action from the finite set $\mathcal{Y}$. Hence, the space of the pure strategies of the defender is the set of all functions from $\mathcal{X}$ to $\mathcal{Y}$, i.e., $\mathcal{Y}^{\mathcal{X}}$. Each pure strategy of the defender corresponds to a deterministic channel. Similarly, a behavioural strategy of the defender corresponds to a probabilistic channel, $p(Y|X)$, whose space is $(\Delta\mathcal{Y})^{\mathcal{X}}$. The adversary, after observing $y$, makes a guess $a$ from the countable (but potentially infinite) set $\mathcal{A}$. Hence, the space of the adversary's pure strategies (deterministic plans of action) is $\mathcal{A}^{\mathcal{Y}}$. A behavioural strategy of the adversary, designated by $q(A|Y)$, assigns a potentially probabilistic guess to each output. Hence, the space of the adversary's behavioural strategies is $(\Delta\mathcal{A})^{\mathcal{Y}}$. A pure and a behavioural strategy profile of the game are, respectively, the pairs $(d, a) \in \mathcal{Y}^{\mathcal{X}} \times \mathcal{A}^{\mathcal{Y}}$ and $(p(Y|X), q(A|Y)) \in (\Delta\mathcal{Y})^{\mathcal{X}} \times (\Delta\mathcal{A})^{\mathcal{Y}}$.
The payoff of the game can in general be represented by the (bounded) function $v : \mathcal{X} \times \mathcal{Y} \times \mathcal{A} \to \mathbb{R}$. That is, the outcome of each instance of the game is that the adversary wins, and the defender loses, $v(x, y, a)$ units when the realisations of the channel input, the channel output, and the adversary's guess are $x$, $y$, and $a$, respectively. Let $V$ represent the expected payoff of the game. The expectation is taken with respect to the random realisation of the input according to the prior $p$ as well as any randomisation present in the strategies of the two players. Specifically,
$$V(p(Y|X), q(A|Y)) := \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{a \in \mathcal{A}} p(x)\, p(y|x)\, q(a|y)\, v(x, y, a). \tag{9}$$
The defender wants to minimise $V$ while the adversary wants to maximise it. Unlike the defender, the adversary does not observe the realisation of $X$; for this reason, this is a game of asymmetric information.
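The expected payoff $V$ is a triple sum over inputs, outputs, and guesses, which the following sketch evaluates directly. The instance is hypothetical (the prior, the channel, the adversary strategy `q`, and the 0/1 payoff are all assumptions for illustration), and a finite guess set is assumed.

```python
def expected_payoff(prior, channel, q, v):
    """V = sum_{x,y,a} p(x) p(y|x) q(a|y) v(x, y, a).

    channel[x][y] is the defender's behavioural strategy p(y|x);
    q[y][a] is the adversary's behavioural strategy q(a|y).
    """
    return sum(prior[x] * channel[x][y] * q[y][a] * v(x, y, a)
               for x in range(len(prior))
               for y in range(len(q))
               for a in range(len(q[0])))

def guess_payoff(x, y, a):
    """Hypothetical payoff: the adversary wins one unit for an exact guess."""
    return 1.0 if a == x else 0.0

prior = [0.5, 0.25, 0.25]
channel = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Adversary guesses secret 0 after output 0 and secret 2 after output 1.
q = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(expected_payoff(prior, channel, q, guess_payoff))
```

With this payoff, $V$ is simply the adversary's probability of guessing the secret correctly, so the defender's goal of minimising $V$ matches the intuition of hiding the secret.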

Nash Equilibria and Saddle-Point Strategies
A Nash equilibrium (NE) is a standard solution concept in game theory, which states that each player's strategy should be a best response assuming the strategies of the other player(s) are fixed. For two-player zero-sum games (2PZSGs), the set of NEs has a stronger interpretation: that of a saddle point. We first briefly describe this solution concept.
The defender may adopt the following worst-case argument: assuming that any strategy adopted by the defender will be revealed to the adversary, who then best responds to it, the "robust" optimisation of the defender (the minimiser) becomes the following:
$$\overline{V} := \inf_{p(Y|X) \in \Gamma}\ \sup_{q(A|Y)}\ V(p(Y|X), q(A|Y)).$$
We denote the value of the above optimisation by $\overline{V}$ to indicate that this is the highest expected payoff to the adversary. On the other hand, the best-case scenario of the defender is derived from the following argument: suppose the strategy of the adversary is given and the defender can design their strategy accordingly. Then this optimistic scenario for the defender (which is the worst case for the adversary) leads to the following problem:
$$\underline{V} := \sup_{q(A|Y)}\ \inf_{p(Y|X) \in \Gamma}\ V(p(Y|X), q(A|Y)).$$
Clearly, we have $\underline{V} \le \overline{V}$. If we have $\underline{V} = \overline{V} = V^*$, we say the game has a value $V^*$. Further, a saddle-point strategy pair $(p^*(Y|X), q^*(A|Y))$ is a strategy pair that satisfies the following:
$$V(p^*(Y|X), q(A|Y)) \le V(p^*(Y|X), q^*(A|Y)) \le V(p(Y|X), q^*(A|Y)) \quad \forall p(Y|X) \in \Gamma,\ \forall q(A|Y).$$
That is, a saddle-point strategy attains the value of the game: $V^* = V(p^*(Y|X), q^*(A|Y))$. The argument for saddle-point strategies as the solution concept of the 2PZSG is then strong: the saddle-point strategy gives each player a guaranteed utility no matter what the other player's strategy is.
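The relation between the two values and a saddle point can be illustrated on a small matrix game. This sketch deliberately restricts both players to pure strategies, which is a simplifying assumption (the text works with behavioural strategies); the matrices and function names are hypothetical. Rows are the defender's choices (minimiser) and columns the adversary's (maximiser).

```python
def upper_value(M):
    """Defender commits first: minimise over rows the adversary's best reply."""
    return min(max(row) for row in M)

def lower_value(M):
    """Adversary commits first: maximise over columns the defender's best reply."""
    return max(min(M[d][a] for d in range(len(M))) for a in range(len(M[0])))

def pure_saddle_points(M):
    """Entries that are the maximum of their row and the minimum of their column."""
    points = []
    for d in range(len(M)):
        for a in range(len(M[0])):
            column_min = min(M[i][a] for i in range(len(M)))
            if M[d][a] == max(M[d]) and M[d][a] == column_min:
                points.append((d, a))
    return points

with_saddle = [[3, 5], [2, 1]]           # hypothetical game with a pure saddle point
matching_pennies = [[1, -1], [-1, 1]]    # no pure saddle point
for M in (with_saddle, matching_pennies):
    print(lower_value(M), upper_value(M), pure_saddle_points(M))
```

In the second game the lower value is strictly below the upper value over pure strategies, which is why randomised (behavioural) strategies are needed for a value to exist in general.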
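In what follows, we derive the conditions for the saddle-point strategies of the defender and the adversary, respectively. For the defender, a saddle-point strategy solves inf p(Y|X)∈Γ sup q(A|Y) V(p(Y|X), q(A|Y)). As before, let Y^+ be the set of outputs with a strictly positive probability of realisation. Since only these "on-path" outputs contribute to the expected payoff, we can rewrite Equation (9) as
$$V = \sum_{y \in \mathcal{Y}^+} p(y) \sum_{a \in \mathcal{A}} q(a|y) \sum_{x \in \mathcal{X}} p(x|y)\, v(x, y, a), \qquad p(y) = \sum_{x \in \mathcal{X}} p(x)\, p(y|x), \quad p(x|y) = \frac{p(x)\, p(y|x)}{p(y)}.$$
In particular, for each y, the adversary can put all the probability weight on an action that maximises the expected value of v(X, y, a) with X ∼ p_{X|y}, where p_{X|y} follows Bayes' rule. Note that, although we started from an agnostic stance, Bayes' rule turns out to be indeed the optimal belief update of the adversary. The saddle-point strategy of the defender hence solves the following optimisation:
$$\inf_{p(Y|X) \in \Gamma}\; \sum_{y \in \mathcal{Y}^+} p(y)\, \sup_{a \in \mathcal{A}} \sum_{x \in \mathcal{X}} p(x|y)\, v(x, y, a). \tag{10}$$
For the saddle point of the adversary, we can rewrite Equation (9) as
$$V = \sum_{x \in \mathcal{X}} p(x) \sum_{y : (x, y) \in \Omega} p(y|x) \sum_{a \in \mathcal{A}} q(a|y)\, v(x, y, a).$$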
Therefore, the best strategy of the defender for a given x is to put all the probability weight of p(y|x) on the y that achieves the smallest ∑ a∈A q(a|y)v(x, y, a) across all feasible y's for that x, i.e.,
$$p(y|x) > 0 \;\Rightarrow\; y \in \arg\min_{y' : (x, y') \in \Omega} \sum_{a \in \mathcal{A}} q(a|y')\, v(x, y', a).$$
Hence, the saddle-point strategy of the adversary comes from solving the following optimisation:
$$\sup_{q(A|Y)}\; \sum_{x \in \mathcal{X}} p(x) \min_{y : (x, y) \in \Omega} \sum_{a \in \mathcal{A}} q(a|y)\, v(x, y, a). \tag{11}$$
We will consider the following payoff function: v(x, y, a) = g(a, x) − λu(x, y), where λ ∈ R+ and g, u are real-valued functions. This payoff function can be understood as a weighted difference between the gain of the attacker in guessing the secret and the utility of the channel. We will refer to such a zero-sum game between a defender and an adversary as G (also G-game), which is specified by the tuple ⟨p, Ω, g(a, x), λ, u(x, y)⟩. For such a game, the optimisation problem for the saddle-point strategy of the defender in Equation (10) becomes
$$\inf_{p(Y|X) \in \Gamma}\; \sum_{y \in \mathcal{Y}^+} p(y)\, \sup_{a \in \mathcal{A}} \sum_{x \in \mathcal{X}} p(x|y)\, g(a, x) \;-\; \lambda \sum_{x, y} p(x)\, p(y|x)\, u(x, y). \tag{12}$$
Theorem 1. For any optimal channel design problem in Equation (7), there is an induced game G, where the optimal channel is the saddle-point strategy of the defender. Conversely, for any game G, the saddle-point strategy of the defender is a solution to an induced optimal channel design problem.
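Proof. We showed in Proposition 3 that the optimal channel design problem for any core-concave H is a convex optimisation. Since η is an increasing function, it can be removed from the optimisation without any effect. Now, from convex optimisation theory, we know that there exists a Lagrange multiplier λ ≥ 0 such that the solutions of the original optimisation match those of the following Lagrange relaxation problem:
$$\sup_{p(Y|X) \in \Gamma}\; \sum_{y \in \mathcal{Y}^+} p(y)\, F(p_{X|y}) \;+\; \lambda \Big( \sum_{x, y} p(x)\, p(y|x)\, u(x, y) - u_{\min} \Big).$$
Or equivalently, dropping the constant term and changing the sign,
$$\inf_{p(Y|X) \in \Gamma}\; \sum_{y \in \mathcal{Y}^+} p(y)\, \big( -F(p_{X|y}) \big) \;-\; \lambda \sum_{x, y} p(x)\, p(y|x)\, u(x, y).$$
Since F is concave, −F is convex and can therefore be expressed as the supremum of a family of functions that are linear in the posterior, i.e., −F(p_{X|y}) = sup a∈A ∑ x∈X p(x|y) g_F(a, x) for a suitable function g_F. In particular, such a g_F(a, x) can be constructed by application of the supporting hyperplanes and a limit argument, as presented, e.g., in [19] (Theorem 5). Therefore, the optimisation can be written as
$$\inf_{p(Y|X) \in \Gamma}\; \sum_{y \in \mathcal{Y}^+} p(y)\, \sup_{a \in \mathcal{A}} \sum_{x \in \mathcal{X}} p(x|y)\, g_F(a, x) \;-\; \lambda \sum_{x, y} p(x)\, p(y|x)\, u(x, y).$$
Now, note that the minimisation is defining exactly the saddle-point strategy of the defender in a game G = ⟨p, Ω, g_F(a, x), λ, u(x, y)⟩ as given in Equation (12).
Now, for the reverse direction, consider the game G = ⟨p, Ω, g(a, x), λ, u(x, y)⟩. The saddle-point strategy of the defender is a solution of the optimisation in Equation (12). Note that sup a∈A ∑ x∈X p(x|y)g(a, x) characterises a convex function of the posterior, or the negative of a concave function, which we call F_g, i.e., let −F_g(p_{X|y}) ≜ sup a∈A ∑ x∈X p(x|y)g(a, x).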
With this notation, the saddle-point strategy of the defender solves
$$\inf_{p(Y|X) \in \Gamma}\; -\sum_{y \in \mathcal{Y}^+} p(y)\, F_g(p_{X|y}) \;-\; \lambda \sum_{x, y} p(x)\, p(y|x)\, u(x, y). \tag{13}$$
Let a saddle-point strategy of the defender be denoted by p*(Y|X). Now consider the following convex optimisation:
$$\inf_{p(Y|X) \in \Gamma}\; -\sum_{y \in \mathcal{Y}^+} p(y)\, F_g(p_{X|y}) \quad \text{subject to} \quad \sum_{x, y} p(x)\, p(y|x)\, u(x, y) \ge u_{\min}, \tag{14}$$
where u_min = ∑ x,y p(x)p*(y|x)u(x, y). We claim that these two convex optimisations are equivalent. To see this, note that the KKT conditions are necessary and sufficient for optimality in both optimisations. Moreover, if we take the λ in Equation (13) to be the Lagrange multiplier of the minimum utility constraint in Equation (14), these KKT conditions are exactly identical, except that Equation (14) has an additional complementary slackness condition: λ(∑ x,y p(x)p(y|x)u(x, y) − u_min) = 0. Since λ > 0, we should have, for an optimum of Equation (14), ∑ x,y p(x)p(y|x)u(x, y) − u_min = 0, which holds for the saddle-point strategy by our specific choice of u_min = ∑ x,y p(x)p*(y|x)u(x, y).
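To make the function F_g used in the reverse direction of the proof more tangible, the sketch below evaluates −F_g(π) = max over a of ∑ x π(x) g(a, x) for a finite action set and spot-checks its convexity at a midpoint. The helper name `neg_F_g`, the identity gain function and the two example posteriors are assumptions chosen for illustration; with the identity gain, −F_g reduces to the largest posterior probability.

```python
def neg_F_g(posterior, gain, actions):
    """Return -F_g(pi) = max_a sum_x pi(x) g(a, x) for a finite action set."""
    return max(
        sum(pi_x * gain(a, x) for x, pi_x in enumerate(posterior))
        for a in actions
    )


def identity_gain(a, x):
    # Assumed gain: one unit when the guess equals the secret.
    return 1.0 if a == x else 0.0


actions = [0, 1, 2]
pi_1 = [0.8, 0.15, 0.05]
pi_2 = [0.1, 0.2, 0.7]
midpoint = [(u + w) / 2 for u, w in zip(pi_1, pi_2)]

value_1 = neg_F_g(pi_1, identity_gain, actions)
value_2 = neg_F_g(pi_2, identity_gain, actions)
value_mid = neg_F_g(midpoint, identity_gain, actions)

# Convexity of -F_g (a pointwise maximum of linear functions), checked at one midpoint.
print(value_mid <= 0.5 * (value_1 + value_2))
```

Because −F_g is a pointwise maximum of functions that are linear in the posterior, the midpoint inequality holds for any pair of posteriors, not only for the pair chosen here.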
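When the action space of the adversary is finite, the saddle-point strategies can be computed using linear programming. Specifically:
Proposition 5. If the game has a finite number of pure strategies, then the saddle-point strategies expressed by Equations (10) and (11) can be computed as the solution to the following linear program (LP) and its dual:
$$\begin{aligned} \min_{p(y|x),\, v_y} \quad & \sum_{y \in \mathcal{Y}} v_y \;-\; \lambda \sum_{(x, y) \in \Omega} p(x)\, p(y|x)\, u(x, y) \\ \text{s.t.} \quad & v_y \ge \sum_{x \in \mathcal{X}} g(a, x)\, p(x)\, p(y|x), \quad \forall a \in \mathcal{A},\ \forall y \in \mathcal{Y}, \\ & p(y|x) \ge 0, \quad \forall (x, y) \in \Omega, \qquad p(y|x) = 0, \quad \forall (x, y) \notin \Omega, \\ & \sum_{y \in \mathcal{Y}} p(y|x) = 1, \quad \forall x \in \mathcal{X}. \end{aligned}$$
Introducing variables u = (u_x) for x ∈ X, the dual of the above LP is
$$\begin{aligned} \max_{q(a|y),\, u_x} \quad & \sum_{x \in \mathcal{X}} p(x)\, u_x \\ \text{s.t.} \quad & u_x \le \sum_{a \in \mathcal{A}} \big( g(a, x) - \lambda u(x, y) \big)\, q(a|y), \quad \forall (x, y) \in \Omega, \\ & q(a|y) \ge 0, \quad \forall a \in \mathcal{A},\ \forall y \in \mathcal{Y}, \\ & \sum_{a \in \mathcal{A}} q(a|y) = 1, \quad \forall y \in \mathcal{Y}. \end{aligned}$$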
Proof. In the first LP, the constraints v_y ≥ ∑ x∈X g(a, x)p(x)p(y|x), ∀a ∈ A, ∀y ∈ Y, together with the minimisation of the objective, guarantee that, for each y, the optimisation chooses v_y = max a∈A ∑ x∈X g(a, x)p(x)p(y|x); hence, the objective function becomes exactly as in Equation (10). Similarly, for the second LP, the constraints u_x ≤ ∑ a∈A (g(a, x) − λu(x, y)) q(a|y), ∀(x, y) ∈ Ω, together with the maximisation of the objective, guarantee that the optimisation chooses u_x = min y∈Y,(x,y)∈Ω ∑ a∈A (g(a, x) − λu(x, y)) q(a|y), which is exactly the optimisation problem of the adversary as in Equation (11).
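The two optimisations in the proof above can be evaluated directly for a candidate strategy pair. The following sketch is not an LP solver; it only computes the defender's objective (with v_y set to its maximum over actions) and the adversary's objective (with u_x set to its minimum over permissible outputs) and compares them. By weak LP duality, if the two numbers coincide then both candidates are optimal. The toy game (binary secret, uniform prior, identity gain, utility 1 when the output equals the secret, λ = 0.5), the candidate strategies and all function names are assumptions made for this example.

```python
def defender_objective(prior, channel, gain, utility, lam, actions):
    """sum_y max_a sum_x g(a,x) p(x) p(y|x)  -  lam * sum_{x,y} p(x) p(y|x) u(x,y).

    The channel is assumed to respect the hard constraints already.
    """
    n_x, n_y = len(prior), len(channel[0])
    leak = sum(
        max(
            sum(gain(a, x) * prior[x] * channel[x][y] for x in range(n_x))
            for a in actions
        )
        for y in range(n_y)
    )
    expected_utility = sum(
        prior[x] * channel[x][y] * utility(x, y)
        for x in range(n_x)
        for y in range(n_y)
    )
    return leak - lam * expected_utility


def adversary_objective(prior, guess, gain, utility, lam, actions, omega):
    """sum_x p(x) min_{y:(x,y) in Omega} sum_a (g(a,x) - lam*u(x,y)) q(a|y)."""
    total = 0.0
    for x, p_x in enumerate(prior):
        total += p_x * min(
            sum((gain(a, x) - lam * utility(x, y)) * guess[y][a] for a in actions)
            for y in range(len(guess))
            if (x, y) in omega
        )
    return total


# Hypothetical toy game.
prior = [0.5, 0.5]
actions = [0, 1]
omega = {(x, y) for x in range(2) for y in range(2)}  # every pair permitted
lam = 0.5


def gain(a, x):
    return 1.0 if a == x else 0.0


def utility(x, y):
    return 1.0 if y == x else 0.0


# Candidate pair: an uninformative channel and a guess that follows the
# output with probability 0.75.
channel_candidate = [[0.5, 0.5], [0.5, 0.5]]
guess_candidate = [[0.75, 0.25], [0.25, 0.75]]

primal = defender_objective(prior, channel_candidate, gain, utility, lam, actions)
dual = adversary_objective(prior, guess_candidate, gain, utility, lam, actions, omega)
print(primal, dual)
```

In this instance both objectives evaluate to 0.25, so the candidate pair is a saddle point of the toy game and 0.25 is its value. For larger instances the two LPs would normally be handed to an LP solver rather than checked by hand.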

The Adversary's Problem: Robust Inference
One important advantage of the game-theoretical analysis is that it connects the problems of the defender and the adversary. Here, we provide a practical interpretation of the adversary's problem: suppose we would like to extract information about (i.e., infer) X by observing Y. We know the prior over X, but we do not know p_{Y|X}, i.e., the channel. All we know is that the channel has to respect some hard and/or soft operational constraints. What is the best inference about X when the channel itself is unknown? One approach is to consider the worst case among all possible channels that satisfy the constraints. The resulting "robust" strategy guarantees a minimum level of inference for any feasible realisation of the channel. The game-theoretical analysis reveals that the optimal channel design problem and the robust inference problem are equivalent; i.e., they are duals of each other.
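The guarantee offered by a given inference rule can be computed by brute force in small instances. The sketch below assumes that only hard constraints restrict the channel (a soft constraint is taken to be folded into the payoff through λ, as in the G-game). Under that assumption the expected payoff is linear in the channel for a fixed inference rule, so its minimum over feasible channels is attained at a deterministic channel, and enumerating those suffices. The toy payoff, the two inference rules and all names are assumptions of this example.

```python
from itertools import product


def expected_payoff(prior, channel, guess, payoff):
    """V = sum_{x,y,a} p(x) p(y|x) q(a|y) v(x, y, a)."""
    return sum(
        prior[x] * channel[x][y] * guess[y][a] * payoff(x, y, a)
        for x in range(len(prior))
        for y in range(len(guess))
        for a in range(len(guess[y]))
    )


def worst_case_payoff(prior, guess, payoff, omega, n_outputs):
    """Minimum of V over all deterministic channels that respect Omega."""
    n_x = len(prior)
    allowed = [[y for y in range(n_outputs) if (x, y) in omega] for x in range(n_x)]
    worst = float("inf")
    for outputs in product(*allowed):  # outputs[x] is the y emitted for secret x
        channel = [
            [1.0 if y == outputs[x] else 0.0 for y in range(n_outputs)]
            for x in range(n_x)
        ]
        worst = min(worst, expected_payoff(prior, channel, guess, payoff))
    return worst


# Hypothetical toy game: v(x, y, a) = g(a, x) - 0.5 * u(x, y).
prior = [0.5, 0.5]
omega = {(x, y) for x in range(2) for y in range(2)}


def payoff(x, y, a):
    gain = 1.0 if a == x else 0.0
    utility = 1.0 if y == x else 0.0
    return gain - 0.5 * utility


naive_guess = [[1.0, 0.0], [0.0, 1.0]]      # always guess a = y
hedged_guess = [[0.75, 0.25], [0.25, 0.75]]  # follow y with probability 0.75

naive_guarantee = worst_case_payoff(prior, naive_guess, payoff, omega, 2)
hedged_guarantee = worst_case_payoff(prior, hedged_guess, payoff, omega, 2)
print(naive_guarantee, hedged_guarantee)
```

In this toy game the rule that always trusts the output has a guarantee of 0, because a channel that swaps the two outputs defeats it, whereas the hedged rule secures 0.25 against every feasible channel.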

Measure-Invariant Optimality
Notice that in all cases seen so far the optimal solution depends on the choice of entropy. There is, however, a particular case studied in Reference [1] where the optimiser is universal, i.e., is the same for all entropies:
Proposition 6. (Theorem 1 in Reference [1]) When there are no soft constraints and the hard constraints are equivalent to just a size-cap of k on the pre-images of the outputs, there is a closed form solution for the Nash equilibrium. Moreover, this solution is universally optimal, i.e., it is optimal for any choice of entropy.

Uncertainty about the Prior
We have assumed that the input is realised according to a single distribution p that is known to the adversary. We now analyse the setting where the prior distribution of the input can be one of a number of possibilities, each happening with a known probability (a distribution over distributions). That is, the distribution of the input itself depends on a hidden random variable, which we refer to as the context. The adversary knows the joint statistics of the hidden context and the input, but does not get to observe the realisation of the context.
At a high level, the main result of this section is the following: the best strategy for the defender is not to "customise" its strategy to the particular prior given each context, but rather to build an "averaged prior", design the best strategy over this averaged prior, and play it irrespective of the context. This result implies that the context-dependent optimal channel design problem reduces to an equivalent context-independent channel design problem over the mixed prior. This result may not be immediately intuitive, as there is a counterargument as follows: among the available priors, there are some particularly "good" ones, in the sense that they are very conducive to hiding the secret (e.g., they are very close to uniform in a symmetric constraint setting). Then shouldn't the defender adopt the optimal channel for such priors in those contexts, especially if they have a high probability weight of occurrence? Our result refutes this intuitive argument.
To formalise the setting, let the space of the discrete random variable of the context be C = {c 1 , . . . , c |C| }. Without loss of generality, we assume that the context has full support, i.e., p C (c) > 0, ∀c ∈ C. The channel designer (the defender) knows the true distribution of the secret. Technically speaking, the defender "observes" the realisation of the context. The adversary, on the other hand, does not directly observe the context, but knows the probability of the realisation of each context, p C , as well as the (conditional) probability distribution of the secret given each context, p X|C . Note that knowledge of p C and p X|C is equivalent to the knowledge of the "joint" probability distribution of the context and the secret p X,C .
The adversary only sees the output Y and wants to infer the input X. In the worst case, one can assume that the adversary knows p_{Y|X,C} and hence, using their knowledge of p_{X,C}, can use Bayes' rule to update their belief about the secret after observing Y, i.e., construct the posterior:
$$p(x|y) = \frac{\sum_{c \in \mathcal{C}} p(x, c)\, p(y|x, c)}{\sum_{x' \in \mathcal{X}} \sum_{c \in \mathcal{C}} p(x', c)\, p(y|x', c)}.$$
Note that the defender is not directly interested in hiding the context and only cares about X, but should be wary of how the adversary can use their information about the joint distribution of the context and input to infer the input from the observation. In addition, for clarity, we repeat that the adversary observes neither the context nor the secret. (For the scenario where the adversary can directly observe the context, the problem reduces to designing |C| optimal channels according to optimisations as in Equation (7) with priors p_{X|c} for each c ∈ C.)
The defender decides what observable to produce for each secret in each context, potentially using randomisation and benefiting from the ambiguity that it can inject. As before, the strategy has to satisfy some operational constraints. We may have hard constraints prescribing which secrets can produce which observables, which in part determine which subsets of secrets can be conflated with each other. In the previous sections, we expressed these "hard" operational constraints through Ω ⊆ X × Y, representing the set of permissible secret-observable pairs. In the presence of contexts, in the most general form, the permissible observables for a secret may depend on the context as well; thus, Ω should now be a subset of X × C × Y. However, for the result of this section, we assume that these constraints are context-independent, i.e., the same subset of observables is permissible for a secret irrespective of the context, so we keep Ω to be a subset of X × Y.
Likewise, there can be soft operational constraints in the form of satisfying a minimum expected utility. The expectation is now taken with respect to the context as well; that is, the expectation of the utility with respect to X, C and Y must be no less than u_min. However, for the result of this section, we assume that the utility function u, i.e., the measure of "goodness" of each observable for each secret, does not depend on the context. Hence, the soft constraint reads ∑ x,c,y p(x, c)p(y|x, c)u(x, y) ≥ u_min.
As before, without loss of generality, assume that we are dealing with core-concave functions, i.e., F is concave and η is increasing. Moreover, note that, again, the choice of the strategy cannot affect the prior entropy of the secret. Hence, the problem of designing for minimum leakage is again that of maximising the posterior entropy of the secret subject to the operational constraints.
[...] that is achieved by the optimal strategy per Theorem 2 against this heuristic strategy of playing the best channel for each prior. As we can see, for any weight of the two priors (except trivially when the weight is either 0 or 1, where the two strategies become the same), the optimal strategy strictly outperforms the heuristic strategy.
Figure 3. Shannon's posterior entropy under the optimal design as per Proposition 2 and under the heuristic best alternative, where the best channel for each prior is designed and played according to the context. The priors are as follows: P1 = (1/3, 1/3, 1/3) (the "good" prior) and P2 = (0.8, 0.15, 0.05) (the "bad" prior). The x-axis is the probability (weight) of P1; the y-axis is the posterior entropy. As we can see, except trivially for the two end-points, the optimal design strictly outperforms this "best" heuristic.
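The "averaged prior" over which the defender designs the channel in this section can be read as the marginal of the joint distribution of context and secret, p_X(x) = ∑ c p_C(c) p_{X|C}(x|c). The sketch below computes it for the two priors used in Figure 3; the function name `mixed_prior` and the weight 0.4 for P1 are assumptions made for illustration.

```python
def mixed_prior(context_probs, conditional_priors):
    """Return p_X(x) = sum_c p_C(c) * p_{X|C}(x | c)."""
    n_x = len(conditional_priors[0])
    return [
        sum(p_c * prior_c[x] for p_c, prior_c in zip(context_probs, conditional_priors))
        for x in range(n_x)
    ]


P1 = [1 / 3, 1 / 3, 1 / 3]   # the "good" prior from Figure 3
P2 = [0.8, 0.15, 0.05]       # the "bad" prior from Figure 3
weight_P1 = 0.4              # assumed probability of the first context

averaged = mixed_prior([weight_P1, 1 - weight_P1], [P1, P2])
print(averaged)
```

According to the result discussed in this section, the defender would then design a single channel for this averaged prior and play it in both contexts, rather than switching between the channels that are optimal for P1 and P2 individually.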

Conclusions and Future Work
We investigated the problem of designing constrained channels that leak minimally about their input in a general information-theoretical setting. We generalised the notion of information leakage to encompass a broad range of existing entropy-based measures and established that, with respect to all such measures, the problem of designing optimal channels is a convex optimisation with zero duality gap, where the KKT conditions are both necessary and sufficient for optimality.
We then introduced a game-theoretical framework in which the channel designer is a defender against an information-extracting adversary, and showed that the Bayes-Nash equilibrium strategies of the defender correspond to the optimal channels. The game-theoretical framework reveals a dual connection between our optimal channel design and a robust inference problem. In particular, the equilibrium strategies of the adversary solve the following interesting problem: suppose we know the prior distribution of a random variable and the operational specification of a channel in terms of soft and hard constraints, but the exact realisation of the channel is not known, and we would like to make the best inference about the input by observing the output of an instance of such channels. The equilibrium strategies of the adversary are robust in the sense that they guarantee a minimal level of inference for any realisation of the channel within the family defined by the given constraints.
While in this work we emphasised the viewpoint of the defender, future work can investigate the adversary's problem of robust inference further. Moreover, as suggested by one reviewer, the implications of our results for general system design and analysis, for instance in the sense of Žampa's systems theory [27], would be an interesting direction. This is inspired by the observation that our notion of a "channel" can be seen as an example of a "stochastic (abstract) system".