Asymptotically Optimal Adversarial Strategies for the Probability Estimation Framework

The probability estimation framework involves direct estimation of the probability of occurrence of outcomes conditioned on measurement settings and side information. It is a powerful tool for certifying randomness in quantum nonlocality experiments. In this paper, we present a self-contained proof of the asymptotic optimality of the method. Our approach refines earlier results to allow a better characterisation of optimal adversarial attacks on the protocol. We apply these results to the (2,2,2) Bell scenario, obtaining an analytic characterisation of the optimal adversarial attacks for adversaries bound only by no-signalling principles, while also demonstrating the asymptotic robustness of the probability estimation method to deviations from expected experimental behaviour. We also study extensions of the analysis to quantum-limited adversaries in the (2,2,2) Bell scenario and to no-signalling adversaries in higher (n,m,k) Bell scenarios.


Introduction
Randomness has proven to be a valuable resource for a multitude of tasks, be it computation or communication. In cryptography, access to reliable random bits is essential, since the security of various cryptographic primitives is known to be compromised if the incorporated randomness is of poor quality [Dod+04; Aus+14; DY14]. In the study of random network modelling, being able to sample random graphs reliably and uniformly at random is crucial [Ors+15]. And for some problems, randomised algorithms are known to vastly outperform their deterministic counterparts [MR95].
A distinction between two notions of randomness, that of process and product, is discussed in [Sca19] (chapter 8). Although both notions are tightly connected, randomness of a process refers to its unpredictability, while that of a product refers to a lack of pattern in it. An unpredictable process will, with high probability, produce a sequence (a string of bits, say) that is patternless; on the other hand, a seemingly irregular string of bits might not be unpredictable and instead be a probabilistic mixture of pre-recorded information. While product randomness suffices for tasks like Monte Carlo simulations, sampling and those involving randomised algorithms, cryptographic applications involving an adversary necessitate process randomness.
Process randomness, while being non-existent in the strictest interpretation of any classical theory, is permissible in quantum mechanics; an important example of this is quantum non-locality as manifested in a Bell experiment. In its quintessential form, the setup of a Bell experiment comprises an entangled quantum system shared between two spatially separated stations A and B receiving inputs x and y, and recording outcomes a and b, respectively. If, after n successive trials, the observed correlations between the outcomes conditioned on the settings violate a Bell inequality, then it can be ruled out that the outcomes were preassigned by some probabilistic mixture of deterministic processes. Moreover, the outcomes are unpredictably random, not only to the respective users of the devices at the two stations, but also to an adversary, even one having a complete understanding of the Bell experiment. This relationship between non-locality in quantum mechanics and its random nature is at the foundation of various device independent random number generation protocols.
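As a concrete illustration (our own, not part of the original presentation), the following minimal sketch evaluates the canonical CHSH Bell expression S = E(0,0) + E(0,1) + E(1,0) − E(1,1) for a behaviour p(a, b|x, y) in the two-station, binary-setting, binary-outcome scenario described above; the function names and dictionary representation are assumptions. Any probabilistic mixture of deterministic preassignments obeys |S| ⩽ 2, while the Tsirelson-optimal quantum behaviour reaches 2√2.

```python
# Illustrative sketch (names ours): CHSH value of a behaviour p(a,b|x,y).
import itertools
import math

def correlator(p, x, y):
    """E(x, y) = sum_{a,b} (-1)^(a+b) p(a, b|x, y) for binary outcomes a, b."""
    return sum((-1) ** (a + b) * p[(a, b, x, y)]
               for a, b in itertools.product((0, 1), repeat=2))

def chsh(p):
    """CHSH expression S; |S| <= 2 for every mixture of deterministic strategies."""
    return (correlator(p, 0, 0) + correlator(p, 0, 1)
            + correlator(p, 1, 0) - correlator(p, 1, 1))

# Tsirelson-optimal quantum behaviour: p(a,b|x,y) = (1 + (-1)^(a+b+xy)/sqrt(2))/4.
quantum = {(a, b, x, y): (1 + (-1) ** (a + b + x * y) / math.sqrt(2)) / 4
           for a, b, x, y in itertools.product((0, 1), repeat=4)}

assert abs(chsh(quantum) - 2 * math.sqrt(2)) < 1e-12  # exceeds the local bound 2
```

A deterministic behaviour (e.g. both stations always outputting 0) evaluates to S = 2, saturating but not violating the local bound.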
Device independence is considered a gold standard in cryptographic tasks such as quantum random number generation and quantum key distribution, in which the respective users are not required to know or trust the inner machinery of their devices, thus treating them as mere black boxes to which they can provide inputs and record outcomes. The only assumption that the experimental setup must satisfy is that the measurement choices of the devices must be uncorrelated with their inner workings. This is the measurement independence assumption, which is ultimately untestable, but is tacitly assumed, arguably, in almost all scientific experiments. The no-signalling condition, namely that the outcome recorded at each station is not influenced by the choice of measurement at the other station, holds throughout the experiment because of the space-like separation between the stations and the impossibility of superluminal signalling in accordance with the special theory of relativity. Furthermore, the adversary trying to simulate the observed statistics may be considered computationally unbounded, a standard that falls under the paradigm of information-theoretic security. Over the years, technological advancement has facilitated loophole-free Bell non-locality experiments which have not only provided experimental validation for ruling out a classical description of nature [Giu+15; Sha+15; Hen+15; Ros+17], but have also found practical applications in device independent quantum randomness generation and device independent quantum key distribution [Bie+18; Sha+21; Li+21].
The probability estimation framework is a broadly applicable framework for performing device independent quantum randomness generation (DIQRNG) upon a finite sequence of loophole-free Bell experiment data, and involves direct estimation of the amount of certifiable randomness by obtaining high-confidence bounds on the probability of the observed measurement outcomes conditioned on the measurement settings in the presence of classical side information [ZKB18; KZB20; BZ20]. Advantageous primarily for its demonstrated applicability to Bell tests with small Bell violations and its high efficiency for a finite number of trials, it can also accommodate changing experimental conditions and allows early stoppage upon meeting certain criteria. Moreover, it can be extended to randomness generation with quantum devices beyond the device independent scenario.
The probability estimation framework for DIQRNG is provably secure against adversaries who do not possess entanglement with the sources. Security against more general adversaries, with quantum entanglement with the sources, is possible with the quantum estimation framework [KZF18], for which the constructions of the probability estimation framework can often be translated to the quantum estimation framework (as was done in [Zha+20]), so that progress with the former framework can often be used for the more general latter framework [Sha+21].
The asymptotic optimality of the probability estimation framework was discussed in [KZB20]. The specific result of asymptotic optimality is as follows: given a sufficiently large number of trials sampling from a fixed behaviour (i.e., a set of quantum statistics), the amount of certified randomness per trial is arbitrarily close to a certain upper limit. Then [KZB20] argues, appealing to convex geometry and the asymptotic equipartition property (AEP), that an adversary can always implement a probabilistic mixture of strategies, independent and identically distributed across successive experimental trials, that generates observed statistics consistent with the fixed behaviour while not needing to generate more than that same upper limit of randomness per trial that is certified by the probability estimation framework. This is important in the sense that the framework certifies all the randomness conceded by the adversary in that particular attack, while also showing that there is no advantage to be gained for the adversary by resorting to (more sophisticated) memory attacks.
In this paper, we provide a full derivation of the asymptotic optimality of the probability estimation framework, filling in some steps omitted by [KZB20], along the way obtaining a better characterisation of the adversary's optimal probabilistic mixture for generating the observed statistics. A better understanding of the optimal attack in the asymptotic regime sets a benchmark enabling the implementer of the protocol to defend against these attack modes. Making precise the arguments from convex geometry, we explicitly describe the optimal strategy that an adversary (restricted only by the no-signalling condition) can employ, using the minimum required number of different strategies in convex mixture to simulate the observed statistics. Our improvement, with a more self-contained approach, upon the result in [KZB20] is to reduce the smallest number of required strategies by one. Specifically, the smallest possible number of strategies is one more than the dimension of the set of admissible distributions of a trial. (We assume the set of admissible probability distributions of a given trial to be closed and convex, taking the convex closure when this assumption is not met; the dimension dim(C) of a non-empty convex subset C of the ambient real vector space is then the dimension of the smallest affine subset containing C.) Our derivation elucidates how only the classical form of the asymptotic equipartition property is needed for the probability estimation framework, allowing a simplified treatment. We also consider the question of robustness of the probability estimation framework, deriving a sufficient condition for a probability estimation factor (optimised at a particular distribution) to certify randomness at a positive rate at a statistically different distribution.
We apply our results to the (2,2,2) Bell scenario (the scenario of two parties, two measurement settings, and two outcomes), obtaining an analytic characterisation of the optimal attack of an adversary (restricted only by the no-signalling condition) holding classical side information. We show that the optimal adversarial attack involves a decomposition of the observed statistics in terms of a single extremal no-signalling (super-quantum) correlation and eight local deterministic correlations. The proof of optimality relies upon the fact that equal mixtures of two extremal no-signalling non-local super-quantum correlations are expressible as an equal mixture of four local deterministic correlations. We show that this result does not generalise to higher scenarios such as the (3,2,2), (2,3,2) and (2,2,3) Bell scenarios, thereby indicating that the possibility of an optimal attack involving only a single extremal strategy is only assured in the minimal (2,2,2) Bell scenario. Furthermore, we consider the possibility of an adversary holding classical side information (and hence restricted to probabilistic attack strategies) who tries to simulate the observed statistics using quantum-achievable probability distributions, while conceding as little randomness as possible. Assuming a uniform settings distribution, numerical studies restricted to a two-dimensional slice of the set of quantum-achievable distributions provided some initial evidence that the optimal quantum-achievable attack strategy involves only one extremal quantum correlation, but we were not able to settle this and have phrased it as a conjecture.
The rest of the article is organised as follows. In Section 2, we review the probability estimation framework: Theorem 1 formalises the central idea and Theorem 2 establishes a lower bound on the smooth conditional min-entropy of the sequence of outcomes conditioned on the settings and side information. We also present a simplified proof of Lemma 1, an important result enabling the algorithm for executing the PEF method, as compared to the proofs in [ZKB18; KZB20]. In Section 3, we present our complete proof of asymptotic optimality, study the implications for finding an optimal adversarial attack strategy, and derive a result on robustness. In Section 4, we apply our results to the (2,2,2) Bell scenario, obtaining an analytic characterisation of the optimal attack strategy for an adversary restricted only by the no-signalling condition. The optimal attack comprises a decomposition of the observed statistics in terms of a single Popescu-Rohrlich (PR) correlation and (up to) eight local deterministic correlations. We show that for a higher number of parties, settings, and/or outcomes, a crucial result from the (2,2,2) Bell scenario concerning equal mixtures of extremal non-local no-signalling correlations does not hold, and infer that the optimal attack may require more than one non-local distribution in general. Returning to the (2,2,2) scenario, we discuss a conjecture that the optimal strategy to mimic the observed statistics by means of a probabilistic mixture of quantum-achievable correlations comprises only a single extremal quantum correlation and (up to) eight local deterministic correlations.

The Probability Estimation Framework
The probability estimation method relies on the probability estimation factor (PEF), which is a function assigning a score to the results of a single trial of a quantum experiment, with higher scores corresponding to more randomness. The paradigmatic application is to a Bell non-locality experiment comprising multiple spatially separated parties providing inputs (measurement settings) to measuring devices and recording outputs (observed outcomes); an experimental trial's results then consist of both the choice of inputs and the recorded outputs for that trial. After many repeated trials the product of the PEFs from all the trials is used to estimate the probability of outcomes conditioned on the settings.
For the examples considered in Section 4, we will consider the canonical scenario of two measuring parties Alice and Bob, each selecting respective binary measurement settings X and Y and recording respective binary outcomes A and B, which we refer to as the (2,2,2) Bell scenario. For now we treat things in a general manner, as is done in [ZKB18] and [KZF18], modelling the trial settings for all parties and the outcomes for all parties with single random variables Z and C, respectively, taking values from respective finite-cardinality sets Z and C. When applied to the (2,2,2) Bell scenario, C comprises the ordered pair (A, B) and Z comprises the ordered pair (X, Y).

Fig. 1: A schematic representation of the set-up for device independent randomness generation in a two-party experiment. The outer rectangular box represents a secure location. The adversary E has perfect knowledge of the processes inside the secure location but cannot tamper with them. The state Ψ_ABE represents the resource shared between the two parties. X_k, Y_k are the trial inputs and A_k, B_k are the trial outcomes for the kth trial.
The results of a sequence of n time-ordered trials are represented by the sequences C = (C_1, C_2, . . . , C_n) and Z = (Z_1, Z_2, . . . , Z_n); and so, (C, Z) realises values (c, z) ∈ C^n × Z^n, where C^n, Z^n are the n-fold Cartesian products of C, Z. A PEF is then a real-valued function of C and Z satisfying certain conditions, while the product of PEFs from all trials will be a function of C and Z. High values of the PEF product will correlate with low values of P(C|Z), the conditional probability of the outcomes given the settings.
To define PEFs, we introduce the notion of a trial model: a set Π encompassing all joint probability distributions of settings and outcomes which are compatible with basic assumptions about the experiment. One important trial model that we consider is Π_Q, consisting of joint distributions of (C, Z) for which the conditional distribution of C conditioned on Z can be realised by a measurement on a quantum system. Here we introduce the convention, used throughout, of using lower-case Greek letters with random variables as arguments to denote distributions, i.e., µ(C, Z) and µ(C|Z) denote the joint distribution of (C, Z) and the conditional distribution of C given Z, respectively. Another important trial model is Π_NS (NS stands for "no-signalling"), consisting of distributions for which probabilities of measurement outcomes at one location are independent of measurement settings at the other, distant locations. (This is more clearly understood in the Alice-Bob example, where one of the no-signalling conditions is that ∑_b µ(a, b|x, y) = ∑_b µ(a, b|x, y′) for all a, x and y ≠ y′.) A third important trial model is the set Π_L of distributions for which the conditional distributions of outcomes conditioned on settings are local, which means they can be expressed as convex mixtures of local deterministic behaviours. In the bipartite setting, the conditional distribution µ_{LD,λ}(A, B|X, Y), also referred to as a behaviour, is local deterministic if it has the form µ_{LD,λ}(a, b|x, y) = 𝟙[a = λ_A(x)] 𝟙[b = λ_B(y)] (where 𝟙[·] represents the function that evaluates to 1 if the condition within holds, 0 otherwise). In words, the outcomes are functions of the local settings and the local hidden variable λ, which can be understood to be a list of outcomes for all possible settings. A formal definition involving more parties and an arbitrary (albeit same) number of outcomes and settings for each party can be found in (46).
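To make these trial models concrete, the following sketch (our own illustration; the dictionary representation of behaviours is an assumption) enumerates the 16 local deterministic behaviours of the (2,2,2) scenario, where the hidden variable λ lists one outcome per setting for each party, and checks that every one of them satisfies the no-signalling conditions.

```python
# Illustrative sketch: the 2^2 * 2^2 = 16 local deterministic behaviours of the
# (2,2,2) scenario, each satisfying the no-signalling marginal conditions.
import itertools

def deterministic_behaviour(f_a, g_b):
    """p(a,b|x,y) = 1[a = f_a[x]] * 1[b = g_b[y]] for outcome lists f_a, g_b."""
    return {(a, b, x, y): float(a == f_a[x] and b == g_b[y])
            for a, b, x, y in itertools.product((0, 1), repeat=4)}

def is_no_signalling(p, tol=1e-12):
    # Bob's marginal must not depend on x; Alice's must not depend on y.
    for b, y in itertools.product((0, 1), repeat=2):
        m = [sum(p[(a, b, x, y)] for a in (0, 1)) for x in (0, 1)]
        if abs(m[0] - m[1]) > tol:
            return False
    for a, x in itertools.product((0, 1), repeat=2):
        m = [sum(p[(a, b, x, y)] for b in (0, 1)) for y in (0, 1)]
        if abs(m[0] - m[1]) > tol:
            return False
    return True

local_deterministic = [deterministic_behaviour(f, g)
                       for f in itertools.product((0, 1), repeat=2)
                       for g in itertools.product((0, 1), repeat=2)]
assert len(local_deterministic) == 16
assert all(is_no_signalling(p) for p in local_deterministic)
```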
The sets Π_L, Π_Q and Π_NS satisfy the strict inclusions Π_L ⊊ Π_Q ⊊ Π_NS. Certain distributions in Π_Q and Π_NS violate a Bell inequality and are known to contain randomness; they are contained in Π_Q \ Π_L and Π_NS \ Π_L, respectively. It is precisely the inability to decompose such distributions into deterministic ones, as in Π_L, that implies the presence of randomness. The objective of the PEF approach is to quantify the randomness contained in such distributions. As trial models specify the joint distribution µ(C, Z), and for the above examples of trial models we gave only the conditional distributions µ(C|Z), one must also specify the marginal distribution of the settings µ(Z). For the discussions of Π_Q and Π_NS in subsequent sections, any fixed distribution satisfying µ(Z = z) > 0 for all z ∈ Z is permitted. An example of a fixed settings distribution is the equiprobable distribution Unif(Z) defined as Unif(z) = 1/|Z| for all z ∈ Z. As a discrete probability distribution is effectively an ordered list of numbers in [0, 1] (the probabilities), trial models are always subsets of ℝ^N, where N is fixed by the cardinalities of C and Z. This enables us to use a geometric approach to study these sets, which proves invaluable for some arguments.
We can now define PEFs. We use the notation E µ [. . .] and P µ (. . .) to denote expectation and probability, respectively, with respect to a distribution µ; and for the sake of notational concision we sometimes omit commas in distributions or functions of more than one random variable, for instance, µ(CZ) and f (CZ) must be understood to mean µ(C, Z) and f (C, Z).
Definition 1 (Probability Estimation Factor). A probability estimation factor (PEF) with power β > 0 for the model Π is a function F : C × Z → [0, ∞) such that

E_σ[F(CZ) σ(C|Z)^β] ⩽ 1  for every σ(CZ) ∈ Π.

In the expression above, σ(C|Z) denotes a random variable that is a function of the random variables C and Z: when (C, Z) takes the value (c, z), σ(C|Z) assumes the standard conditional probability (according to σ) of C taking the value c conditioned on Z taking the value z; it is assigned the value zero if the probability σ(Z = z) is zero. The parameter β can be any positive real value. We then note that the constant PEF F(cz) = 1 for all (c, z) ∈ C × Z is a valid PEF for any choice of β > 0. We will see in the subsequent sections, however, that the parameter does have an effect on the method employed for choosing useful PEFs for the purpose of randomness certification; in practice we choose the value of β that corresponds to the maximum randomness certification.
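The defining PEF constraint is straightforward to verify numerically for any finite family of candidate trial distributions. The following sketch (ours; the dictionary representation of joint distributions and the function names are assumptions) checks E_σ[F(CZ)σ(C|Z)^β] ⩽ 1 and confirms that the constant function F = 1 passes, as noted above.

```python
# Illustrative sketch: checking the PEF constraint E_sigma[F(CZ) sigma(C|Z)^beta] <= 1
# for a candidate F over a finite family of trial distributions sigma(c, z).

def is_pef(F, distributions, beta, tol=1e-12):
    for sigma in distributions:
        # settings marginal sigma(z)
        marg = {}
        for (c, z), p in sigma.items():
            marg[z] = marg.get(z, 0.0) + p
        expectation = 0.0
        for (c, z), p in sigma.items():
            cond = p / marg[z] if marg[z] > 0 else 0.0  # sigma(c|z)
            expectation += p * F[(c, z)] * cond ** beta
        if expectation > 1 + tol:
            return False
    return True

# The constant F = 1 is a PEF for any beta > 0, since sigma(c|z) <= 1 implies
# E_sigma[sigma(C|Z)^beta] <= 1.
uniform = {(c, z): 1 / 4 for c in (0, 1) for z in (0, 1)}
ones = {(c, z): 1.0 for c in (0, 1) for z in (0, 1)}
assert is_pef(ones, [uniform], beta=1.0)
```

Scaling F up eventually violates the constraint; e.g. the constant F = 3 fails against the uniform distribution at β = 1.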
Prior to defining a PEF we introduced the notion of a trial model. For the application of probability estimation to the outcomes of an experiment, which is a sequence of n time-ordered trials, we introduce the notion of an experiment model: a set Θ constraining the joint distribution of C, Z and E, constructed as a chain of individual trial models Π; it consists of joint distributions µ(CZ|E = e) conditioned on the event {E = e}, where E is the random variable denoting the adversary's side information and realising values e from the finite set E. It satisfies the following two assumptions:

µ(C_{i+1} Z_{i+1} | C_{⩽i} = c_{⩽i}, Z_{⩽i} = z_{⩽i}, E = e) ∈ Π_{i+1},
µ(Z_{i+1} | C_{⩽i} = c_{⩽i}, Z_{⩽i} = z_{⩽i}, E = e) = µ(Z_{i+1} | E = e).     (1)

In (1), C_{⩽i}, Z_{⩽i} denote the outcomes and measurement settings for the first i ∈ [n] trials, where [n] := {1, 2, . . . , n}, with c_{⩽i}, z_{⩽i} denoting their respective realisations. The random variables C_{i+1}, Z_{i+1} are the outcomes and settings for the (i + 1)'th trial. The first condition in (1) formalises the assumption that the (joint) probability of the (i + 1)'th outcome and setting, conditioned on the outcomes and settings for the first i trials and each realised value E = e of the adversary's side information, belongs to the (i + 1)'th trial model, i.e., it is compatible with the conditions dictated by the trial model. The second condition states that for each E = e the setting for the next trial is independent of the outcomes and settings of the past and present trials. Our second condition is a stronger assumption than the corresponding assumption given in [ZKB18], which is as follows: the joint distribution µ of CZE is such that Z_{i+1} is independent of C_{⩽i} conditionally on both Z_{⩽i} and E. It is a straightforward exercise to check that our stronger assumption implies the one stated in [ZKB18]. While the weaker assumption is sufficient for the following result, we find the stronger assumption operationally clearer as an assumption that the future settings are independent of "everything in the past" for each realisation of e.
For the rest of the paper we adopt the abbreviated notation of µ y (X) for µ(X|Y = y). The following theorem, appearing as Theorem 9 in Appendix C in [ZKB18], formalises the central idea behind the framework of probability estimation. We include a proof for this theorem in Appendix A for completeness.
Theorem 1. Let the distribution µ of CZE belong to the experiment model Θ, and for each i ∈ [n] let F_i be a PEF with power β > 0 for the i'th trial model. Then for every ϵ ∈ (0, 1),

P_{µ_e}( µ_e(C|Z) ⩾ (ϵ ∏_{i=1}^{n} F_i(C_iZ_i))^{−1/β} ) ⩽ ϵ     (2)

holds for each e ∈ E, where F_i(C_iZ_i) is the probability estimation factor for the i'th trial.
Proof. See Appendix A.

The inequality (2) in Theorem 1 can be understood, intuitively, as follows: when the trial-wise product ∏_{i=1}^{n} F_i(C_iZ_i) of the PEFs is large, the quantity (ϵ ∏_{i=1}^{n} F_i(C_iZ_i))^{−1/β} is small for fixed ϵ, β > 0; hence, for each e ∈ E, there is only a very small probability (denoted by the outer probability P_{µ_e}(·)) that the conditional probability of the sequence of outcomes C given the sequence of settings Z (denoted by µ_e(C|Z)) exceeds this small value.
For information-theoretic purposes, it is useful to translate the bound in (2) into a statement about min-entropy with respect to an adversary. An adversary's goal is to predict C. Conditioned on a particular realisation of the settings sequence z ∈ Z^n and side information e ∈ E, one can measure the "predictability" of the sequence of outcomes C with the maximum probability max_{c∈C^n} µ(c|ze), which quantifies the best guess of the adversary. The ze-conditional min-entropy of C, corresponding to that particular realisation ze ∈ Z^n × E, is the negative logarithm

H_{∞,µ}(C|ze) := −log₂ max_{c∈C^n} µ(c|ze).

The subscript µ in the notation H_{∞,µ}(· · ·) refers to the distribution µ(CZE). The average ZE-conditional min-entropy is then defined as

H^{avg}_{∞,µ}(C|ZE) := −log₂ ( ∑_{z,e} µ(ze) max_{c∈C^n} µ(c|ze) ).

However, information-theoretic security of cryptographic protocols takes into account a more realistic measure of average ZE-conditional min-entropy which involves a smoothing parameter ϵ, a type of error bound, and is known as the ϵ-smooth average ZE-conditional min-entropy. This quantity is useful for our scenario, in which the probability distribution is not known exactly and its characteristics can only be inferred from observed data, which introduces the possibility of error. It is defined as follows.
Definition 2 (Smooth Average Conditional Min-Entropy). For a distribution µ : C^n × Z^n × E → [0, 1] of C, Z, E and ϵ ∈ (0, 1), the ϵ-smooth average ZE-conditional min-entropy is defined as

H^{ϵ,avg}_{∞,µ}(C|ZE) := max_{σ : d_TV(σ,µ) ⩽ ϵ} H^{avg}_{∞,σ}(C|ZE),

where d_TV(σ, µ) is the total variation distance between σ and µ, defined as

d_TV(σ, µ) := (1/2) ∑_{c,z,e} |σ(cze) − µ(cze)|.
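The unsmoothed quantities above admit a direct computation for finite distribution tables. The following sketch (our illustration; the dictionary representation is an assumption) computes the average ZE-conditional min-entropy, using the identity µ(ze) max_c µ(c|ze) = max_c µ(c, z, e).

```python
# Illustrative sketch: average ZE-conditional min-entropy of a finite joint
# distribution mu(c, z, e), i.e. -log2 of the average guessing probability.
import math

def avg_conditional_min_entropy(mu):
    """mu maps (c, z, e) to the joint probability mu(c, z, e)."""
    guess = {}   # (z, e) -> max_c mu(c, z, e) = mu(z, e) * max_c mu(c|z, e)
    for (c, z, e), p in mu.items():
        guess[(z, e)] = max(guess.get((z, e), 0.0), p)
    # average guessing probability: sum_{z,e} mu(z, e) max_c mu(c|z, e)
    return -math.log2(sum(guess.values()))

# An outcome uniform on {0, 1} and independent of (z, e) gives exactly 1 bit:
mu = {(c, z, e): 0.25 for c in (0, 1) for z in (0, 1) for e in (0,)}
assert abs(avg_conditional_min_entropy(mu) - 1.0) < 1e-12
```

A fully deterministic conditional distribution (the adversary's best case) gives zero bits, since the average guessing probability is then one.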
The lower bound obtained on this quantity serves as one of the inputs to extractor functions in randomness extraction, whose purpose is to convert weakly random sources with uneven distributions into shorter, close-to-uniformly distributed bit strings. We note that alternative definitions of ϵ-smooth conditional min-entropy can be used, for instance, the ϵ-smooth worst-case conditional min-entropy of [RW05]. A known result from the literature, proven in Proposition 2 in Appendix E, justifies our usage of the ϵ-smooth average conditional min-entropy without having to work with the stricter ϵ-smooth worst-case conditional min-entropy (defined in (77)): specifically, the two quantities converge to one another in the asymptotic limit.
The result obtained from Theorem 1 can be translated into a result on smooth average conditional min-entropy, formalised in Theorem 2 below. This theorem appears as Theorem 1 in [ZKB18]. We include a proof for this theorem in Appendix A for completeness. In the notation of ϵ-smooth average ZE-conditional min-entropy in (5), the semicolon followed by S denotes that this information quantity is assessed with respect to the distribution µ after conditioning on the occurrence of the event S defined in the statement of Theorem 2. It pertains to an abort criterion: the protocol succeeds only if the product of the trial-wise PEFs exceeds some threshold value, and is otherwise aborted. So we want to establish the lower bound for smooth conditional min-entropy conditioned on the event that the protocol succeeds, because it is precisely in this scenario that we extract randomness. Since a completely predictable local distribution can always have a chance of passing the protocol, however minuscule (on the order of (3/4)^n, where the number of trials n often goes up to millions), and µ(c|z) will equal 1 in this case, it is necessary to assume a small but positive lower bound on the probability of not aborting in order to derive a useful min-entropy bound. This can be thought of as another type of error parameter. The assumed lower bound for the probability of success of the protocol is κ.
Theorem 2. Let µ be a distribution µ : C^n × Z^n × E → [0, 1] of C, Z, E such that for each e ∈ E, the following holds for every ϵ ∈ (0, 1):

P_{µ_e}( µ_e(C|Z) ⩾ (ϵ ∏_{i=1}^{n} F_i(C_iZ_i))^{−1/β} ) ⩽ ϵ,     (4)

where F_i is a PEF with power β for the i'th trial. For a fixed choice of ϵ ∈ (0, 1) and p ⩾ |C|^{−n}, define the event S := { (ϵ ∏_{i=1}^{n} F_i)^{−1/β} ⩽ p }. If P_µ(S) ⩾ κ > 0, then the following holds:

H^{ϵ/κ, avg}_{∞,µ}(C|ZE; S) ⩾ −log₂(p).     (5)

Under the same conditions of Theorem 2, the main result (5) admits a minor reformulation as follows. This is the formulation that aligns with the statement of Theorem 1 in [ZKB18]:

Corollary 1. Let µ : C^n × Z^n × E → [0, 1] be a distribution of CZE and F be a PEF with power β such that (4) holds for each e ∈ E. For a fixed choice of ϵ ∈ (0, 1), p ⩾ |C|^{−n}, and κ ∈ (0, 1] with P_µ(S) ⩾ κ for the event S := { (ϵ ∏_{i=1}^{n} F_i)^{−1/β} ⩽ p }, we have

H^{ϵ, avg}_{∞,µ}(C|ZE; S) ⩾ −log₂(p) + log₂(κ)/β.

Proof. Use Theorem 2 with ϵ′ = κϵ, p′ = p/κ^{1/β}, and κ′ = κ, noting that since 0 < κ ⩽ 1 and β > 0 hold, we have ϵ′ ∈ (0, 1) and p′ ⩾ |C|^{−n} as required for invoking the theorem. Then notice that the corresponding event S′ = { (ϵ′ ∏_{i=1}^{n} F_i)^{−1/β} ⩽ p′ } aligns with the event S.

The above results also hold when we consider distributions µ : C^n × Z^n × E^n → [0, 1] of CZE, i.e., where the side information is structured as a sequence of random variables. The proof remains the same with the exception that we condition on an arbitrary sequence of realisations e ∈ E^n of E. We consider this scenario in Section 3, where we define an IID attack from the adversary.
Theorem 1 does not indicate how to find PEFs. One way to find useful PEFs is to first notice that the success criterion of the protocol is the event S that the inequality (ϵ ∏_{i=1}^{n} F_i)^{−1/β} ⩽ p holds, which can be equivalently expressed as

( ∑_{i=1}^{n} log₂ F_i(C_iZ_i) + log₂(ϵ) ) / β ⩾ −log₂(p),     (7)

where ϵ, β and p are pre-determined quantities to be chosen in advance of running the protocol. Then, considering an anticipated trial distribution ρ(CZ) based on observed results and calibrations from previous trials, in the limit of sufficiently large n the difference between the term on the left-hand side of (7) (which consists of the trial-wise sum of base-2 logarithms of the PEFs) and nE_ρ[log₂(F(CZ))/β] will be either greater or less than zero with roughly equal probability. This follows from the Central Limit Theorem if the distribution remains roughly stable from trial to trial. Since it is desirable to have the largest value of −log₂(p) possible, one can then perform the following constrained maximisation using any convex programming software, owing to the concavity of the objective function and the linearity of the constraints.
Maximise:   E_ρ[(n log₂(F(CZ)) + log₂(ϵ))/β]
Subject to: E_σ[F(CZ) σ(C|Z)^β] ⩽ 1  for all σ(CZ) ∈ Π,
            F(cz) ⩾ 0  for all (c, z) ∈ C × Z.     (8)

Since n, ϵ and β are fixed, it is sufficient to maximise E_ρ[log₂(F(CZ))] subject to the same constraints. In practice, one can consider a range of values of β and perform the constrained maximisation with the objective E_ρ[log₂(F(CZ))], then plug the maximum value into the expression E_ρ[(n log₂(F(CZ)) + log₂(ϵ))/β] and obtain a plot with respect to the considered range of β values (see, for example, Figure 2 in [BZ20]; a similar pattern is observed in Figure 2 in Section 4).
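As a toy illustration of this optimisation (the model and all numbers are invented, and far simpler than a Bell scenario: a single binary outcome, trivial settings, and β = 1), the following sketch maximises the certified rate by brute-force grid search, imposing the PEF constraint only at the two extremal distributions of the toy model. A real implementation would use convex programming as described above; the grid search merely exhibits the structure of the problem.

```python
# Toy sketch (model and numbers invented): PEF selection by grid search.
# "Model": extremal distributions nu_a, nu_b over a single binary outcome;
# the anticipated distribution rho is their even mixture.
import math

beta = 1.0
nu_a, nu_b = (0.9, 0.1), (0.1, 0.9)
rho = (0.5, 0.5)

def pef_ok(F, nu):
    # PEF constraint E_nu[F(C) nu(C)^beta] <= 1 at an extremal distribution
    return sum(p * f * p ** beta for p, f in zip(nu, F)) <= 1.0

def rate(F):
    # objective E_rho[log2 F(C)]: certified bits per trial, up to the
    # log2(epsilon)/beta finite-size correction
    return sum(p * math.log2(f) for p, f in zip(rho, F))

grid = [0.01 + k * (1.25 - 0.01) / 199 for k in range(200)]
best = max((rate((s, t)) for s in grid for t in grid
            if pef_ok((s, t), nu_a) and pef_ok((s, t), nu_b)),
           default=None)
assert best is not None and best > 0.25  # strictly positive certified rate
```

Here the optimum is attained near F(0) = F(1) = 1/0.82, giving roughly 0.29 bits per trial; the rate is positive because the toy model, unlike Π_L, contains no deterministic distributions.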
The following lemma (from [ZKB18], see Lemma 15), for which we provide a more direct proof, enables us to restrict the satisfiability constraints of the optimisation routine in (8) to the extremal distributions of the model Π under the condition that the model is convex and closed. So, the first line of constraints in (8) can be replaced with E_ν[F(CZ) ν(C|Z)^β] ⩽ 1 for all ν(CZ) ∈ Π_extr, where Π_extr is the set of extremal distributions of Π. If the model Π is not convex and closed, we take its convex closure. In words, the lemma states that if F(CZ) is a PEF with power β > 0 for the distributions σ₁(CZ) and σ₂(CZ), then it is a PEF with the same power for all distributions that can be expressed as a convex combination of σ₁ and σ₂.
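The lemma can be spot-checked numerically: for fixed non-negative F the map σ ↦ E_σ[F(CZ)σ(C|Z)^β] is convex in σ, so its value at a mixture never exceeds the larger of its values at the endpoints. The following sketch (ours; toy random distributions, β = 0.5) performs such a check.

```python
# Numerical spot-check of the lemma (toy random distributions): if F is a PEF
# with power beta for sigma_1 and sigma_2, it remains a PEF for every convex
# mixture of the two.
import random

random.seed(0)

def pef_value(F, sigma, beta):
    """E_sigma[F(CZ) sigma(C|Z)^beta] for a joint distribution sigma(c, z)."""
    marg = {}
    for (c, z), p in sigma.items():
        marg[z] = marg.get(z, 0.0) + p
    return sum(p * F[(c, z)] * (p / marg[z]) ** beta
               for (c, z), p in sigma.items() if marg[z] > 0)

def random_dist():
    w = {(c, z): random.random() for c in (0, 1) for z in (0, 1)}
    s = sum(w.values())
    return {k: v / s for k, v in w.items()}

beta = 0.5
for _ in range(100):
    s1, s2 = random_dist(), random_dist()
    # scale an arbitrary positive F so that it is a PEF for both s1 and s2
    F = {(c, z): random.random() + 0.1 for c in (0, 1) for z in (0, 1)}
    scale = max(pef_value(F, s1, beta), pef_value(F, s2, beta))
    F = {k: v / scale for k, v in F.items()}
    for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
        mix = {k: lam * s1[k] + (1 - lam) * s2[k] for k in s1}
        assert pef_value(F, mix, beta) <= 1 + 1e-9
```

The underlying reason, used in our proof, is that each term σ(cz)^{1+β}/σ(z)^β is a perspective-type function and hence jointly convex in σ.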

Asymptotic Performance
The results of the previous section give us a method for certifying randomness. In this section, we assess the asymptotic performance of the method. Our figure of merit is the amount of randomness certified per trial, as measured by the average conditional min-entropy divided by the number of trials n. We will see in this section that the PEF method is asymptotically optimal, in the following sense: given a fixed observed distribution, the PEF method can asymptotically certify an amount of per-trial conditional min-entropy that is equal to the actual per-trial conditional min-entropy generated by an adversary replicating the observed distribution with as little randomness as possible.
To elaborate on this, consider that the adversary's goal is to minimise the quantity (1/n) H^{avg}_{∞,µ}(C|ZE). We assume that the adversary has complete knowledge of the distribution µ, and can have access not just to the realised value of E, but also to the realised value of Z in guessing C. This access to Z aligns with the paradigm, as discussed in [Bie+18], of "using public (settings) randomness to generate private (outcome) randomness". The adversary is constrained, however, in that the statistics when marginalised over E must appear consistent with an expected observed trial distribution ρ(CZ) for the protocol not to abort. Technically, all that is necessary for the protocol to pass is that the observed product of the PEFs exceeds some threshold value chosen by the experimenter (which could be achieved with high probability by many different distributions µ), but as the experimenter's threshold value will likely be chosen based on a full behaviour that they expect to observe, we study attacks that match the expected observed trial distribution exactly. We will find attacks meeting this criterion that are asymptotically optimal for minimising the conditional min-entropy.
Given an expected observed distribution, how can the adversary generate observed statistics consistent with it while yielding as little randomness as possible? She can employ a strategy of preparing multiple different states to be measured that will yield different distributions, each one consistent with the trial model Π, whose convex mixture is equal to the observed distribution. If she has an auxiliary random variable E realising values from the finite-cardinality set E and recording which state was prepared on which trial, she can better predict the outcome conditioned on her side information E = e, in conjunction with the settings Z. Indeed, some of her e-conditional distributions could be deterministic, in which case she does not yield any randomness to Alice and Bob on a trial where E takes that value. † But if the overall observed statistics are non-local, then she is forced to prepare at least some states that contain randomness even conditioned on e; this, in essence, is because the information that she possesses with E is a local hidden variable.

I.I.D. Attacks
Given a convex decomposition of the observed distribution, the adversary's simplest form of attack is to select e from some finite-cardinality set $\mathcal{E}$ in an i.i.d. manner on each trial, according to the distribution that recovers the observed distribution ρ(CZ). A more general attack would allow her to use memory of earlier trials, but we will see later that, asymptotically, this does not yield a meaningful improvement.
Operationally, we do not like to think of the adversary accessing the devices in between trials to provide a choice of $e_i$ for each trial. Instead, one can imagine her randomly sampling from the distribution of E for all trials, coming up with a choice e that encodes all the choices $e_i$ for each trial, and then supplying this choice to the measured system, in advance, to determine its behaviour in each trial. She keeps a record of e to help her predict C later. Through this sampling process there is a small chance that she will sample an atypical "bad" e that results in statistics deviating from the observed distribution, but the probability that her e is typical is asymptotically high. Our figure of merit for the adversary is now the quantity $\frac{1}{n}H^{\mathrm{avg}}_{\infty,\mu}(C|ZE)$, which she wants to minimise with a distribution that, marginalised over E, is consistent with i.i.d. sampling from an expected observed distribution ρ. We formally define the set of distributions ω : C × Z × E → [0, 1] of C, Z, E mimicking ρ through such a convex decomposition as follows, where e is shorthand for the event {E = e}:
$$\Sigma^\rho_{\mathcal{E}} := \Big\{\omega(CZE) \;:\; \omega(CZ|e) \in \Pi \text{ for all } e \in \mathcal{E}, \ \ \sum_{e \in \mathcal{E}} \omega(CZ|e)\,\omega(e) = \rho(CZ)\Big\}.$$
Then an IID attack can be defined as follows.
† A deterministic distribution must be understood as the product of a fixed settings distribution and a deterministic behaviour (conditional distribution of the outcomes conditioned on settings).
Definition 3 (IID Attack). Given a distribution $\omega(CZE) \in \Sigma^\rho_{\mathcal{E}}$, we define an IID attack (with ω) to be the distribution ϕ consisting of n independent and identical realisations of random variables $C_i, Z_i, E_i$ distributed according to ω; i.e., the joint distribution of the sequence of random variables C, Z, E is
$$\phi(cze) := \prod_{i=1}^{n}\omega(c_i z_i e_i).$$
As mentioned earlier, the adversary randomly samples from the distribution of E, which represents their knowledge of all trials; $e \equiv (e_1, e_2, \ldots, e_n) \in \mathcal{E}^n$ encodes the individual choices $e_i$ for trial $i \in \{1, 2, \ldots, n\}$. The IID attack model satisfies the two assumptions of the experiment model discussed earlier (see (1) and the short discussion that follows immediately). Namely, the (joint) probability of the (i + 1)'th trial outcome and input setting, conditioned on each realisation of the outcomes and settings for the first i trials and each realisation $e \in \mathcal{E}^n$ of the side information, satisfies the conditions of the trial model; and conditioned on each $e \in \mathcal{E}^n$, the settings for the (i + 1)'th trial are (unconditionally) independent of the outcomes and settings of the past and present trials (i.e., the first i trials). This is formally stated and proved in Lemma 2 below.
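As a sketch of Definition 3 (the decomposition below is a hypothetical toy example, not taken from the paper), an IID attack amounts to sampling $(c_i, z_i, e_i)$ independently from a single-trial ω whose marginal over E reproduces ρ:

```python
import random

# Hypothetical toy decomposition (illustrative numbers only): E selects
# between two single-trial distributions of (C, Z) whose mixture is rho.
omega_E = {0: 0.5, 1: 0.5}                       # distribution omega(e)
omega_CZ_given_E = {                             # omega(c, z | e)
    0: {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.25, (1, 1): 0.25},
    1: {(0, 0): 0.0, (1, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25},
}

def marginal_CZ():
    """Marginalise the attack over E: rho(c, z) = sum_e omega(c, z | e) omega(e)."""
    rho = {}
    for e, pe in omega_E.items():
        for cz, p in omega_CZ_given_E[e].items():
            rho[cz] = rho.get(cz, 0.0) + pe * p
    return rho

def sample_iid_attack(n, seed=0):
    """Sample n i.i.d. trials of (C, Z, E): the adversary draws all e_i in
    advance and keeps a record of them to help predict C later."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        e = rng.choices(list(omega_E), weights=list(omega_E.values()))[0]
        branch = omega_CZ_given_E[e]
        cz = rng.choices(list(branch), weights=list(branch.values()))[0]
        trials.append((cz[0], cz[1], e))
    return trials
```

Here `marginal_CZ` checks the consistency constraint $\sum_e \omega(CZ|e)\,\omega(e) = \rho(CZ)$, while `sample_iid_attack` mirrors the adversary drawing all of e in advance.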
Lemma 2. The IID attack strategy as defined in Definition 3 satisfies the following conditions.
Next, the adversary would like to implement an attack that "generates as little randomness as possible". One measure of the randomness is the conditional Shannon entropy of the outcomes C conditioned on the inputs Z and the side information E.
Definition 4 (Conditional Shannon Entropy). For a distribution µ : C × Z × E → [0, 1] of C, Z, E, the conditional Shannon entropy of the outcomes C conditioned on the settings Z and the side information E is defined as
$$H_\mu(C|ZE) := -\sum_{c,z,e}\mu(cze)\log_2\big(\mu(c|ze)\big).$$
The Greek letter µ in the subscript of $H_\mu(\cdot|\cdot)$ refers to the distribution µ(CZE) with respect to which the conditional Shannon entropy is defined.
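For finitely supported distributions, Definition 4 can be evaluated directly; a minimal sketch (the dictionary encoding of µ is our own convention):

```python
from math import log2

def cond_shannon_entropy(mu):
    """H_mu(C | Z E) = - sum_{c,z,e} mu(c, z, e) * log2 mu(c | z, e),
    for mu given as a dict {(c, z, e): probability}."""
    p_ze = {}                                    # marginal of (Z, E)
    for (c, z, e), p in mu.items():
        p_ze[(z, e)] = p_ze.get((z, e), 0.0) + p
    h = 0.0
    for (c, z, e), p in mu.items():
        if p > 0:
            h -= p * log2(p / p_ze[(z, e)])      # log2 of mu(c | z, e)
    return h
```

A uniformly random outcome yields one bit per trial, while an outcome determined by (Z, E) yields zero, matching the adversary's goal of driving this quantity down.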
Theorem 3 below shows that $H_\omega(C|ZE)$ is an asymptotic upper bound on the per-trial conditional min-entropy that the adversary generates with an IID attack employing a trial distribution ω that is consistent with the observed distribution ρ. This result was discussed but not demonstrated explicitly in [KZB20]. The proof of Theorem 3 involves one of the fundamental technical tools from information theory, the (classical) Asymptotic Equipartition Property (AEP), or equivalently the notion of typical sequences, which has the weak law of large numbers at its core.
Suppose µ, the distribution of all trials, is obtained as n i.i.d. copies of a single-trial distribution ω. Then for $\epsilon_a \in (0, 1)$ and δ > 0, there exists N such that for all n ⩾ N,
$$\mu\big(\{(c, z, e) : \mu(c|ze) \geq 2^{-n(H_\omega(C|ZE)+\delta)}\}\big) \geq 1 - \epsilon_a,$$
where $H_\omega(C|ZE)$ is the conditional Shannon entropy. We refer to this as the AEP condition; it holds by a conditional form of the classical AEP (see, for instance, Section 14.6 in [Wil13]). The set $B_{\epsilon_s}(\mu)$ of distributions of C, Z, E that are within a TV-distance of $\epsilon_s$ from µ and the sets $A_{ze}$ are as defined below:
$$B_{\epsilon_s}(\mu) := \{\sigma : d_{TV}(\sigma, \mu) \leq \epsilon_s\}, \qquad A_{ze} := \{c \in \mathcal{C}^n : \mu(c|ze) \geq \gamma\}, \quad \gamma := 2^{-n(H_\omega(C|ZE)+\delta)},$$
where $A_{ze}$ is defined for any ze for which µ(ze) > 0. Note that the case $\epsilon_s = 0$ reduces to a bound on the standard (non-smooth) average conditional min-entropy. We now state the result as follows.
Theorem 3. Let µ be an IID attack with ω. For $\epsilon_s \geq 0$, $\epsilon_a, \delta > 0$ and $\epsilon_a + 2\epsilon_s < 1$, there exists $N(\epsilon_a, \epsilon_s, \delta)$ such that for $n \geq N(\epsilon_a, \epsilon_s, \delta)$,
$$\frac{1}{n}H^{\epsilon_s,\mathrm{avg}}_{\infty,\mu}(C|ZE) \leq H_\omega(C|ZE) + \delta - \frac{1}{n}\log_2(1-\epsilon_a-2\epsilon_s). \tag{19}$$
Proof. Throughout, we follow the convention that σ(c|ze) = 0 for all $c \in \mathcal{C}^n$ for any $ze \in \mathcal{Z}^n \times \mathcal{E}^n$ with σ(ze) = 0. We begin with the inequality $d_{TV}(\sigma, \mu) \leq \epsilon_s$ that any $\sigma \in B_{\epsilon_s}(\mu)$ must satisfy; the inequality in (20) follows as a result of the sum containing fewer terms, and the inequality in (21) follows from the triangle inequality. Now we use the AEP condition mentioned above. For any $\sigma \in B_{\epsilon_s}(\mu)$, we define $M^\sigma_{ze}$ for any $ze \in \mathcal{Z}^n \times \mathcal{E}^n$ as $M^\sigma_{ze} := \max_{c \in \mathcal{C}^n}\sigma(c|ze)$. The average conditional maximum probability is then expressed as $\bar{M}^\sigma := \sum_{ze} M^\sigma_{ze}\,\sigma(ze)$. Now, because $\gamma|A_{ze}| \leq \sum_{c\in A_{ze}}\mu(c|ze) \leq 1$, we have $|A_{ze}| \leq 1/\gamma$ for each ze. Using (22) and (23) we obtain $\bar{M}^\sigma \geq \gamma(1-\epsilon_a-2\epsilon_s)$, from which (19) follows using the definition of smooth average conditional min-entropy.
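The AEP condition at the heart of the proof can be illustrated numerically. The sketch below, for unconditioned i.i.d. bits with illustrative parameters, computes the exact probability mass of the δ-typical set, which approaches 1 as n grows:

```python
from math import comb, exp, log, log2

def typical_mass(n, p, delta):
    """Probability that an i.i.d. Bernoulli(p)^n sequence c is delta-typical,
    i.e. |-(1/n) log2 mu(c) - H(p)| <= delta (classical AEP illustration)."""
    H = -p * log2(p) - (1 - p) * log2(1 - p)     # Shannon entropy of one bit
    total = 0.0
    for k in range(n + 1):                       # k = number of ones in c
        log_mu = k * log2(p) + (n - k) * log2(1 - p)
        if abs(-log_mu / n - H) <= delta:
            # probability of the type class, in log space to avoid
            # overflow/underflow of comb(n, k) * p**k for large n
            log_pk = log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)
            total += exp(log_pk)
    return total
```

For fixed δ the typical mass is strictly below 1 at any finite n but converges upwards, mirroring the role of $\epsilon_a$ in the theorem.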
Having shown that the per-trial min-entropy generated by an IID attack is asymptotically bounded by the conditional Shannon entropy, we give the following definition of an optimal attack.
Definition 5 (Optimal IID Attack). The distribution µ(CZE) of the sequence of random variables C, Z, E is an optimal IID attack strategy if µ is obtained through an IID attack based on a single-trial distribution ω whose conditional Shannon entropy achieves the infimum defined below:
$$h_{\min}(\rho) := \inf_{\omega \in \Sigma^\rho_{\mathcal{E}}} H_\omega(C|ZE). \tag{24}$$
Additional motivation for naming the attack of Definition 5 optimal is provided by later results in this section, which show that the adversary must generate at least $h_{\min}(\rho)$ per-trial conditional min-entropy asymptotically with any attack that replicates the observed distribution ρ.
In the theorem that follows, we formalise the claim that the infimum in (24) is achieved. This theorem corresponds to Theorem 43 in [ZFK20]; in comparison, the comprehensive proof provided here explicitly works out more of the steps. Crucially, this explicit approach also allowed us to provide an improvement upon the result of Theorem 43 in [ZFK20], decreasing the required value of |E| by one, thereby better characterising the adversary's optimal attack. Results in Section 4.2 will illustrate that no further improvement, i.e., a decrease in |E|, is possible.
Theorem 4. Suppose Π is closed and equal to the convex hull of its extreme points. Then the infimum in (24) is achieved.

Proof. See Appendix B.
Theorem 4, in conjunction with the bound in Theorem 3, sets a benchmark for how well the adversary can do with an IID attack that replicates the observed distribution ρ(CZ). Specifically, the adversary's goal is to minimise the amount of per-trial conditional min-entropy, and this shows there exists a strategy to replicate the observed statistics while conceding no more min-entropy per trial than $h_{\min}(\rho)$, asymptotically.

Optimal PEFs
We now show that PEFs can asymptotically certify a min-entropy of $h_{\min}(\rho)$ per trial from an observed distribution ρ. This is notable since it shows that an IID attack can be asymptotically optimal: since the PEF method certifies the presence of $h_{\min}(\rho)$ min-entropy per trial against any attack, no attack can generate observed statistics consistent with ρ while conceding a smaller amount of randomness. This furthermore demonstrates that there is nothing to be gained (asymptotically) by the adversary employing a more sophisticated memory-based attack, since the PEF method allows for the possibility of memory attacks. Conversely, the results below show that the PEF method is asymptotically optimal: no (correct) method can certify more min-entropy per trial from ρ than the amount that is present in an explicit attack.
To formalise and prove these claims, we use the following technical tool, called an "entropy estimator" as in [KZB20].
Definition 6 (Entropy Estimator). An entropy estimator of the model Π is a function K : C × Z → ℝ satisfying
$$E_\sigma[K(CZ)] \leq E_\sigma[-\log_2(\sigma(C|Z))] \quad \text{for all } \sigma \in \Pi.$$
Given an entropy estimator K(CZ), we say that its entropy estimate at a distribution σ(CZ) is $E_\sigma[K(CZ)]$. We will see below that an entropy estimator can be used to construct PEFs certifying per-trial min-entropy arbitrarily close to its entropy estimate, underlining the significance of the following result.

Theorem 5. Suppose Π satisfies the conditions of Theorem 4 and ρ is in the interior of Π. Then there exists an entropy estimator whose entropy estimate at ρ is equal to $h_{\min}(\rho)$.

Proof. See Appendix B.
The assumption that ρ is in the interior of Π will generally hold if ρ is estimated from real data, as the boundary of Π is a measure zero set. If the assumption is removed, a weaker version of the theorem can still be obtained, which is discussed in the proof in Appendix B.
The entropy estimator K(CZ) whose existence is guaranteed by the above theorem can be used to show the existence of a family of PEFs that can get arbitrarily close to certifying $h_{\min}(\rho)$ per-trial min-entropy. However, for a precise formulation of this claim we need a way to measure the asymptotic rate of min-entropy certified by PEFs. Recall from (6) that we can lower-bound the per-trial min-entropy certified by a PEF as in (25). As in [ZFK20], we ignore the $\log_2(\kappa)$ term in the asymptotic regime, as the completeness parameter κ can be thought of as a "reasonable" lower bound on the probability that the protocol does not abort, a type of error parameter that one might try to decrease somewhat for longer experiments but not at the exponential decay rate required to make this term asymptotically significant. Focusing then on the $-(1/n)\log_2(p)$ term, recall that success of the protocol is determined by the occurrence of the event S, with failure probability bounded by p; this inequality can be expressed equivalently with the negative base-2 logarithm of the upper bound on $\mu_e(C|Z)$ for each $e \in \mathcal{E}^n$ on its left-hand side (refer to (2) and the comments following Corollary 1), and so it is a rough measure of the amount of randomness, up to an error probability of ϵ, present in the outcome data. More concretely, since p will be chosen to make $-(1/n)\log_2(p)$ as large as reasonably possible to optimise the min-entropy certified by (25), the anticipated value of the left-hand-side quantity can be used as a measure of certifiable randomness. For a stable experiment (i.e., one with each trial having the same distribution σ belonging to the same model Π), the quantity $\frac{1}{n}\sum_{i=1}^{n}\log_2(F_i)/\beta$ approaches $E_\sigma[\log_2(F(CZ))]/\beta$ in the limit n → ∞, while the term $(1/(n\beta))\log_2(\epsilon)$ goes to zero for any fixed value of β and ϵ. Hence we introduce the log-prob rate,
$$O_\rho(F;\beta) := E_\rho[\log_2(F(CZ))]/\beta,$$
as a measure of per-trial min-entropy certified by a PEF.
We say that a PEF certifies randomness at a distribution ρ if the quantity $O_\rho(F;\beta)$ is positive. We note that this definition is consistent with our expectation that only non-local distributions allow the certification of randomness, as the log-prob rate for a local distribution $\sigma_L$ is a non-positive number, i.e., $O_{\sigma_L}(F;\beta) \leq 0$: a local behaviour is a convex mixture of (finitely many) local deterministic behaviours $\sigma_{LD}(C|Z)$. Hence, with a fixed settings distribution π(z) > 0, the defining condition $E_\sigma[F(CZ)\sigma(C|Z)^\beta] \leq 1$ of a PEF, for a distribution defined as $\sigma(cz) = \sigma_{LD}(c|z)\pi(z)$ for all c, z, is equivalently expressed as $E_\sigma[F(CZ)] \leq 1$, since $\sigma_{LD}(c|z)$ is either 0 or 1 for all c, z. Due to the concavity of the log function, we then have $E_\sigma[\log_2(F(CZ))] \leq \log_2(E_\sigma[F(CZ)]) \leq 0$ by Jensen's inequality. Hence, no device-independent randomness can be certified at a local-realistic distribution.
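The reduction of the PEF condition at local deterministic distributions, and the resulting non-positive log-prob rate, can be checked numerically. In the sketch below, the trial model and the function F are illustrative toy choices (F is only required to satisfy $E_\sigma[F] \leq 1$ at each local deterministic distribution; it is not claimed to be a PEF for any particular model):

```python
from itertools import product
from math import log2

# Toy trial model: binary outcome C and setting Z with uniform pi(z) = 1/2.
def ld_distribution(c0, c1):
    """sigma(cz) = pi(z) * [c == c_z] for the deterministic map z -> c_z."""
    return {(c, z): (0.5 if c == (c0, c1)[z] else 0.0)
            for c, z in product((0, 1), repeat=2)}

# Illustrative positive function F (hypothetical values, chosen only so
# that E_sigma[F] <= 1 at every local deterministic distribution below).
F = {(0, 0): 1.2, (1, 0): 0.8, (0, 1): 0.8, (1, 1): 0.8}

def expectation(F, sigma):
    return sum(p * F[cz] for cz, p in sigma.items())

def log_prob_rate(F, sigma, beta):
    """O_sigma(F; beta) = E_sigma[log2 F(CZ)] / beta over the support of sigma."""
    return sum(p * log2(F[cz]) for cz, p in sigma.items() if p > 0) / beta
```

Because $\sigma(c|z) \in \{0, 1\}$ on the support, the constraint $E_\sigma[F\,\sigma(C|Z)^\beta] \leq 1$ collapses to $E_\sigma[F] \leq 1$, and Jensen's inequality then forces the rate below zero for every deterministic map.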
Theorem 6. Given an entropy estimator K(CZ) and an observed distribution ρ(CZ), for any ϵ ∈ (0, 1/2) there is a PEF whose log-prob rate at ρ is greater than $E_\rho[K(CZ)] - \epsilon$.

Our proof follows the general approach of Theorem 41 in [KZB20], though we are able to shorten the argument.
Proof. Given an entropy estimator K(CZ) and ϵ ∈ (0, 1/2) from the statement of the theorem, for any γ > 0 we can define a function
$$F(CZ) := 2^{\gamma(K(CZ)-\epsilon)}. \tag{26}$$
We will show that there exists a (small) positive value of γ for which F(CZ) is a PEF with power β = γ; the asymptotic log-prob rate of this PEF at ρ will then be $E_\rho[\log_2(F(CZ))]/\beta = E_\rho[K(CZ)] - \epsilon$ as desired. So our task is to find a value of γ such that the following inequality holds for all σ ∈ Π:
$$E_\sigma[F(CZ)\sigma(C|Z)^\gamma] \leq 1.$$
We study the left side of the above expression as a function of γ; specifically, define a function
$$f_\sigma(\gamma) := \sum_{cz:\,\sigma(cz)>0} \sigma(cz)\big[2^{K(cz)-\epsilon}\sigma(c|z)\big]^\gamma,$$
which is, for any fixed choice of σ and K(CZ), a convex combination of positive constants raised to the power of γ and so is infinitely differentiable at all γ ∈ ℝ. (Note that we never encounter the problematic form $0^0$ because the argument of $[\cdot]^\gamma$ is always strictly positive, as the sum defining $f_\sigma$ extends only over values of c, z for which σ(cz) is positive, and hence σ(c|z) > 0.) We can thus Taylor-expand $f_\sigma$ about γ = 0, obtaining via the Lagrange remainder theorem that for any positive γ, there exists a k ∈ (0, γ) making the following equality hold:
$$f_\sigma(\gamma) = f_\sigma(0) + \gamma f'_\sigma(0) + \frac{\gamma^2}{2}f''_\sigma(k). \tag{27}$$
The first term in the expansion satisfies $f_\sigma(0) = \sum_{cz} 1\cdot\sigma(cz) = 1$. The coefficient of γ in (27) satisfies
$$f'_\sigma(0) = \ln(2)\,E_\sigma\big[K(CZ) - \epsilon + \log_2(\sigma(C|Z))\big] \leq -\epsilon\ln(2),$$
where the inequality follows from the condition $E_\sigma[K(CZ)] \leq E_\sigma[-\log_2(\sigma(C|Z))]$ in the definition of an entropy estimator. Hence (27) yields
$$f_\sigma(\gamma) \leq 1 - \gamma\epsilon\ln(2) + \frac{\gamma^2}{2}f''_\sigma(k) \tag{28}$$
for some k ∈ (0, γ). Now, given a fixed γ, k may be different in (28) for different choices of σ; however, it must always lie in the interval (0, γ), so if we can show that there is a choice of γ such that for any σ the following inequality holds for all k ∈ (0, γ),
$$\frac{\gamma}{2}f''_\sigma(k) \leq \epsilon\ln(2), \tag{29}$$
then for that value of γ, we will know that F(CZ) as defined in (26) is a valid PEF satisfying the conditions of the theorem. To find the needed value of γ making (29) hold and complete the proof, we calculate
$$f''_\sigma(k) = \ln(2)^2\sum_{cz:\,\sigma(cz)>0}\sigma(cz)\big[2^{K(cz)-\epsilon}\sigma(c|z)\big]^{k}\big[K(cz)-\epsilon+\log_2(\sigma(c|z))\big]^2 \leq \ln(2)^2\,M^{k}\sum_{cz:\,\sigma(cz)>0}\sigma(z)\,\sigma(c|z)^{k+1}\big[K(cz)-\epsilon+\log_2(\sigma(c|z))\big]^2,$$
where $M = \max_{cz} 2^{K(cz)}$ (using $2^{-\epsilon k} \leq 1$ and $\sigma(cz) = \sigma(z)\sigma(c|z)$).
We now assert that each quantity $\sigma(c|z)^{k+1}\big[K(cz)-\epsilon+\log_2(\sigma(c|z))\big]^2$ is bounded above by a constant $N_{cz}$ for all k > 0, and $N_{cz}$ is independent of σ. This follows because for any fixed choice of c and z, this quantity is bounded by the expression $g_{cz}(x) = x\big[K(cz)-\epsilon+\log_2(x)\big]^2$ for the choice of x = σ(c|z) ∈ (0, 1] (note that since σ(c|z) ∈ (0, 1], $\sigma(c|z)^{k+1} \leq \sigma(c|z)$ holds for any k > 0). Then two applications of l'Hôpital's rule demonstrate that $\lim_{x\to 0} g_{cz}(x)$ exists, and so $g_{cz}$ can be extended to a continuous function on [0, 1] where it has a maximum by the extreme value theorem.† Referring to this maximum as $N_{cz}$ and letting $N = \max_{cz} N_{cz}$, we get the desired bound as shown below.
This shows that if $M^{k}\gamma \leq 2\epsilon/(|C|N)$ holds, then (29) holds, from which it follows that a sufficiently small choice of γ > 0 makes (29) hold for all k ∈ (0, γ).
The combination of Theorem 5, which shows the existence of an entropy estimator with entropy estimate $h_{\min}(\rho)$, and Theorem 6, which enables the construction of a family of PEFs with log-prob rate arbitrarily close to this entropy estimate, demonstrates the asymptotic optimality of the PEF method.

Robustness of PEFs
We want to consider a question not considered in the previous PEF papers: can a PEF optimised for ρ(CZ) certify randomness for a distribution different from ρ, where the difference is measured in terms of the total variation distance between them? In other words, how robust is the PEF? We will see in the next section that in the (2, 2, 2) Bell scenario, for any behaviour corresponding to a ρ violating the CHSH-Bell inequality, PEFs can be (up to any desired ϵ-tolerance) asymptotically optimal in terms of log-prob rate at ρ while also generating randomness at a positive rate for any behaviour (corresponding to a distribution of outcomes and settings) that violates the CHSH-Bell inequality by a fixed positive amount, which can be chosen to be as small as desired.
The following theorem gives a useful sufficient condition for a distribution different from ρ to have positive log-prob rate, and demonstrates that any nontrivial (i.e., non-constant) PEF will have at least some degree of robustness.
Theorem 7. Let $F(CZ) = G(CZ)^\beta$ be a non-constant positive PEF with power β > 0 for Π. The log-prob rate $O_\sigma(F;\beta)$ at a distribution σ(CZ) ∈ Π is related to the log-prob rate $O_\rho(F;\beta)$ at ρ(CZ) ∈ Π and the total variation distance between ρ and σ as
$$O_\sigma(F;\beta) \geq O_\rho(F;\beta) - |L-l|\,d_{TV}(\rho,\sigma), \tag{30}$$
where $L = \max_{cz}\log_2(G(cz))$ and $l = \min_{cz}\log_2(G(cz))$. Consequently, assuming that $O_\rho(F;\beta)$ is positive, the following upper bound on the total variation distance between ρ(CZ) and σ(CZ) is a sufficient condition for F to have a positive log-prob rate at σ(CZ):
$$d_{TV}(\rho,\sigma) < \frac{O_\rho(F;\beta)}{|L-l|}. \tag{31}$$
Proof. Using the definition of log-prob rate at a given distribution, we have
$$O_\sigma(F;\beta) - O_\rho(F;\beta) = \frac{1}{\beta}\sum_{cz}\big(\sigma(cz)-\rho(cz)\big)\log_2(F(cz)) = \sum_{cz}\big(\sigma(cz)-\rho(cz)\big)\log_2(G(cz)).$$
Hence, we have $|O_\sigma(F;\beta) - O_\rho(F;\beta)| \leq |L-l|\,d_{TV}(\rho,\sigma)$. Assuming that $O_\rho(F;\beta)$ is positive, a sufficient condition for $O_\sigma(F;\beta)$ to be positive is $O_\rho(F;\beta) > |L-l|\,d_{TV}(\rho,\sigma)$, or equivalently, the bound (31) on $d_{TV}(\rho,\sigma)$. We will see in Section 4.2 that the bound (31) can be saturated, and so is tight.
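The sufficient condition (31) is straightforward to evaluate for a given PEF; in the sketch below, the values of F and ρ are hypothetical placeholders:

```python
from math import log2

def log_prob_rate(F, beta, dist):
    """O_rho(F; beta) = E_rho[log2 F(CZ)] / beta for a finitely supported dist."""
    return sum(p * log2(F[cz]) for cz, p in dist.items() if p > 0) / beta

def robustness_radius(F, beta, rho):
    """TV-distance radius from the bound (31): d_TV(rho, sigma) below this
    value guarantees a positive log-prob rate at sigma. Requires F = G^beta
    non-constant (so that L > l) and O_rho(F; beta) > 0."""
    log_G = {cz: log2(F[cz]) / beta for cz in F}   # log2 G = (log2 F) / beta
    L, l = max(log_G.values()), min(log_G.values())
    return log_prob_rate(F, beta, rho) / (L - l)
```

With illustrative values F = {a: 2, b: 1}, β = 1 and ρ = (0.75, 0.25), the radius is 0.75, so any σ within that TV distance, such as the uniform distribution (distance 0.25), still has a positive rate.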
4 Application to the (2,2,2) Bell scenario

Here, we explore the application of the results of the previous section to the (2,2,2) Bell scenario (that of two parties, two measurement settings, and two outcomes). First, working within the trial model of no-signalling distributions $\Pi_{NS}$, we show that PEFs can be simultaneously asymptotically optimal and robust by means of an explicit construction of a sequence of PEFs that approaches the optimal log-prob rate for the target distribution while simultaneously generating randomness at a positive rate for any other distribution violating the CHSH inequality.
In the course of this exercise, we will observe that the optimal adversarial attack (one generating the observed statistics, consistent with an expected trial distribution ρ, while asymptotically yielding $h_{\min}(\rho)$ per-trial randomness) is always achieved through a single-trial distribution that marginalises to ρ through a convex combination of a single extremal no-signalling non-local distribution and a local realistic distribution (which itself consists of a convex mixture of up to eight extremal local deterministic distributions). This is a notable feature, revealing that the adversary never needs to prepare more than one non-local distribution to simulate the observed distribution with as little min-entropy as possible. Later in this section, we explore the potential for generalisation of this feature to the (2,2,2) scenario restricted to quantum distributions ($\Pi_Q$); if true, this would be an important finding, outlining the optimal approach of a (more realistic) quantum-limited adversary attacking the PEF protocol. The general observation that preparing a single non-local state is preferable to preparing multiple underlies the significance of the answer to this question. We find some evidence that the feature (only requiring one extremal non-local distribution in the convex combination attack) may hold for $\Pi_Q$ in the (2,2,2) Bell scenario, but this may be a difficult question to resolve due to the complicated geometry of the quantum set. We also explore possible generalisations of this feature to no-signalling trial models for (n, m, k) Bell scenarios where n, m, or k exceed 2, and find that it does not hold in any of these cases; so the question of whether this holds in a given Bell scenario and trial model is non-trivial in general.
We begin with a brief review of the (2,2,2) Bell scenario and some features of the set Π NS of no-signalling distributions in this scenario.

A brief review of the (2,2,2) Bell scenario
The (2,2,2) Bell scenario is the minimal Bell scenario, comprising two spatially separated parties, Alice and Bob, each having two measurement settings and two possible outcomes corresponding to each setting. The measurement settings for Alice and Bob are represented by the RVs X, Y realising values x, y ∈ {0, 1}, and the measurement outcomes are represented by the RVs A, B realising values a, b ∈ {0, 1}. With $\sigma_s(XY)$ representing a fixed settings distribution, we refer to the sets $\Pi_{NS}$, $\Pi_Q$ and $\Pi_L$ as the no-signalling, quantum and local models, respectively, when they comprise distributions $\mu(ABXY) := \mu(AB|XY)\sigma_s(XY)$, where the conditional probabilities $\mu(AB|XY)$, referred to as behaviours, are constrained by the no-signalling, quantum and local realism principles, respectively. Henceforth, all distributions $\mu(ABXY)$ belonging to a model are defined as $\mu(ABXY) := \mu(AB|XY)\sigma_s(XY)$, and we associate a model with its constituent behaviours $\mu(AB|XY)$ or distributions $\mu(ABXY)$ interchangeably, since the settings distribution is fixed. Recall that the model $\Pi_{NS}$ is a polytope, the extremal points of which consist of the behaviours $\mu_{\mathrm{extr}}(AB|XY) \equiv \{\mu_{\mathrm{extr}}(ab|xy) : a, b, x, y \in \{0, 1\}\}$ defined below.
where $E_{xy} := \sum_{a,b=0}^{1}(-1)^{a+b}\mu(ab|xy)$ for x, y ∈ {0, 1}. The non-local algebraic maximum for the expression $B_{\alpha\beta\gamma}$ is 4. The local maximum of 2 is obtained by eight $\mu^{\alpha\beta\gamma\delta}_{\mathrm{LD}}(AB|XY)$ behaviours for each $B_{\alpha\beta\gamma}$. The sets $\mathrm{LD}_i$, i ∈ {1, 2, ..., 8}, each comprising eight LD behaviours saturating (i.e., achieving the value of 2) exactly one $B_{\alpha\beta\gamma}$, are shown in Table 2. A result proven in [Bie16] (see Theorems 2.1 and 2.2 therein) states that any behaviour violating (35) can be represented as a convex combination of the one PR box achieving the non-local maximum for $B_{\alpha\beta\gamma}$ and (up to) eight LD behaviours of the corresponding $\mathrm{LD}_i$ set saturating it. In fact, the geometry of the no-signalling polytope in this Bell scenario is such that there is a one-to-one correspondence between the non-local no-signalling extremal points, the PR boxes, in (33) and the non-trivial facets of the local polytope described by (35), with exactly one extremal point violating each facet up to the algebraic maximum of 4 for each choice of (α, β, γ) ∈ {0, 1}³. Hence, any non-local behaviour violating a given version of the CHSH-Bell inequality is contained in a non-local 8-simplex whose vertices are the one PR box that maximally violates that particular version and the eight LD behaviours that saturate it. Recall that a p-simplex is a p-dimensional polytope which is the convex hull of its p + 1 vertices. More formally, if the set $C := \{\vec a_0, \vec a_1, \ldots, \vec a_p\} \subset \mathbb{R}^n$ of p + 1 points is affinely independent, then the p-simplex determined by them is the following set of points:
$$\Delta^p := \Big\{\sum_{k=0}^{p}\theta_k\vec a_k \;:\; \theta_k \geq 0 \text{ for all } k, \ \sum_{k=0}^{p}\theta_k = 1\Big\}.$$
The affine independence condition means that the only admissible choice of $\theta_k \in \mathbb{R}$ such that $\sum_{k=0}^{p}\theta_k\vec a_k = \vec 0$ and $\sum_{k=0}^{p}\theta_k = 0$ are satisfied is $\theta_k = 0$ for all k; this holds if and only if the vectors $\vec a_k - \vec a_0$ are linearly independent for k = 1, 2, ..., p.
One can check that the PR box that achieves the non-local maximum for a given version of the CHSH-Bell expression $B_{\alpha\beta\gamma}$ and the eight LD behaviours that achieve the local maximum for it are affinely independent. Since a, b, x, y ∈ {0, 1} and $|\{0,1\}^4| = 16$, we can represent the behaviours µ(ab|xy) in this Bell scenario as vectors $\vec\mu \in \mathbb{R}^{16}$, as shown in Table 1. Then the affine independence is apparent: letting the PR box behaviour be $\vec a_0$ and the LD behaviours be the other $\vec a_k$, each $\vec a_k - \vec a_0$ term has a unique column where it contains a "1" while all of the other terms contain "0", ensuring linear independence.

Table 1: These probability vectors in $\mathbb{R}^{16}$ are the PR box $\vec\mu_{\mathrm{PR},1} \equiv \mu^{000}_{\mathrm{PR}}$ that achieves the non-local maximum of 4 and the eight LD behaviours $\vec\mu_{\mathrm{LD},1}, \ldots, \vec\mu_{\mathrm{LD},8}$ that achieve the local maximum of 2 for the standard CHSH-Bell expression $B_{000}$, with the LD behaviours corresponding to the eight probability tables numbered 1, 4, 5, 8, 9, 12, 14 and 15 in Table A2 of [Bie16], and also given in the first row of Table 2. One can verify the affine independence of the nine vectors above by verifying that the eight vectors obtained by subtracting the first vector from the remaining eight are linearly independent.
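The affine-independence claim can also be verified mechanically. The sketch below (pure Python; the vector ordering over (x, y, a, b) is our own convention) builds the PR box and the eight CHSH-saturating LD behaviours and checks that the eight difference vectors are linearly independent:

```python
from itertools import product

def ld_behaviour(a0, a1, b0, b1):
    """Local deterministic behaviour as a 16-vector indexed by (x, y, a, b):
    Alice outputs a_x on setting x, Bob outputs b_y on setting y."""
    v = []
    for x, y in product((0, 1), repeat=2):
        ax, by = (a0, a1)[x], (b0, b1)[y]
        for a, b in product((0, 1), repeat=2):
            v.append(1.0 if (a, b) == (ax, by) else 0.0)
    return v

def pr_box():
    """PR box maximising the standard CHSH expression: a XOR b = x AND y."""
    return [0.5 if (a ^ b) == (x & y) else 0.0
            for x, y in product((0, 1), repeat=2)
            for a, b in product((0, 1), repeat=2)]

def chsh(v):
    """B_000 = E_00 + E_01 + E_10 - E_11 for a behaviour vector v."""
    E = [sum((-1) ** (a + b) * v[4 * i + 2 * a + b]
             for a, b in product((0, 1), repeat=2))
         for i in range(4)]                       # i enumerates (x, y)
    return E[0] + E[1] + E[2] - E[3]

def rank(rows, tol=1e-9):
    """Matrix rank by Gaussian elimination (no external dependencies)."""
    m = [r[:] for r in rows]
    r = 0
    for c in range(len(m[0])):
        piv = next((i for i in range(r, len(m)) if abs(m[i][c]) > tol), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(len(m)):
            if i != r and abs(m[i][c]) > tol:
                f = m[i][c] / m[r][c]
                m[i] = [u - f * w for u, w in zip(m[i], m[r])]
        r += 1
    return r

# The eight LD behaviours saturating B_000 = 2, and the difference vectors.
lds = [ld_behaviour(*s) for s in product((0, 1), repeat=4)
       if abs(chsh(ld_behaviour(*s)) - 2.0) < 1e-9]
pr = pr_box()
diffs = [[u - w for u, w in zip(ld, pr)] for ld in lds]
```

A rank of 8 for the eight difference vectors confirms the affine independence of the nine behaviours.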

Robust PEFs and optimal adversarial attacks in the (2,2,2) Bell Scenario
We now examine the robustness of PEFs that are optimal for an anticipated distribution ρ and a fixed number of planned trials n. We first review how we find optimal PEFs in this scenario. The constrained maximisation routine in (8) provides a method to find useful PEFs with respect to an anticipated trial distribution, with Lemma 1 showing that the feasibility constraints in (8) can be restricted to only the distributions corresponding to the eight PR and sixteen LD behaviours (with a fixed settings distribution $\sigma_s(XY) > 0$). In practice, the number of trials n will affect the choice of β and the PEF that optimises the quantity $E_\rho[(n\log_2(F(CZ)) + \log_2(\epsilon))/\beta]$, a quantity which (per the discussion surrounding (8)) can be thought of as the anticipated amount of raw randomness from running the experiment whose trial distribution is expected to be ρ. If we divide this quantity by n, we arrive at a measure of expected randomness per trial for the optimal PEF at a given value of β, called the net log-prob rate: the function $\big(\max_F O_\rho(F;\beta)\big) + \log_2(\epsilon)/(n\beta)$. Figure 2 shows a plot of the net log-prob rates corresponding to two different values of n, as well as the supremum of the log-prob rate, for β varying from 0.001 to 0.1 and ϵ fixed at the value $10^{-4}$. The value of β, and the corresponding PEF, that maximises the curve is then the best choice for the given planned number of trials n.
The plot illustrates some notable features of PEFs. First, it was proved in Appendix D of [ZKB18] that, assuming a stable experiment (with each trial distribution ρ), the function $\sup_F O_\rho(F;\beta)$ is monotonically non-increasing in β > 0,† which implies that the global supremum of the log-prob rates, $\sup_{\beta>0}\sup_F O_\rho(F;\beta)$, over all PEFs with positive powers, is achieved in the limit β → 0. We observe this with the top curve. For a fixed ϵ, the net log-prob rate converges upwards to $\sup_F O_\rho(F;\beta)$ for each β as n → ∞, but for any fixed value of n, $\log_2(\epsilon)/(n\beta)$ diverges to −∞ as β → 0. Hence in a finite-trial regime the supremum of the log-prob rates (attainable by PEFs with positive powers) is not achieved; the maximum value of the net log-prob rate is achieved at a β away from 0. The general trend is that for a higher value of n, the net log-prob rate achieves a higher maximum value, at a lower value of β; the net log-prob rate is improved by a reduction in power and an increase in the number of trials. This is observed in Figure 2 for the two choices of $n = 1.5 \times 10^5$ and $n = 2.4 \times 10^5$.

† The proof that β′ < β implies $\sup_F O_\rho(F;\beta') \geq \sup_F O_\rho(F;\beta)$ is straightforward: write β′ = γβ with
The arguments above illustrate how it is necessary to consider a range of β values to find the optimal choice. We remark that there is an upper limit to the range of β values that must be considered: it was noted in [ZKB18] (see Appendix F therein) that there exists a certain threshold value $\beta^{NS}_{th}$ such that for all $\beta \geq \beta^{NS}_{th}$, the optimisation problem in (8) will return the same PEF independent of the choice of β, and [ZKB18] cites numerical evidence that this threshold is $\beta^{NS}_{th} \simeq 0.4151$. The following result, whose proof we give in the appendix, derives this threshold analytically, finding it to have the exact value $\log_2(4/3)$.
Proposition 1. For the set of behaviours Π NS , the PEF optimisation in (8) is independent of the power β for β ⩾ log 2 (4/3).

Proof. See Appendix F.
We now ask how optimal PEFs for lower and lower values of β (and correspondingly higher values of n) compare on the question of robustness, in the following sense: can a PEF optimised with respect to a distribution ρ violating the standard CHSH-Bell inequality be used to certify randomness of distributions that are different from ρ, provided they violate the same CHSH-Bell inequality? This question is relevant because in practice, the observed experimental distribution will never be exactly the same as the anticipated one, and may be somewhat different depending on many potential factors. Figure 3 gives an illustration of the matter of robustness. Comparing the two plots of the log-prob rate for quantum-realisable distributions on the two-dimensional slice (shown in Figure 4b) above the standard CHSH-Bell facet, we observe that the level set denoting a zero amount of certified randomness in the right hand plot (which corresponds to a lower value of β than that on the left) is pushed further down to (almost touching) the standard CHSH-Bell facet.
This suggests that the asymptotic optimality of a PEF need not entail a trade-off with its robustness; indeed we observed that in many cases, as β > 0 assumes smaller and smaller values, the PEF optimised for a fixed ρ violating the standard CHSH-Bell inequality gets more and more robust in the sense that it certifies randomness at a positive rate (asymptotically) for increasingly statistically different σ.
0 < γ < 1; then for any F in the scope of $\sup_F O_\rho(F;\beta)$, it turns out $F^\gamma$ is a PEF with power β′, for which the equality $O_\rho(F^\gamma;\beta') = O_\rho(F;\beta)$ follows immediately from the definition of the log-prob rate; hence the supremum of log-prob rates cannot be smaller at β′. $F^\gamma$ is a PEF with power β′ as $E_\sigma[F^\gamma\sigma(c|z)^{\beta\gamma}] \leq \big(E_\sigma[F\sigma(c|z)^{\beta}]\big)^\gamma \leq 1^\gamma = 1$, with the first inequality holding by Jensen's inequality ($f(x) = x^\gamma$ is concave) and the second because F is a PEF with power β.

Figure 2: The dotted curve is the log-prob rate $\sup_F O_\rho(F;\beta)$, an upper bound for the net log-prob rate in the limit as n → ∞. We selected 200 equally spaced points in the interval (0.001, 0.1) for β and performed the maximisation $\max_F E_\rho[\log_2(F(ABXY))]$ constrained by (1) the non-negativity of PEFs and (2) the defining condition $E_\mu[F(ABXY)\mu(AB|XY)^\beta] \leq 1$ at all distributions µ corresponding to the eight PR and sixteen LD behaviours with a fixed uniform settings distribution µ(xy) = 1/4 for all x, y ∈ {0, 1}. The anticipated distribution ρ used here was the one corresponding to the behaviour given in Table I in [KZB20]. We observe that the maximum value for the net log-prob rate, indicated by the solid vertical lines, is achieved at a lower value of β for a higher value of n.

We show that this is a general feature. To this end, we define a sequence of PEFs that is both asymptotically optimal with respect to the log-prob rate and asymptotically robust
in the sense that given any distribution violating the standard CHSH-Bell inequality, all the PEFs beyond a point in the sequence certify randomness at a positive rate. To construct this PEF sequence, we first define the function $K^*(ABXY)$ as shown below:
$$K^*(abxy) := \begin{cases} 1 & \text{if } a \oplus b = xy,\\ -3 & \text{otherwise}. \end{cases} \tag{36}$$
The function defined in (36) is an entropy estimator for the distributions in the no-signalling polytope when the settings are equiprobable, i.e., $\sigma_s(xy) = 1/4$ for all choices of x and y. To see this, recalling Definition 6, we can check by direct evaluation that $K^*$ satisfies the inequality $E_\sigma[K(CZ)] \leq E_\sigma[-\log_2(\sigma(C|Z))]$ when σ is each of the extremal points of the no-signalling polytope. It is sufficient to check this condition for the extremal points of the no-signalling set, i.e., the PR behaviours and the LD behaviours. This is because if σ is expressible as $\sigma = \lambda\sigma_1 + (1-\lambda)\sigma_2$, then for any function K satisfying the condition at $\sigma_1$ and $\sigma_2$, the linearity of $E_\sigma[K]$ in σ together with the concavity of the conditional Shannon entropy $E_\sigma[-\log_2(\sigma(C|Z))]$ in σ implies that K satisfies the condition at σ as well. Hence if the condition holds for the extremal points, it will hold for all points in the set.

Fig. 3: A heat map illustrating the robustness of a PEF, with the log-prob rate as the figure of merit, evaluated for behaviours σ(ab|xy) on the two-dimensional slice of the set of quantum behaviours (shown in Figure 4b) above the standard CHSH-Bell facet. The behaviours on the two-dimensional slice are parametrised by S and S′ as shown in (44), with the added restrictions $S^2 + (S')^2 \leq 8$ and $2 \leq S \leq 2\sqrt{2}$, $-2 \leq S' \leq 2$ (see also Table 3). Assuming a uniform distribution for the settings, $\sigma_s(xy) = 1/4$ for all x, y, we plot the log-prob rate $\sum_{abxy}\sigma(ab|xy)\sigma_s(xy)\log_2(F^*(abxy))/\beta$ for all distributions in the slice. The black dot corresponds to the behaviour (and hence the distribution) with respect to which we perform the PEF optimisation for a fixed n and ϵ to obtain $F^*$. The coordinates for the black dot are (S′, S) ≡ (0, 2.6).
To see that it does, we confirm by inspection that E_σ[K*] attains the value 1 for the PR behaviour achieving the no-signalling maximum of 4 for the standard CHSH function, the value −3 for the PR behaviour achieving −4, and the value −1 for each of the six PR behaviours achieving the value 0; each of these is less than or equal to the conditional Shannon entropy of the respective PR behaviour, which is 1. Likewise, we can check that K* is a valid entropy estimator for all the LD behaviours: it takes the value zero for the eight local deterministic distributions in Table 1 achieving the local maximum of 2 for the standard CHSH function and the value −2 for the other eight, while H(AB|XY) = 0 for all of these distributions. Hence, we have verified that K* satisfies the entropy estimator condition for all the extremal behaviours, and by extension for all behaviours in the no-signalling polytope.
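These checks can be reproduced numerically. The sketch below is a minimal illustration only: it assumes the reconstruction K*(abxy) = (S(abxy) − 2)/2 with uniform settings, and the standard parametrisation of the eight PR behaviours by the condition a ⊕ b = xy ⊕ αx ⊕ βy ⊕ γ; it evaluates E_σ[K*] and H_σ(AB|XY) at all 24 extremal behaviours.

```python
import math
from itertools import product

outcomes = list(product((0, 1), repeat=2))
settings = list(product((0, 1), repeat=2))

def kstar(a, b, x, y):
    # assumed reconstruction: K*(abxy) = (S(abxy) - 2)/2, where
    # S(abxy) = (-1)^(xy) (-1)^(a+b) / sigma_s(xy) and sigma_s(xy) = 1/4
    return (4 * (-1) ** (x * y + a + b) - 2) / 2

def stats(mu):
    """Return (E_sigma[K*], H_sigma(AB|XY)) for uniform settings."""
    ek, h = 0.0, 0.0
    for (x, y) in settings:
        for (a, b) in outcomes:
            p = mu[(x, y)][(a, b)]
            if p > 0:
                ek += 0.25 * p * kstar(a, b, x, y)
                h -= 0.25 * p * math.log2(p)
    return ek, h

def pr_box(al, be, ga):
    # PR-type extremal behaviour: a XOR b = xy XOR al*x XOR be*y XOR ga
    return {(x, y): {(a, b): 0.5 if (a ^ b) == ((x * y) ^ (al * x) ^ (be * y) ^ ga) else 0.0
                     for (a, b) in outcomes} for (x, y) in settings}

def ld_box(a0, a1, b0, b1):
    # local deterministic behaviour with outputs a_x and b_y
    return {(x, y): {(a, b): 1.0 if (a, b) == ((a0, a1)[x], (b0, b1)[y]) else 0.0
                     for (a, b) in outcomes} for (x, y) in settings}

prs = [stats(pr_box(*t)) for t in product((0, 1), repeat=3)]
lds = [stats(ld_box(*t)) for t in product((0, 1), repeat=4)]
```

Running this confirms the entropy-estimator inequality E_σ[K*] ⩽ H_σ(AB|XY) at every extremal point, with the values 1, −3 and −1 (six times) for the PR behaviours and 0 or −2 for the LD behaviours.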
Having shown that K* is an entropy estimator, we next consider a sequence of functions {F_k}_{k=1}^∞, where F_k is defined according to the construction in Theorem 6:

F_k(ABXY) = 2^{β_k (K*(ABXY) − e^{−k})},  (37)

where we choose a positive β_k making F_k a PEF for each k, whose existence is guaranteed by the theorem. By construction, for each k the function F_k is a valid PEF with power β_k > 0 for the set of no-signalling distributions. The log-prob rate of F_k at σ is

O_σ(F_k; β_k) = E_σ[log₂(F_k(ABXY))]/β_k = E_σ[K*(ABXY)] − e^{−k}.  (38)

We show robustness of the sequence in the following sense: for any σ ∈ Π_NS violating the standard CHSH-Bell inequality, the log-prob rate of the sequence of PEFs {F_k}_{k=1}^∞ is eventually positive. To see this, recall that, as discussed in our brief review of the (2,2,2) Bell scenario, behaviours violating the standard CHSH-Bell inequality are contained in the non-local 8-simplex ∆⁸_{PR,1} (see Table 2). Hence, σ is expressible as a convex combination of the vertices of ∆⁸_{PR,1}:

σ = λ_{PR,1} µ_{PR,1} + Σ_{i=1}^{8} α_i µ_{LD,i},  (39)

where λ_{PR,1} + Σ_{i=1}^{8} α_i = 1. This decomposition allows us to express the log-prob rate in terms of the standard CHSH-Bell function, which we define as S(ABXY) = (−1)^{XY}(−1)^{A+B}/σ_s(XY), where σ_s(XY) is the fixed settings distribution. We see that λ_{PR,1} = (S_σ − 2)/2 in (39), where S_σ is the expected standard CHSH-Bell value according to the distribution σ(ABXY) = σ(ab|xy)σ_s(xy). This follows by computing the expectation of S according to the PR-box distribution µ_{PR,1}(abxy) = µ_{PR,1}(ab|xy)σ_s(xy), which is 4, and the expectation of S according to each local distribution µ_{LD,i}(abxy) = µ_{LD,i}(ab|xy)σ_s(xy), which is 2; taking the expectation of S on both sides of (39) then gives S_σ = 4λ_{PR,1} + 2(1 − λ_{PR,1}). The log-prob rate O_σ(F_k; β_k) for F_k at σ is then expressed as

O_σ(F_k; β_k) = λ_{PR,1} E_{µ_{PR,1}}[K*] + Σ_{i=1}^{8} α_i E_{µ_{LD,i}}[K*] − e^{−k}.  (40)

Since E_{µ_{LD,i}}[K*] evaluates to zero for each µ_{LD,i} and E_{µ_{PR,1}}[K*] evaluates to 1, the expression for O_σ(F_k; β_k) reduces to O_σ(F_k; β_k) = (S_σ − 2)/2 − e^{−k}. As k → ∞, O_σ(F_k; β_k) → (S_σ − 2)/2, and so the quantity is eventually strictly positive provided S_σ > 2, i.e., provided σ violates the standard CHSH-Bell inequality.
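A member of this sequence can be checked directly. The sketch below is a numerical illustration under stated assumptions: it takes the reconstructed forms K*(abxy) = (S(abxy) − 2)/2 and F_k = 2^{β_k(K* − e^{−k})}, fixes k = 1 with the candidate power β₁ = 0.01, verifies the PEF condition E_µ[F₁ µ(ab|xy)^{β₁}] ⩽ 1 at all 24 extremal no-signalling behaviours, and then evaluates the log-prob rate at the behaviour saturating Tsirelson's bound.

```python
import math
from itertools import product

outcomes = list(product((0, 1), repeat=2))
settings = list(product((0, 1), repeat=2))
beta, k = 0.01, 1   # candidate power beta_1 and sequence index

def kstar(a, b, x, y):
    # assumed reconstruction of the entropy estimator: K* = (S - 2)/2
    return (4 * (-1) ** (x * y + a + b) - 2) / 2

def F(a, b, x, y):
    # assumed Theorem 6 construction: F_k = 2^(beta_k (K* - e^(-k)))
    return 2.0 ** (beta * (kstar(a, b, x, y) - math.exp(-k)))

def pr_box(al, be, ga):
    return {(x, y): {(a, b): 0.5 if (a ^ b) == ((x * y) ^ (al * x) ^ (be * y) ^ ga) else 0.0
                     for (a, b) in outcomes} for (x, y) in settings}

def ld_box(a0, a1, b0, b1):
    return {(x, y): {(a, b): 1.0 if (a, b) == ((a0, a1)[x], (b0, b1)[y]) else 0.0
                     for (a, b) in outcomes} for (x, y) in settings}

def pef_lhs(mu):
    # PEF condition: E_mu[F(ABXY) mu(AB|XY)^beta] <= 1 with uniform settings
    return sum(0.25 * mu[(x, y)][(a, b)] * F(a, b, x, y) * mu[(x, y)][(a, b)] ** beta
               for (x, y) in settings for (a, b) in outcomes if mu[(x, y)][(a, b)] > 0)

extremals = ([pr_box(*t) for t in product((0, 1), repeat=3)]
             + [ld_box(*t) for t in product((0, 1), repeat=4)])
worst = max(pef_lhs(mu) for mu in extremals)   # worst <= 1 certifies F as a PEF

# log-prob rate at the Tsirelson-point behaviour sigma(ab|xy)
sigma = {(x, y): {(a, b): (1 + (-1) ** (a + b) * (-1) ** (x * y) / math.sqrt(2)) / 4
                  for (a, b) in outcomes} for (x, y) in settings}
rate = sum(0.25 * sigma[(x, y)][(a, b)] * math.log2(F(a, b, x, y))
           for (x, y) in settings for (a, b) in outcomes) / beta
```

The computed rate agrees with the closed-form value (S_σ − 2)/2 − e^{−1} for S_σ = 2√2, which is strictly positive.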
Continuing our discussion on robustness, a different perspective on it would be to ask: given a PEF F with power β > 0 optimised with respect to the distribution ρ, how far in terms of total-variation distance can another distribution σ be such that the same PEF (with the same power) can be used to certify randomness? Theorem 7 provides a sufficient condition for the robustness of a positive, non-constant PEF F = G^β with power β in the following sense: assuming the log-prob rate of F at ρ is positive, the log-prob rate of F at a different distribution σ is positive if d_TV(ρ, σ) is within a certain bound (as given in (32)). For the sequence {F_k}_{k=1}^∞ of PEFs, the upper bound on d_TV(ρ, σ) given in (32) evaluates to (1/4)((S_ρ − 2)/2 − e^{−k}). It is worthwhile to observe that, given a standard CHSH-Bell inequality violating distribution ρ, this upper bound approaches the strength of non-locality of ρ, which is expressed as (S_ρ − 2)/8. The strength of non-locality quantifies how far the non-local no-signalling distribution ρ is from the local set Π_L [BAC18]. It is defined as follows:

d_NL(ρ) = (1/(|X||Y|)) · (1/2) min_{τ∈Π_L} Σ_{abxy} |ρ(ab|xy) − τ(ab|xy)|,  (41)

where the minimum is over all distributions τ belonging to the local set Π_L. In the definition of d_NL(ρ) in (41) we have assumed a uniform settings distribution, as is evident from the factor 1/(|X||Y|), where |X| and |Y| denote the number of measurement settings choices for Alice and Bob, respectively (both equal to 2 in the (2,2,2) Bell scenario). A theorem in [Bie16] (see Theorem 3.1) identifies the local distribution τ achieving the minimum (1/2) min_{τ∈Π_L} Σ_{abxy} |ρ(ab|xy) − τ(ab|xy)| in (41) and shows that this minimum equals the weight (S_ρ − 2)/2 on the PR box in the expression of ρ as a convex combination of the vertices of ∆⁸_{PR,1}; per the definition in (41), it follows that d_NL(ρ) = (S_ρ − 2)/8.
Thus, the bound (1/4)((S_ρ − 2)/2 − e^{−k}) from Theorem 7 approaches (S_ρ − 2)/8, which is the strength of non-locality d_NL(ρ) for ρ. This illustrates that a bound of this form cannot be improved, in the sense that increasing the total variation distance from ρ by any positive amount will encompass local distributions, which cannot certify randomness.
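For concreteness, the following snippet (a small illustration, assuming an anticipated distribution ρ with S_ρ = 2√2, e.g. the Tsirelson-point behaviour) tabulates the bound (1/4)((S_ρ − 2)/2 − e^{−k}) for increasing k against the limiting value d_NL(ρ) = (S_ρ − 2)/8.

```python
import math

S_rho = 2 * math.sqrt(2)            # expected CHSH value of the anticipated distribution
d_nl = (S_rho - 2) / 8              # strength of non-locality, (S_rho - 2)/8

# Theorem 7 robustness bound for the k-th PEF in the sequence
bounds = [((S_rho - 2) / 2 - math.exp(-k)) / 4 for k in (1, 2, 5, 10, 20)]

for k, b in zip((1, 2, 5, 10, 20), bounds):
    print(k, b)
```

The printed bounds increase monotonically towards d_NL(ρ) ≈ 0.1036 without ever exceeding it.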
Thus {F_k}_{k=1}^∞ is fully robust as k → ∞. Next, we confirm that {F_k}_{k=1}^∞ is asymptotically optimal in terms of min-entropy per trial (i.e., log-prob rate) for any distribution σ violating the standard CHSH inequality. Since Π_NS is closed and equal to the convex hull of its extremal points, Theorem 4 implies that, given such a σ, the adversary has a strategy obtained through an IID attack based on a single-trial distribution whose conditional Shannon entropy is equal to the infimum defined in (24). We can identify this attack. The optimisation in (24) can be expressed as follows:

h_min(σ) = inf_ν { Σ_e ν(e) H_{ν_e}(AB|XY) : Σ_e ν(e) ν_e = σ, ν_e ∈ Π_NS },  (42)

where ν_e = ν(ABXY|e). We compute H_ν(AB|XY E) for the decomposition of σ given in (39), where we have noted λ_{PR,1} = (S_σ − 2)/2. Since the conditional Shannon entropy is one for PR boxes and zero for LD behaviours, we obtain H_ν(AB|XY E) = (S_σ − 2)/2, and hence h_min(σ) is no larger than this value. But since this expression is the same as that of the asymptotic log-prob rate of the sequence {F_k}_{k=1}^∞ of valid PEFs, h_min(σ) is also no smaller than this value, and so h_min(σ) = (S_σ − 2)/2. This demonstrates the asymptotic optimality of the sequence {F_k}_{k=1}^∞ in the sense that the PEFs in the sequence get arbitrarily close to certifying an asymptotic randomness rate of h_min(σ).
In our proof of the asymptotic optimality of the sequence {F_k}_{k=1}^∞, we identified the optimal attack by an adversary: it is to prepare the decomposition in (39), with each e corresponding to one of the (up to) nine extremal behaviours, with respective ν(e) weights of λ_{PR,1} and α_i. This can be seen to be the unique attack achieving h_min(σ), through an argument we sketch as follows: (1) any ν-decomposition of σ can be improved upon (i.e., H_ν(AB|XY E) reduced) by considering only extremal ν_e, by the concavity of conditional Shannon entropy; (2) any decomposition including positive weights on more than one PR box can be strictly improved upon by one with weight on a single PR box, by Theorem 2.1 of [Bie16], which shows how to replace equal mixtures of two PR boxes with mixtures of a single PR box and local deterministic distributions; (3) this decomposition can be further strictly improved via Theorem 2.2 of [Bie16] by replacing any local deterministic distributions not saturating the CHSH-Bell inequality with ones that do (the improvement being obtained by decreasing the weight on the sole remaining PR box). The resulting decomposition, that of (39), is thus the unique optimiser of (42). It witnesses the bound of 1 + dim(Π_NS) = 1 + 8 = 9 on the cardinality of the set E (as shown in Theorem 4). In general, positive weight on all 9 extremal boxes may be necessary, owing to their affine independence, which was noted in Section 4.1. One can confirm this visually from Table 1: weight on the (only) non-local distribution, the PR box, is necessary to violate the CHSH-Bell inequality, and any distribution with non-zero probabilities for each possible outcome (a property possessed by, for example, the quantum distribution saturating Tsirelson's bound) will require positive weight on all the local deterministic behaviours, as each LD behaviour corresponds to a distinct sole appearance of the number "1" in a column otherwise populated by zeroes in Table 1.
This witnesses that further reduction of the 1+dim(Π NS ) bound on |E| in Theorem 4 is impossible, and so this bound is optimal.
It is an important observation that the adversary needs to prepare only one non-classical state in her realisation of the optimal attack, since the preparation of a non-classical state is likely the most difficult aspect of the attack. We now explore possible generalisations of this feature to other trial models.
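The optimal attack described above can be exhibited concretely. The sketch below makes illustrative assumptions: it uses the canonical PR box a ⊕ b = xy and the quantum behaviour saturating Tsirelson's bound with correlators C_xy = (−1)^{xy}/√2; it computes λ_{PR,1} = (S_σ − 2)/2, solves by least squares for the weights α_i on the eight CHSH-saturating LD behaviours, and checks that they form a valid probability decomposition of σ.

```python
import math
from itertools import product
import numpy as np

settings = list(product((0, 1), repeat=2))
outcomes = list(product((0, 1), repeat=2))

def flatten(mu):
    # 16-vector of conditional probabilities mu(ab|xy)
    return np.array([mu[(x, y)][(a, b)] for (x, y) in settings for (a, b) in outcomes])

def ld_box(a0, a1, b0, b1):
    return {(x, y): {(a, b): 1.0 if (a, b) == ((a0, a1)[x], (b0, b1)[y]) else 0.0
                     for (a, b) in outcomes} for (x, y) in settings}

def chsh(mu):
    # S = E00 + E01 + E10 - E11
    E = {(x, y): sum(mu[(x, y)][(a, b)] * (-1) ** (a + b) for (a, b) in outcomes)
         for (x, y) in settings}
    return E[(0, 0)] + E[(0, 1)] + E[(1, 0)] - E[(1, 1)]

pr1 = {(x, y): {(a, b): 0.5 if (a ^ b) == x * y else 0.0 for (a, b) in outcomes}
       for (x, y) in settings}
# quantum behaviour saturating Tsirelson's bound
sigma = {(x, y): {(a, b): (1 + (-1) ** (a + b) * (-1) ** (x * y) / math.sqrt(2)) / 4
                  for (a, b) in outcomes} for (x, y) in settings}

lam = (chsh(sigma) - 2) / 2                                  # weight on the PR box
lds = [ld_box(*t) for t in product((0, 1), repeat=4)]
sat = [flatten(m) for m in lds if abs(chsh(m) - 2) < 1e-9]   # the 8 saturating LDs

A = np.column_stack(sat)
resid = (flatten(sigma) - lam * flatten(pr1)) / (1 - lam)    # local remainder
alpha, *_ = np.linalg.lstsq(A, resid, rcond=None)            # barycentric weights
```

For this σ the local remainder turns out to be the uniform mixture of the eight saturating LD behaviours, so each α_i = 1/8 and H_ν(AB|XY E) = λ_{PR,1} = √2 − 1.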

Characterising the optimal attack in different scenarios
We start by exploring the possibility of arriving at a similar analytic characterisation of the optimal adversarial attack when the adversary is limited to quantum-realisable distributions. Suppose now that our trial model is the set Π_Q of quantum-achievable distributions for the (2,2,2) scenario. The adversary is still constrained to performing probabilistic attacks that simulate the trial statistics while generating the least amount of randomness possible; however, she now tries to mimic the trial statistics using quantum-achievable distributions. The optimisation routine depicting this goal is

inf_ω Σ_e ω(e) H_{ω_e}(AB|XY)  subject to  Σ_e ω(e) ω_e = σ, ω_e ∈ Π_Q,  (43)

where ω_e = ω(ABXY|e). The set Π_Q is compact and convex but, unlike Π_NS, is not a polytope, and so there is a continuum of extremal points. We conjecture that the minimum in (43) is achieved at a distribution that marginalises to the observed trial distribution through a convex combination of (only) one quantum extremal distribution violating the standard CHSH-Bell inequality and no more than eight local deterministic distributions that saturate the same inequality.
An attempt to prove this will require an understanding of the geometry of the quantum set, and in particular of its extremal points. We do not yet have a complete characterisation of the set of behaviours Π_Q (in the full R⁸ space), although a recent work has conjectured an analytic criterion for extremality in the CHSH scenario [MK23]. However, a characterisation does exist when we make the assumption of unbiased marginals, µ(A = 0|x) = µ(A = 1|x) = 1/2 for all x ∈ {0, 1} and µ(B = 0|y) = µ(B = 1|y) = 1/2 for all y ∈ {0, 1}, in which case the set of behaviours is four-dimensional. The unbiased-marginal case has been completely characterised; a detailed exposition can be found in [Le+23] (see Theorem 1 therein).
Fig. 4: (a) The 2-dimensional slice of behaviours parametrised by S and S′, passing through ⃗µ₀, the maximally random behaviour obtained as the equal mixture of all 16 local deterministic behaviours. The disk S² + (S′)² ⩽ 8 represents the set of quantum behaviours. (b) The portion of the 2-dimensional slice containing the no-signalling (including quantum-achievable) behaviours above the standard CHSH-Bell facet. For a fixed behaviour ⃗µ_Q in the interior of the quantum region, the darker shaded region corresponds to possible ways of expressing ⃗µ_Q as a convex combination of a behaviour on the quantum boundary and a behaviour on the local boundary (for example, ⃗µ_Q = λ⃗ν_Q + (1 − λ)⃗ν_L, λ ∈ (0, 1)). For the same behaviour ⃗µ_Q, the lighter shaded region represents possible ways of expressing it as a convex combination of two behaviours on the quantum boundary (for example, ⃗µ_Q = δ⃗θ_{Q,1} + (1 − δ)⃗θ_{Q,2}, δ ∈ (0, 1)).
A key enabling step towards characterising the optimal attack in the unbiased-marginals case would be to determine whether the following two conditions hold simultaneously. First, a convex combination of any two extremal quantum behaviours can be expressed equivalently as a different convex combination of one extremal quantum behaviour (different from the previous two) and classical noise (a mixture of the local deterministic behaviours); i.e., for extremal quantum behaviours ⃗µ₁, ⃗µ₂, the convex combination λ⃗µ₁ + (1 − λ)⃗µ₂ can be re-expressed as the convex combination δ⃗µ₃ + (1 − δ)⃗µ₀, where λ, δ ∈ (0, 1), ⃗µ₃ is a third extremal quantum behaviour, and ⃗µ₀ is a mixture of the local deterministic behaviours. Second, λH_{µ₁}(AB|XY) + (1 − λ)H_{µ₂}(AB|XY) ⩾ δH_{µ₃}(AB|XY), where the term (1 − δ)H_{µ₀}(AB|XY) that might be expected to appear on the right vanishes due to the concavity of conditional Shannon entropy and the fact that it is zero for the local deterministic behaviours into which ⃗µ₀ can be decomposed.
A numerical check, by means of an exhaustive search, of whether these two conditions hold simultaneously (in the uniform-marginals case) introduces a large number of free variables. If we impose additional symmetry on the behaviours with uniform marginals and restrict ourselves to the 2-dimensional slice shown in Figure 4a,† where the behaviours are given by the formula (44) and take the form displayed in Table 3, then one can perform the numerical search to see whether the two conditions mentioned above hold simultaneously; we did observe them to hold in some initial numerical investigations comparing the ⃗θ decompositions against the ⃗ν decompositions, as depicted in Figure 4b.

Table 3: Tabular representation of the no-signalling behaviours on the 2-dimensional slice shown in Figure 4a. The behaviours have uniform marginals; i.e., the probability of observing an outcome conditioned on a measurement setting is 1/2 for each party, for all outcomes and settings. The behaviours are further constrained in having the third and fourth rows completely determined by the first and second, which need not hold in general for uniform-marginal distributions and brings the dimensionality down from four to two. Any behaviour so represented is parametrised by the values S and S′ of the two versions of the CHSH-Bell expression, E₀₀ + E₀₁ + E₁₀ − E₁₁ and −E₀₀ + E₀₁ + E₁₀ + E₁₁, respectively: s₁ = (4 + S − S′)/16, s₂ = (4 + S′ − S)/16, s₃ = (4 + S + S′)/16, s₄ = (4 − S − S′)/16, where for the no-signalling set −4 ⩽ S′ + S ⩽ 4 and −4 ⩽ S′ − S ⩽ 4, and for the quantum set S² + (S′)² ⩽ 8.
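A minimal version of such a numerical search on the slice can be sketched as follows. The code below makes illustrative assumptions: it uses the Table 3 parametrisation (correlators C₀₀ = −C₁₁ = (S − S′)/4 and C₀₁ = C₁₀ = (S + S′)/4), treats the boundary circle S² + (S′)² = 8 as the candidate set of extremal quantum behaviours on the slice, writes any interior mixture of two boundary points as δ⃗µ₃ + (1 − δ)⃗µ₀ with ⃗µ₃ the radial projection onto the circle and ⃗µ₀ the centre, and checks the entropy inequality λH₁ + (1 − λ)H₂ ⩾ δH₃ over a grid.

```python
import math

def binh(p):
    # binary entropy in bits, with h(0) = h(1) = 0
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def cond_entropy(S, Sp):
    """H(AB|XY) for a slice behaviour parametrised by (S, S') as in Table 3:
    mu(ab|xy) = (1 + (-1)^(a+b) C_xy)/4, so each setting contributes
    1 + h((1 + |C_xy|)/2); settings are uniform."""
    C00 = (S - Sp) / 4.0
    C01 = (S + Sp) / 4.0
    return 1.0 + 0.5 * (binh((1 + C00) / 2) + binh((1 + C01) / 2))

R = math.sqrt(8.0)   # boundary circle S^2 + (S')^2 = 8 of the quantum region
angles = [2 * math.pi * i / 72 for i in range(72)]
worst_margin = float("inf")
for i, t1 in enumerate(angles):
    H1 = cond_entropy(R * math.cos(t1), R * math.sin(t1))
    for t2 in angles[i + 1:]:
        H2 = cond_entropy(R * math.cos(t2), R * math.sin(t2))
        for lam in (0.25, 0.5, 0.75):
            # mixture of the two boundary behaviours, in (S, S') coordinates
            s = lam * R * math.cos(t1) + (1 - lam) * R * math.cos(t2)
            sp = lam * R * math.sin(t1) + (1 - lam) * R * math.sin(t2)
            delta = math.hypot(s, sp) / R      # weight on the boundary point mu_3
            th3 = math.atan2(sp, s)            # radial projection onto the circle
            H3 = cond_entropy(R * math.cos(th3), R * math.sin(th3))
            worst_margin = min(worst_margin,
                               lam * H1 + (1 - lam) * H2 - delta * H3)
```

On this grid the margin is never negative, consistent with the two conditions holding simultaneously on the slice.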
Going beyond the minimal Bell scenario, we considered the possibility of a similar characterisation of the optimal no-signalling adversarial attack in higher (n, m, k) Bell scenarios. In the (2,2,2) Bell scenario the analytic characterisation of the optimal adversarial attack relied crucially upon geometric features of the no-signalling polytope, namely Theorems 2.1 and 2.2 in [Bie16]: that equal mixtures of two PR behaviours are expressible as equal mixtures of four distinct LD behaviours and, consequently, that a behaviour violating any of the eight versions (up to local relabelling of the outcomes and settings) of the CHSH-Bell inequality is expressible as a convex combination of the one PR behaviour achieving the non-local maximum and (up to) eight LD behaviours achieving the local maximum of the corresponding CHSH-Bell expression. These geometric features, however, do not extend to the no-signalling polytopes of higher (n, m, k) Bell scenarios. Membership of equal mixtures of extremal no-signalling non-local behaviours in the local polytope holds solely in the (2,2,2) Bell scenario. Below we provide examples of equal mixtures of no-signalling non-local extremal behaviours in the (2, 2, 3), (2, 3, 2) and (3, 2, 2) Bell scenarios that do not belong to the local polytope. One can use linear programming to check the non-locality of such examples, since assessing the locality of a behaviour is an instance of the membership problem for the local polytope.

† This can be done as follows: a behaviour with uniform marginals can be completely specified by the correlators (E₀₀, E₀₁, E₁₀, E₁₁), where −1 ⩽ E_xy ⩽ 1 for all x, y (see the line following (35) for the definition of E_xy). To obtain behaviours in the 2-dimensional slice shown in Figure 4a one can restrict attention to distributions of the form µ(ab|xy) = (1/4)(1 + (−1)^{a+b}C_xy), where C₀₀ = −C₁₁ = (E₀₀ − E₁₁)/2 and C₀₁ = C₁₀ = (E₀₁ + E₁₀)/2.
Since the local deterministic (LD) behaviours are the extremal points of the local polytope, we can formulate our problem as a feasibility linear program. Suppose {⃗µ_{LD,1}, ⃗µ_{LD,2}, ..., ⃗µ_{LD,#LD}} is the set of LD behaviours for some Bell scenario. The vector ⃗µ_{LD,i} ∈ R^d denotes the joint probability of outcomes conditioned on the input choices, and d is the dimension of the ambient space in which the vector lies. The feasibility linear program has the variable ⃗x ∈ R^{#LD}. The inequality constraints comprise x_i ⩾ 0, i ∈ [#LD], and the equality constraints are Σ_{i=1}^{#LD} x_i = 1 together with

Σ_{i=1}^{#LD} x_i ⃗µ_{LD,i} = ⃗µ_{NS extr},

where ⃗µ_{NS extr} is a non-local no-signalling extremal behaviour. The details on formulating the dual of this linear program can be found in Section E.2.1 of the appendix of [Sca19].
The extremal points of the no-signalling polytope comprise the local deterministic (LD) behaviours and the non-local extremal behaviours. The LD behaviours consist of all possible deterministic assignments of an outcome to each setting of each party. The number of such assignments is #LD = |A|^{n|X|}, where |A| is the number of outcomes and |X| the number of settings per party (taken here to be the same for each of the n parties). Corresponding to each assignment λ = (λ₁, ..., λ_n) ∈ Λ_LD, where λ_j maps each setting of party j to a deterministic outcome, the LD probabilities are expressed as

µ_{LD,k}(a₁a₂...a_n|x₁x₂...x_n) = Π_{j=1}^{n} [[a_j = λ_j(x_j)]],
where [[·]] is the function that evaluates to 1 if the condition within holds and 0 otherwise. A behaviour ⃗µ_L is local if it can be expressed as ⃗µ_L = Σ_{k=1}^{#LD} q_k ⃗µ_{LD,k}, where q_k ⩾ 0 and Σ_{k=1}^{#LD} q_k = 1.
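The feasibility program above can be sketched in a few lines. The code below is a minimal (2,2,2) illustration using scipy's `linprog`: it enumerates the 16 LD behaviours, and tests local-polytope membership for the uniformly random behaviour (local) and the PR box (non-local).

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

settings = list(product((0, 1), repeat=2))
outcomes = list(product((0, 1), repeat=2))

def flatten(mu):
    return [mu[(x, y)][(a, b)] for (x, y) in settings for (a, b) in outcomes]

# the 16 local deterministic behaviours of the (2,2,2) scenario
ld_vectors = []
for a0, a1, b0, b1 in product((0, 1), repeat=4):
    mu = {(x, y): {(a, b): 1.0 if (a, b) == ((a0, a1)[x], (b0, b1)[y]) else 0.0
                   for (a, b) in outcomes} for (x, y) in settings}
    ld_vectors.append(flatten(mu))

def is_local(mu_vec):
    """Feasibility LP: does mu_vec = sum_i x_i LD_i with x_i >= 0, sum_i x_i = 1?"""
    A_eq = np.vstack([np.array(ld_vectors).T, np.ones(len(ld_vectors))])
    b_eq = np.concatenate([mu_vec, [1.0]])
    res = linprog(np.zeros(len(ld_vectors)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * len(ld_vectors))
    return res.status == 0   # 0: feasible (local); 2: infeasible (non-local)

pr = flatten({(x, y): {(a, b): 0.5 if (a ^ b) == x * y else 0.0
                       for (a, b) in outcomes} for (x, y) in settings})
uniform = [0.25] * 16
```

The same structure carries over to any (n, m, k) scenario by re-enumerating the LD behaviours.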
(2, 2, 3) Bell scenario: This scenario is an instance of the more general (2, 2, k) scenario, also known in the literature as the CGLMP scenario [Col+02], for k = 3. In this bipartite scenario each party has a choice between two settings, each with 3 outcomes. The extremal behaviours of the no-signalling polytope for the CGLMP scenario have been fully described in [Bar+05]. The non-local no-signalling extremal behaviours for the (2, 2, 3) scenario, up to relabelling of inputs and outcomes, are given by (47), where a, b ∈ {0, 1, 2} and x, y ∈ {0, 1} are the outputs and inputs of the parties, respectively. We found that (45) does not necessarily hold for all equal mixtures of a pair of distinct non-local extremal behaviours. Among the several examples we found that violate (45), Table 4 shows one.

Table 4: Two non-local extremal behaviours for the CGLMP scenario with 3 outcomes whose equal mixture is non-local. The inputs x, y ∈ {0, 1} and the outcomes a, b ∈ {0, 1, 2}, with x′ = x ⊕ 1, y′ = y ⊕ 1 and a′ = a ⊕₃ 1, a″ = a ⊕₃ 2, b′ = b ⊕₃ 1, b″ = b ⊕₃ 2. The symbol ⊕ denotes addition modulo 2 and ⊕₃ denotes addition modulo 3. The missing entries correspond to 0. The top behaviour comes directly from (47), while the bottom behaviour is obtained through the relabelling x ↔ x′ and y ↔ y′. An equal mixture of these two boxes lies outside the local polytope.
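An example of this kind can be checked directly with the feasibility program. The sketch below makes an assumption about the missing formula: it takes the extremal behaviour of (47) in the standard form of [Bar+05], µ(ab|xy) = 1/3 iff b − a ≡ xy (mod 3); it then forms the equal mixture of this box and its x ↔ x′, y ↔ y′ relabelling and confirms that the mixture lies outside the local polytope.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

settings = list(product((0, 1), repeat=2))
outcomes = list(product(range(3), repeat=2))

def box(p):
    # flatten p(a, b, x, y) into a 36-vector ordered by (settings, outcomes)
    return [p(a, b, x, y) for (x, y) in settings for (a, b) in outcomes]

def ns_extremal(a, b, x, y):
    # assumed form of the non-local extremal behaviour in (47)
    return 1 / 3 if (b - a) % 3 == x * y else 0.0

def relabelled(a, b, x, y):
    # relabelling x <-> x', y <-> y'
    return ns_extremal(a, b, 1 - x, 1 - y)

mixture = [(u + v) / 2 for u, v in zip(box(ns_extremal), box(relabelled))]

# the 81 local deterministic behaviours: outputs (a0, a1, b0, b1) in {0,1,2}^4
lds = []
for a0, a1, b0, b1 in product(range(3), repeat=4):
    lds.append(box(lambda a, b, x, y, t=(a0, a1, b0, b1):
                   1.0 if (a, b) == ((t[0], t[1])[x], (t[2], t[3])[y]) else 0.0))

def is_local(vec):
    A_eq = np.vstack([np.array(lds).T, np.ones(len(lds))])
    b_eq = np.concatenate([vec, [1.0]])
    res = linprog(np.zeros(len(lds)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * len(lds))
    return res.status == 0
```

Under these assumptions the LP reports the equal mixture (and, of course, each extremal box) as non-local, while the uniform behaviour remains local.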
(3, 2, 2) Bell scenario: This is a tripartite scenario with each party having binary input choices and outcomes. The no-signalling polytope consists of 46 inequivalent classes of extremal behaviours, of which one is the class comprising the 64 LD behaviours. A complete characterisation can be found in [PBS11]. As an example violating (45) we can refer to the observation made in Section 2.3 of [PBS11] that an equal mixture of two behaviours in Class 46 (see Table 1 of [PBS11]) is a GHZ correlation, which is expressed (entirely in terms of correlators ⟨A_xB_yC_z⟩, with outcomes a, b, c ∈ {−1, +1}) as P_GHZ(abc|xyz) = (1/8)(1 + abc⟨A_xB_yC_z⟩). ⃗P_GHZ is a non-local behaviour obtained by measuring the state (1/√2)(|000⟩ + |111⟩) in suitable local bases [GHZ07].
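The non-locality of such a GHZ correlation can be verified without an LP, via the Mermin expression M = ⟨A₀B₀C₀⟩ − ⟨A₀B₁C₁⟩ − ⟨A₁B₀C₁⟩ − ⟨A₁B₁C₀⟩, whose value is at most 2 for local behaviours. The sketch below assumes a particular measurement convention (input 0 measures X and input 1 measures Y on the GHZ state, giving ⟨XXX⟩ = 1, ⟨XYY⟩ = ⟨YXY⟩ = ⟨YYX⟩ = −1, and all other correlators zero) and compares the GHZ value of M against the deterministic maximum.

```python
from itertools import product

# correlators of P_GHZ under the assumed convention: input 0 -> X, input 1 -> Y
E = {(x, y, z): 0.0 for x, y, z in product((0, 1), repeat=3)}
E[(0, 0, 0)] = 1.0
E[(0, 1, 1)] = E[(1, 0, 1)] = E[(1, 1, 0)] = -1.0

def mermin(corr):
    return corr[(0, 0, 0)] - corr[(0, 1, 1)] - corr[(1, 0, 1)] - corr[(1, 1, 0)]

ghz_value = mermin(E)

# maximum of the Mermin expression over all 64 deterministic strategies,
# with outcomes a_x, b_y, c_z in {-1, +1}
local_max = max(
    a0 * b0 * c0 - a0 * b1 * c1 - a1 * b0 * c1 - a1 * b1 * c0
    for a0, a1, b0, b1, c0, c1 in product((-1, 1), repeat=6)
)
```

The GHZ correlation reaches M = 4, while no deterministic (and hence no local) behaviour exceeds 2, so ⃗P_GHZ lies outside the local polytope.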

Conclusion
In this work, we revisited the probability estimation framework with the goal of presenting a complete and self-contained proof of its optimality in the asymptotic regime and obtaining a better characterisation of optimal adversarial attack strategies on the protocol. We obtained in Theorem 4 an improved and tight upper bound on the cardinality of the set of states needed in the optimal attack, and studied the implications of this result for specific scenarios in Section 4. We also considered the question of robustness for the PEF method, finding that asymptotic optimality of PEFs (in terms of randomness generation rate) need not entail a trade-off with robustness to small deviations from expected experimental behaviour.
In proving the optimality of the framework, our results show that there is nothing to be gained, asymptotically, by an adversary implementing memory attacks: an i.i.d. attack is asymptotically optimal. However, in real-world applications this may not hold. The number of trials in a Bell experiment is finite, albeit large, and there are unavoidable correlations between successive trials (referred to as memory effects). We leave to future work considerations of side-channel attacks in the non-asymptotic (finite-trials) regime for the probability estimation framework.

A Proofs for Theorems 1 and 2
First, we present the proof for Theorem 1.
Theorem. Suppose µ : C^n × Z^n × E → [0, 1] is a distribution of CZE such that µ_e(CZ) ∈ Θ for each e ∈ E. Then for fixed β, ϵ > 0,

P_{µ_e}(µ_e(C|Z) ⩾ (ϵ Π_{i=1}^{n} F_i(C_iZ_i))^{−1/β}) ⩽ ϵ

holds for each e ∈ E, where F_i(C_iZ_i) is the probability estimation factor for the i-th trial.
Proof. The sequence of random variables C, Z represent the time-ordered sequence of n trial results. For the remainder of the proof we omit conditioning on E = e since the result holds for each realisation. Hence, µ(· · · ), P µ (· · · ) and E µ [· · · ] must be understood to mean µ e (· · · ), P µe (· · · ) and E µe [· · · ].
Observe that for any i ∈ {1, ..., n − 1} we have the identity (48), where the first equality is an elementary manipulation of conditional probabilities and the second equality follows from (49), with the second step following from the second condition in (1), applied directly in the numerator and in the denominator. Now consider the sequence Q_i = µ(C_{⩽i}|Z_{⩽i})^β Π_{j=1}^{i} F_j for i ⩾ 1, where we note that Q_i is a random variable determined by C_{⩽i}, Z_{⩽i}. We begin by showing that, conditioned on C_{⩽i}, Z_{⩽i}, the expectation of Q_{i+1} is at most Q_i for all i ∈ {1, ..., n − 1}. Applying (49), we can write (50), where the fact that Q_i is determined by C_{⩽i}, Z_{⩽i} allows us to pull it out of the conditional expectation, and the inequality follows from Definition 1 holding for all realisations c_{⩽i}, z_{⩽i} of C_{⩽i}, Z_{⩽i}. We remark that the inequality in (50) shows that Q_i is a super-martingale.† Now, using the law of iterated expectation, we obtain (51). Since Q₁ equals µ(C₁|Z₁)^β F(C₁Z₁), it satisfies E_µ[Q₁] ⩽ 1 directly from Definition 1, and so repeated applications of (51) yield E_µ[Q_n] ⩽ 1. We can then use Markov's inequality to obtain the required result:

P_µ(Q_n ⩾ 1/ϵ) ⩽ ϵ E_µ[Q_n] ⩽ ϵ,

which is the claimed bound, since Q_n ⩾ 1/ϵ holds exactly when µ(C|Z) ⩾ (ϵ Π_{i=1}^{n} F_i)^{−1/β}.
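The Markov bound at the end of the proof can be probed numerically. The sketch below is a Monte Carlo illustration under stated assumptions: trial results are drawn i.i.d. from the Tsirelson-point behaviour with uniform settings, and the single-trial PEF is taken as F(abxy) = 2^{β(K*(abxy) − e^{−1})} with K*(abxy) = 2(−1)^{xy+a+b} − 1 and β = 0.01 (a choice that can be checked to satisfy the PEF condition at the no-signalling extreme points); the code estimates P_µ(Q_n ⩾ 1/ϵ) and compares it with ϵ.

```python
import math
import random

random.seed(7)
beta, eps, n, reps = 0.01, 0.05, 1000, 500

def kstar(a, b, x, y):
    return 2 * (-1) ** (x * y + a + b) - 1

def sample_trial():
    x, y = random.randrange(2), random.randrange(2)
    # Tsirelson-point behaviour: mu(ab|xy) = (1 + (-1)^(a+b) (-1)^(xy)/sqrt(2))/4
    probs = {(a, b): (1 + (-1) ** (a + b) * (-1) ** (x * y) / math.sqrt(2)) / 4
             for a in (0, 1) for b in (0, 1)}
    r, acc = random.random(), 0.0
    for (a, b), p in probs.items():
        acc += p
        if r <= acc:
            return a, b, x, y, p
    return a, b, x, y, p

exceed = 0
for _ in range(reps):
    logq = 0.0   # log2 of Q_n = mu(C|Z)^beta * prod_i F_i
    for _ in range(n):
        a, b, x, y, p = sample_trial()
        logq += beta * math.log2(p) + beta * (kstar(a, b, x, y) - math.exp(-1))
    if logq >= math.log2(1 / eps):
        exceed += 1

frequency = exceed / reps   # empirical estimate of P(Q_n >= 1/eps)
```

In this i.i.d. setting the empirical frequency stays far below ϵ, reflecting the looseness of the Markov step rather than any tightness of the bound.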
Next, we present the proof for Theorem 2.
Theorem. Let µ be a distribution µ : C^n × Z^n × E → [0, 1] of CZE such that for each e ∈ E the following holds for every ϵ ∈ (0, 1):

P_{µ_e}(µ_e(C|Z) ⩽ (ϵ Π_{i=1}^{n} F_i)^{−1/β}) ⩾ 1 − ϵ,  (52)

where F_i is a PEF with power β for the i-th trial. For a fixed choice of ϵ ∈ (0, 1) and p ⩾ |C|^{−n}, define the event S := {(ϵ Π_{i=1}^{n} F_i)^{−1/β} ⩽ p}. Then if κ is a positive number for which P_µ(S) ⩾ κ, the following holds:

H^{ϵ/κ}_{avg,∞,µ}(C|ZE; S) ⩾ log₂(κ) − log₂(p).  (53)

† The term F_i µ(C_i|Z_iZ_{⩽i−1}C_{⩽i−1})^β is non-negative, is determined by C_{⩽i}, Z_{⩽i}, and satisfies the conditions required for the conditional-expectation manipulations above.

Proof. The goal is to construct a distribution ω of CZE such that it is within ϵ/κ TV-distance of µ(CZE|S), and such that the average conditional maximum probability of C conditioned on (and averaged over) ZE is bounded above by p/κ. We will construct ω to satisfy ω(cze) = 0 for all values of z and e for which µ(ze|S) = 0. Hence for the rest of the construction we will restrict attention to cases where µ(ze|S) > 0. We will use expressions such as P_{µ_e}(S) and µ_e(S) interchangeably. We start by defining the event

R := {µ_e(C|Z) ⩽ (ϵ Π_{i=1}^{n} F_i)^{−1/β}},

whose occurrence or non-occurrence is determined by the particular realisation of e, c, and z. The event R corresponds to the desired probability bound holding; (52) ensures that this event occurs with high probability, and we will construct our distribution ω to, in an intuitive sense, extend this desirable behaviour from R ∩ S to all of S. We begin the construction by defining, for each fixed e satisfying µ_e(S) > 0, a non-negative function f : C^n × Z^n → R⁺ as shown below:

f(cz) := µ_e(cz|S) [[(c, z) ∈ R]].
The weight w of f, defined as w(f) = Σ_{c,z} f(cz), satisfies w(f) ⩽ 1, since f(cz) ⩽ µ_e(cz|S) and Σ_{c,z} µ_e(cz|S) = 1; here [[·]] is equal to 1 if the condition or expression within holds and 0 otherwise. (Note that f is a sub-probability distribution on cz: a set of non-negative numbers whose sum is less than or equal to 1. Defining a sub-probability distribution is a standard device for constructing a distribution by invoking the lemmas of Section D.) Below we show that w satisfies w(f) ⩾ 1 − ϵ/P_{µ_e}(S).
Indeed,

w(f) = Σ_{c,z} µ_e(cz|S) [[(c, z) ∈ R]] = µ_e(R|S) = µ_e(R ∩ S)/µ_e(S) ⩾ (µ_e(S) − ϵ)/µ_e(S) = 1 − ϵ/P_{µ_e}(S),  (55)

where in (55) we have used the fact that P_{µ_e}(R) ⩾ 1 − ϵ holds for each e ∈ E, as follows from (52). Next, we define a non-negative function f̃_z : C^n → R⁺ for each z ∈ Z^n for which µ_e(z|S) > 0:

f̃_z(c) := f(cz)/µ_e(z|S).

We show below that for each such z, f̃_z(c) is bounded by µ_e(c|z, S) for all c ∈ C^n. We have

f̃_z(c) = µ_e(c|z, S) [[(c, z) ∈ R]] ⩽ µ_e(c|z, S), and f̃_z(c) ⩽ µ_e(c|z) µ_e(z)/µ_e(z, S) [[(c, z) ∈ R]] ⩽ p µ_e(z)/µ_e(z, S).

Above, we have used the fact that the event S ∩ R implies µ_e(C|Z) ⩽ (ϵ Π_{i=1}^{n} F_i)^{−1/β} ⩽ p. The bound p µ_e(z)/µ_e(z, S) ⩾ p ⩾ |C|^{−n} also holds, since µ_e(z)/µ_e(z, S) ⩾ 1. Hence, using the lemmas in Section D we can construct, for each z under consideration, a distribution µ′_z(C) that dominates f̃_z(C) and whose maximum probability is at most p µ_e(z)/µ_e(z, S), where w(f̃_z) ⩽ 1 is the weight of f̃_z(C). Now we are ready to define the distribution ω(CZE) as

ω(cze) := µ′_z(c) µ_e(z|S) µ(e|S)

for the values of e, z under consideration, and ω(cze) := 0 otherwise. We show that the total variation distance between ω and µ(CZE|S) is bounded by ϵ/κ and that the average ze-conditional maximum probability of C is bounded by p/κ. The equality in (58) follows because ω(cze) = µ(cze|S) = 0 for the values of e, z removed from the sums, and µ′_z(c) is defined for the remaining values of e, z. In (59) we add and subtract f̃_z(c) inside the absolute value expression of the previous step and use the triangle inequality, following which we use the facts established above that both µ′_z(C) and µ(C|ze, S) = µ_e(C|z, S) dominate f̃_z(C). (60) follows from the fact that µ′_z(c)µ_e(z|S) and µ(c|ze, S) sum to 1 over the appropriate variables (being distributions), and (61) follows from f̃_z(c) = f(cz)/µ_e(z|S) and the fact that f(cz) = 0 in cases where µ_e(z|S) = 0. Finally, the first inequality in (62) follows from (55) and the last inequality follows from P_µ(S) ⩾ κ. Next, we show the upper bound on the average conditional maximum probability.
Hence, we have obtained an upper bound on the average conditional maximum probability in (63). Since by definition the ϵ/κ-smooth average conditional min-entropy involves a maximum (over the set B ϵ/κ (µ)) of the quantity on the left hand side of (64), the final result follows.

B Proofs using Convex Geometry
Here we prove Theorems 4 and 5 using arguments from convex geometry.
Theorem. Suppose Π is closed and equal to the convex hull of its extreme points. Then there is a distribution ω(CZE) ∈ Σ_E with |E| = 1 + dim Π such that H_ω(C|ZE) = h_min(ρ(CZ)).
Proof. We will be analysing h_min(·) as a function with domain Π. It is useful to re-write h_min(·) in the form

h_min(ρ) = inf { Σ_{i∈I} p_i H_{σ_i}(C|Z) : {σ_i}_{i∈I} ⊆ Π, Σ_{i∈I} p_i σ_i = ρ },

where the infimum is taken over all finite subsets {σ_i}_{i∈I} ⊆ Π for which Σ_{i∈I} p_i σ_i = ρ for some collection of non-negative p_i summing to 1.† We first observe that the scope of the infimum can be reduced to consider only sets of σ_i belonging to Π_extr, the set of extreme points of Π. This follows from the fact that conditional Shannon entropy is concave:† any expression in the scope of the infimum defining h_min can always be decreased (or at least left unchanged) by replacing each σ_i in the expression with a convex combination of extremal behaviours replicating σ_i. Π is a subset of R^N, where N = |Z| × |C| is the number of conditional probabilities appearing in the behaviour. In general, N is strictly larger than dim Π: the constraint that certain elements of Π need to form valid probability distributions reduces the dimension, and no-signalling equalities can reduce the dimension further. So we seek to re-parametrise the elements of Π using only the number of coordinates necessary based on its dimension. The (affine) dimension of Π is by definition the dimension of the smallest affine space containing it, that is, the intersection of all affine subspaces of R^N containing Π, which is itself an affine space. Let us call this smallest affine space A.

† This is equivalent to the earlier definition if we set ω(CZ|e_i) = σ_i(CZ) and ω(e_i) = p_i.

† The proof of Theorem 43 in [KZB20] correctly notes that the concavity of conditional Shannon entropy can be obtained as a specialisation of the concavity of the quantum conditional entropy. It is worth noting, however, that the classical-only result can be obtained much more quickly and directly, as shown in Appendix C.
If dim A = m, then there is a set of m linearly independent vectors ⃗v_i and a displacement/base vector ⃗b such that any σ ∈ A has a unique representation as

σ = ⃗b + Σ_{i=1}^{m} c_i ⃗v_i.  (65)

For any σ ∈ Π ⊆ A, then, we can uniquely represent σ as the vector of these coefficients, (c₁, c₂, ..., c_m).
We would like to construct an affine-linear map g : R^N → R^m whose restriction to A maps the N-coordinate vector σ to its m-coordinate representation (c₁, c₂, ..., c_m).† Our affine-linear map will be represented by a matrix M and a vector ⃗k such that g(σ) = Mσ + ⃗k = (c₁, ..., c_m). To construct M and ⃗k, let V be the N × m matrix whose m columns are the vectors ⃗v_i appearing in (65). Since the columns of V are linearly independent, V^T V is invertible, as its kernel consists only of the zero vector: V^T V⃗x = ⃗0 implies ⃗x^T V^T V⃗x = ∥V⃗x∥² = 0, so V⃗x = ⃗0 and hence ⃗x = ⃗0. We can thus define M = (V^T V)^{−1} V^T, which satisfies MV = I (M is a pseudo-inverse of V), and so M maps the vectors ⃗v_i to the standard basis vectors in R^m. Setting ⃗k = −M⃗b yields the desired g(·).
We point out a couple of properties of g that we will use in our arguments. First, it commutes with convex combinations: for a set of non-negative p_i satisfying Σ_i p_i = 1 and a collection of elements σ_i of A,

g(Σ_i p_i σ_i) = Σ_i p_i g(σ_i),  (66)

which follows directly from expressing g as M(·) + ⃗k and noticing that Σ_i p_i ⃗k = ⃗k. Second, M is injective when restricted to A, so consequently g is a bijection between A and R^m; in particular,

g(σ) = g(σ′) implies σ = σ′ for σ, σ′ ∈ A.  (67)

Now, let us consider the following subset of R^{m+1}:†

Ξ_extr = {(g(σ), H_σ(C|Z)) : σ ∈ Π_extr},

† Our approach here makes explicit the arguments only alluded to in the proof of Theorem 43 in [KZB20] through general referral to existence and extension theorems in convex analysis, and takes full advantage of the fact that we are always working in a large ambient R^N, allowing us to harness the strength of linear algebra.
where the first m coordinates of an element of Ξ_extr are the coordinates of g(σ) and the (m+1)-th coordinate is H_σ(C|Z). Define

Ξ = conv(Ξ_extr),  (68)

where 'conv' denotes the convex hull.† Ξ_extr and Ξ are artificial constructions, but by studying their geometry we can prove the existence of a convex combination achieving the infimum defining h_min(ρ). We first confirm that Ξ_extr is indeed the set of extremal points of Ξ (as suggested by our choice of names); i.e., we confirm that Ξ_extr contains only trivial convex combinations of its elements. To see this, note that if Σ_i p_i (g(σ_i), H_{σ_i}(C|Z)) = (g(σ), H_σ(C|Z)) holds for some σ_i, σ ∈ Π_extr and non-negative p_i satisfying Σ_i p_i = 1, then we must have Σ_i p_i g(σ_i) = g(σ) and so Σ_i p_i σ_i = σ by (66) and (67). This can only be a trivial convex combination (i.e., all σ_i with non-zero p_i coefficient must equal σ), as the σ_i and σ are assumed to be in Π_extr.

† The development here is inspired by the arguments in the appendix of [Uhl98], though the assumptions and conclusions differ somewhat.
Second, we show that the point (g(ρ), h_min(ρ)) is on the boundary of Ξ, i.e., that (g(ρ), h_min(ρ)) is a limit point of Ξ and also a limit point of Ξ^C. To see that we can converge to this point from within the set, note that for any set of σ_i ∈ Π_extr satisfying Σ_i p_i σ_i = ρ, we have by definition Σ_i p_i (g(σ_i), H_{σ_i}(C|Z)) ∈ Ξ, which can be re-expressed as (g(ρ), Σ_i p_i H_{σ_i}(C|Z)) ∈ Ξ by invoking (66). By the nature of the infimum defining h_min(ρ), there must be a sequence of such elements of Ξ whose last component forms a non-increasing sequence converging to h_min(ρ); since the first m components are identically g(ρ), this sequence converges to (g(ρ), h_min(ρ)) as desired. Similarly, one can also converge to (g(ρ), h_min(ρ)) from outside the set Ξ, as (g(ρ), h_min(ρ) − ϵ) ∉ Ξ for all ϵ > 0. This is because all elements of Ξ take the form Σ_i p_i (g(σ_i), H_{σ_i}(C|Z)) for some collection σ_i ∈ Π_extr, and if the first m coordinates are equal to g(ρ), then by (66) and (67) we must have Σ_i p_i σ_i = ρ, and so the (m+1)-th coordinate is a term contributing to the infimum defining h_min(ρ); it cannot be less than h_min(ρ).
We would now like to demonstrate that (g(ρ), h_min(ρ)) is contained in Ξ. As a first step, we show that (g(ρ), h_min(ρ)) ∈ conv(cl(Ξ_extr)), the convex hull of the closure of Ξ_extr (we write cl(·) for closure). To see this, first note that Ξ_extr is bounded: the (m+1)-th coordinate is a Shannon entropy, which is non-negative with a maximum value set by the cardinality of the value space of C, while the first m coordinates are contained in the image of the set Π_extr under the continuous map g, and since Π_extr is contained in the compact set P = [0, 1]^n (P contains all probability distributions), this image must be contained in the compact (and thus bounded) set g(P). As Ξ_extr is bounded, its closure cl(Ξ_extr) must be bounded as well and so is compact. It is a known fact that the convex hull of a compact set in R^n is compact, so conv(cl(Ξ_extr)) is compact, and so in particular closed. Finally, conv(cl(Ξ_extr)) clearly contains Ξ = conv(Ξ_extr), the convex hull of a smaller set; as a closed set containing Ξ, it will contain the Ξ-boundary point (g(ρ), h_min(ρ)). Now we show that this implies containment in Ξ proper. Since the map h(ρ) := (g(ρ), H_ρ(C|Z)), with image in R^{m+1}, is continuous on the domain of n-dimensional probability distributions and Π_extr is bounded, we have cl(h(Π_extr)) ⊆ h(cl(Π_extr)),† and since by definition Ξ_extr = h(Π_extr), we may write cl(Ξ_extr) ⊆ h(cl(Π_extr)).
Now using (68), (69), the definition of h(·), and finally (66), we can write

(g(ρ), h_min(ρ)) = ∑_i p_i h(τ_i) = ∑_i p_i (g(τ_i), H_{τ_i}(C|Z)) = (g(∑_i p_i τ_i), ∑_i p_i H_{τ_i}(C|Z)).    (70)

Comparing the first expression in the above sequence to the last and applying (67) implies that ∑_i p_i τ_i = ρ. Now, since by assumption Π is closed, Π_extr ⊆ Π implies cl(Π_extr) ⊆ Π, so Π = conv(Π_extr) implies that elements of cl(Π_extr) can be expressed as convex combinations of elements of Π_extr. Thus in the expression ∑_i p_i τ_i, any non-extremal τ_i can be replaced with convex combinations of elements of Π_extr to yield a convex combination ∑_j q_j σ_j equalling ρ, where the concavity of conditional Shannon entropy implies that ∑_j q_j H_{σ_j}(C|Z) is not larger than ∑_i p_i H_{τ_i}(C|Z). However, by (70), ∑_i p_i H_{τ_i}(C|Z) = h_min(ρ), and since ∑_j q_j H_{σ_j}(C|Z) cannot be smaller than h_min(ρ), it must equal h_min(ρ). As ∑_j q_j σ_j = ρ and ∑_j q_j H_{σ_j}(C|Z) = h_min(ρ), one more application of (66) yields

(g(ρ), h_min(ρ)) = ∑_j q_j (g(σ_j), H_{σ_j}(C|Z)),

which is in Ξ.
The argument thus far demonstrates the existence of a convex combination of Π_extr elements explicitly achieving the infimum in the definition of h_min(ρ). We now further demonstrate that the number of Π_extr elements required in such an optimal decomposition is no greater than m + 1.
We first note that since (g(ρ), h_min(ρ)) is on the boundary of the convex set Ξ, the supporting hyperplane theorem guarantees a supporting hyperplane H_ρ with (g(ρ), h_min(ρ)) ∈ H_ρ and Ξ entirely on one side of H_ρ. Now, notice that if we decompose (g(ρ), h_min(ρ)) as a convex combination of Ξ_extr elements, these elements must all lie in the hyperplane H_ρ: this is because any elements strictly on one side of H_ρ would have to be counterbalanced by elements strictly on the other side of H_ρ, but since one side of H_ρ is disjoint from Ξ, this is not possible. Applying the same observation to any other element of H_ρ ∩ Ξ, it follows that H_ρ ∩ Ξ is contained in the convex hull of H_ρ ∩ Ξ_extr. As the reverse inclusion follows from the convexity of H_ρ and the fact that Ξ = conv(Ξ_extr), we can write conv(H_ρ ∩ Ξ_extr) = H_ρ ∩ Ξ. Now since H_ρ ∩ Ξ is at most m-dimensional (H_ρ, as a hyperplane, has one fewer dimension than the ambient (m+1)-dimensional space), we can invoke Carathéodory's theorem to see that at most m + 1 points of H_ρ ∩ Ξ_extr are required to replicate (g(ρ), h_min(ρ)) as a convex combination. Thus we have (g(ρ), h_min(ρ)) = ∑_{i∈I} p_i w⃗_i for some {w⃗_i}_{i∈I} ⊆ Ξ_extr with |I| ⩽ m + 1, and so, recalling the definition of Ξ_extr and invoking (66) one last time, we can write that for some integer m* satisfying 1 ⩽ m* ⩽ m + 1,

(g(ρ), h_min(ρ)) = ∑_{i=1}^{m*} p_i (g(σ_i), H_{σ_i}(C|Z)).

By (67), ∑_{i=1}^{m*} p_i σ_i = ρ, and so {σ_i}_{i=1}^{m*} induces the desired distribution ω(CZE) by setting ω(CZ|e_i) = σ_i(CZ) and ω(e_i) = p_i.

† For any bounded subset S of R^n (such as Π_extr) and continuous h, we have cl(h(S)) ⊆ h(cl(S)). Proof: any x ∈ cl(h(S)) must be the limit of a sequence in h(S); let {s_i}_{i=1}^∞ be a sequence in S with h(s_i) → x. Since S is bounded, {s_i}_{i=1}^∞ has a convergent subsequence {s_j}_{j=1}^∞ with limit in cl(S); let s ∈ cl(S) be this limit. By continuity, h(s_j) → h(s); considered as a subsequence of {h(s_i)}_{i=1}^∞, we also have h(s_j) → x, and so uniqueness of limits implies x = h(s) ∈ h(cl(S)).
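Carathéodory's theorem is invoked non-constructively above, but the reduction it promises can be carried out explicitly. As an illustrative aside (not part of the proof), the following sketch implements the standard weight-shifting reduction in exact rational arithmetic; the function names `affine_dependence` and `caratheodory_reduce` are ours.

```python
from fractions import Fraction

def affine_dependence(points):
    """Find a nontrivial (lam_i) with sum_i lam_i = 0 and sum_i lam_i * x_i = 0.

    Assumes len(points) > d + 1, so the homogeneous (d+1) x k system below
    always has a free variable.  Exact rational Gaussian elimination.
    """
    k, d = len(points), len(points[0])
    # d coordinate equations plus the weight equation sum_i lam_i = 0
    A = [[Fraction(points[j][i]) for j in range(k)] for i in range(d)]
    A.append([Fraction(1)] * k)
    pivots, r = [], 0
    for c in range(k):
        piv = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if piv is None:
            continue
        A[r], A[piv] = A[piv], A[r]
        A[r] = [v / A[r][c] for v in A[r]]
        for i in range(len(A)):
            if i != r and A[i][c] != 0:
                A[i] = [a - A[i][c] * b for a, b in zip(A[i], A[r])]
        pivots.append(c)
        r += 1
    free = next(c for c in range(k) if c not in pivots)
    lam = [Fraction(0)] * k
    lam[free] = Fraction(1)
    for row, c in zip(A, pivots):
        lam[c] = -row[free]
    return lam

def caratheodory_reduce(points, weights):
    """Reduce a convex combination in R^d to one using at most d + 1 points."""
    points, weights = list(points), [Fraction(w) for w in weights]
    d = len(points[0])
    while len(points) > d + 1:
        lam = affine_dependence(points)
        # Shift weights along the dependence until some weight hits zero;
        # sum_i lam_i = 0 guarantees a strictly positive lam_i exists.
        t = min(w / l for w, l in zip(weights, lam) if l > 0)
        weights = [w - t * l for w, l in zip(weights, lam)]
        keep = [i for i, w in enumerate(weights) if w > 0]
        points = [points[i] for i in keep]
        weights = [weights[i] for i in keep]
    return points, weights

# Demo: the barycentre of five points in R^2, re-expressed with <= 3 points.
pts, wts = caratheodory_reduce(
    [(0, 0), (1, 0), (0, 1), (1, 1), (Fraction(1, 2), Fraction(1, 2))],
    [Fraction(1, 5)] * 5)
assert len(pts) <= 3 and sum(wts) == 1
```

The reduced combination uses at most d + 1 = 3 points and reproduces the original target point exactly, mirroring the step in the proof where the decomposition of (g(ρ), h_min(ρ)) is pruned to at most m + 1 extremal points.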
Theorem. Suppose Π satisfies the conditions of Theorem 4 and ρ is in the interior of Π. Then there exists an entropy estimator whose entropy estimate at ρ is equal to h min (ρ).
Proof. We continue from where we left off in the proof of Theorem 4, and show that the supporting hyperplane H_ρ discussed in that proof can be used to construct an affine function that is the desired entropy estimator. Recall that the dimension of Π, which is embedded in a higher-dimensional vector space R^N, is defined as the affine dimension of A, the smallest affine subspace containing Π. Given this context, the assumption that ρ is in the interior of Π means that there exists an open ϵ-ball U in R^N such that U ∩ A, which is open in the subspace topology, is contained in Π.† First, we note that g(ρ) is in the interior of g(Π). To see this, consider the restriction g↾_A of g to A, which is a bijection with affine-linear inverse map (g↾_A)^{−1} : R^m → A given by A(·) + b⃗ (recalling the construction following (65) in the proof of Theorem 4). This ensures that the set g↾_A(U ∩ A) must be open, as it is equal to the inverse image of U ∩ A under the map (g↾_A)^{−1}, which in turn is equal to the inverse image of the open set U under the (continuous) map A(·) + b⃗ : R^m → R^N. Hence g(ρ) is contained in the open set g(U ∩ A), which is a subset of g(Π) since U ∩ A ⊆ Π. Now we take a closer look at H_ρ, the supporting hyperplane touching Ξ at (g(ρ), h_min(ρ)). As a hyperplane, H_ρ is equal to the set of x⃗ satisfying an equation of the form a⃗ · x⃗ = b for some fixed a⃗ ∈ R^{m+1} and b ∈ R, where · denotes the dot product, and the condition

ξ ∈ Ξ ⇒ a⃗ · ξ ⩾ b    (71)

expresses algebraically the notion that Ξ is on one side of H_ρ. We argue that the fact that g(ρ) is in the interior of g(Π) implies that the (m+1)-th component of a⃗, denoted a⃗_{m+1}, must be nonzero. Assume a⃗_{m+1} = 0 for a proof by contradiction: since (g(ρ), h_min(ρ)) is the point of contact of the supporting hyperplane H_ρ, we have a⃗ · (g(ρ), h_min(ρ)) = b, which implies a⃗_{[m]} · g(ρ) = b, where a⃗_{[m]} ∈ R^m denotes the vector consisting of the first m coordinates of a⃗.
Since the previous paragraph demonstrated that there is an open subset of g(Π) containing g(ρ), the point g(ρ) − c a⃗_{[m]} for a sufficiently small positive c is equal to g(ϕ) for some ϕ ∈ Π. By construction, ϕ satisfies a⃗_{[m]} · g(ϕ) < b, but since a⃗_{m+1} = 0 this forces a⃗ · (g(ϕ), h_min(ϕ)) < b as well. This would imply (g(ϕ), h_min(ϕ)) ∉ Ξ; however, this is a contradiction, as the arguments of Theorem 4 show that (g(ϕ), h_min(ϕ)) ∈ Ξ for any ϕ ∈ Π (those arguments were given for ρ, but they apply to any element of Π).
We now use f_ρ ∘ g, where f_ρ(y⃗) := (b − a⃗_{[m]} · y⃗)/a⃗_{m+1} is the affine function whose graph is H_ρ (i.e., a⃗ · (y⃗, f_ρ(y⃗)) = b), to construct the desired entropy estimator as follows. Writing g(σ) = Mσ + k⃗ for the affine form of g (cf. the construction following (65)), we have

f_ρ(g(σ)) = (b − a⃗_{[m]} · (Mσ + k⃗))/a⃗_{m+1} = n⃗ · σ + d,

where d = (b − a⃗_{[m]} · k⃗)/a⃗_{m+1} is a constant and n⃗ = −(1/a⃗_{m+1}) M^T a⃗_{[m]} is an N-dimensional vector; that is, it has one component for each possible distinct outcome pair (c, z) of the random variable pair (C, Z). Now we can define K(c, z) := n⃗_{cz} + d to obtain a function of C, Z satisfying

E_σ[K(C, Z)] = ∑_{c,z} σ(c, z)(n⃗_{cz} + d) = n⃗ · σ + d = f_ρ(g(σ)),

and thus K is an entropy estimator satisfying the conditions of the theorem.

C Concavity of Conditional Shannon Entropy
It is known that conditional Shannon entropy is concave. For completeness, we provide a brief proof of how this follows from the concavity of (unconditional) Shannon entropy. Let ν be a convex combination of ν_1 and ν_2, so that for all (c, z) ∈ C × Z we have ν(c, z) = λν_1(c, z) + (1 − λ)ν_2(c, z) for some λ ∈ [0, 1]. Writing H_ν(C|Z) = ∑_z ν(z) H(ν(·|z)), and noting that for each z with ν(z) > 0 the conditional distribution ν(·|z) is itself the convex combination (λν_1(z)/ν(z)) ν_1(·|z) + ((1 − λ)ν_2(z)/ν(z)) ν_2(·|z), the concavity of Shannon entropy gives H(ν(·|z)) ⩾ (λν_1(z)/ν(z)) H(ν_1(·|z)) + ((1 − λ)ν_2(z)/ν(z)) H(ν_2(·|z)). Multiplying both sides by ν(z) and summing over z yields H_ν(C|Z) ⩾ λ H_{ν_1}(C|Z) + (1 − λ) H_{ν_2}(C|Z), as required.

∑_x µ_λ(x) = 1, making µ_λ a distribution. It is easy to verify that for λ′ = ϵ/(p|X| + ϵ − 1) the above function adds up to unity when summed over x ∈ X. We just need to ensure that ϵ/(p|X| + ϵ − 1) ∈ [0, 1] holds. To see this, note that p|X| ⩾ 1, so we have p|X| + ϵ − 1 ⩾ ϵ, and since ϵ > 0 the quotient must indeed lie in [0, 1]. Finally, µ_{λ′}(X) satisfies the bounds in the Lemma: since f(x) ⩽ p for all x ∈ X, for any λ ∈ [0, 1] we have the bounds of (76) for all x ∈ X, and the middle term there is µ_{λ′}(X) for λ = λ′.

E Inequalities relating smooth average conditional min-entropy and smooth worst-case conditional min-entropy

Here we state and prove a known inequality relating two notions of smooth conditional min-entropy. We present this result without structuring the random variables as stochastic sequences; i.e., instead of considering distributions of C, Z, E we consider distributions of X, Y. The result and its proof can be adapted to the more general case involving sequences of random variables. A stricter definition of smooth conditional min-entropy than the one stated above is the ϵ-smooth "worst-case" conditional min-entropy, introduced in [RW05]. It reads as follows:

H^{wst,ϵ}_{∞,µ}(X|Y) = max_{σ ∈ B_ϵ(µ)} (− log₂ max_{x∈X, y∈Y} σ(x|y)).
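As a quick numerical sanity check of the concavity statement (an illustration only; the helper names `cond_entropy` and `random_joint` are ours), one can verify the inequality on randomly drawn joint distributions:

```python
import math, random

def cond_entropy(p):
    # H(C|Z) for a joint distribution p[(c, z)]
    pz = {}
    for (c, z), v in p.items():
        pz[z] = pz.get(z, 0.0) + v
    h = 0.0
    for (c, z), v in p.items():
        if v > 0:
            h -= v * math.log2(v / pz[z])
    return h

def random_joint(nc, nz, rng):
    # A random joint distribution on {0..nc-1} x {0..nz-1}
    w = [rng.random() for _ in range(nc * nz)]
    s = sum(w)
    return {(c, z): w[c * nz + z] / s for c in range(nc) for z in range(nz)}

rng = random.Random(0)
for _ in range(200):
    nu1, nu2 = random_joint(3, 2, rng), random_joint(3, 2, rng)
    lam = rng.random()
    mix = {k: lam * nu1[k] + (1 - lam) * nu2[k] for k in nu1}
    # concavity: H_mix(C|Z) >= lam * H_nu1(C|Z) + (1 - lam) * H_nu2(C|Z)
    assert cond_entropy(mix) >= (lam * cond_entropy(nu1)
                                 + (1 - lam) * cond_entropy(nu2)) - 1e-12
```

The small tolerance only guards against floating-point rounding; the inequality itself is exact.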
For purposes of randomness extraction, or scenarios involving the predictability of an adversary, the smooth average conditional min-entropy suffices. One can show that the average-case and worst-case notions are equivalent up to an additive factor [DRS04]. This is formalised in Proposition 2.
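To make the comparison concrete, here is a small numerical illustration at ϵ = 0 (no smoothing), checking that the worst-case quantity never exceeds the average-case one; the helper names are ours, and the average-case formula −log₂ ∑_y max_x µ(x, y) is the standard unsmoothed definition from [DRS04].

```python
import math, random

def avg_min_entropy(p):
    # H_avg(X|Y) = -log2( sum_y p(y) * max_x p(x|y) ) = -log2( sum_y max_x p(x, y) )
    ys = {y for _, y in p}
    return -math.log2(sum(max(v for (x, yy), v in p.items() if yy == y) for y in ys))

def worst_min_entropy(p):
    # H_wst(X|Y) = -log2( max_{x,y} p(x|y) )
    py = {}
    for (x, y), v in p.items():
        py[y] = py.get(y, 0.0) + v
    return -math.log2(max(v / py[y] for (x, y), v in p.items() if py[y] > 0))

rng = random.Random(1)
for _ in range(100):
    w = [rng.random() for _ in range(6)]
    s = sum(w)
    p = {(x, y): w[x * 2 + y] / s for x in range(3) for y in range(2)}
    # the worst-case conditional min-entropy never exceeds the average-case one
    assert worst_min_entropy(p) <= avg_min_entropy(p) + 1e-12
```

This reflects the unsmoothed core of the relation: max_{x,y} p(x|y) dominates the p(y)-weighted average of max_x p(x|y), so the worst-case entropy is the smaller of the two.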
Proof. For a fixed value of n and ϵ, the optimisation problem in (8) is equivalent to the following:

Maximise: E_ρ[log₂(F(ABXY))]
Subject to: E_µ[F(ABXY) µ(AB|XY)^β] ⩽ 1 for every extremal point µ of Π_NS,    (83)

where the constraints range over the extremal points of Π_NS as given in (33) and (34), namely the local deterministic distributions µ^j_LD and the PR boxes µ^i_PR. We show that for β ⩾ log₂(4/3), the above constraints are equivalent to the constraints

E_{µ^j_LD}[F(ABXY) µ^j_LD(AB|XY)] ⩽ 1 for every j,    (84)

in which β does not appear. It is immediate that the constraints of (83) imply (84): since µ(AB|XY) is always zero or one for local deterministic distributions, in this case we have µ(AB|XY)^β = µ(AB|XY), and thus for each choice of j the constraint E_{µ^j_LD}[F(ABXY) µ^j_LD(AB|XY)^β] ⩽ 1 implies its non-β counterpart E_{µ^j_LD}[F(ABXY) µ^j_LD(AB|XY)] ⩽ 1 in (84). Now we demonstrate the reverse implication. First, the argument just given also works in the opposite direction, showing that the non-β constraints of (84) imply the corresponding constraints (with β) in (83). We thus need only show that the constraints E_{µ^i_PR}[· · ·] ⩽ 1 in (83) are implied as well. We give a specific argument for the PR box given in Table 1; symmetric arguments apply to the other PR boxes. Since any distribution µ(ABXY) is the behaviour µ(AB|XY) times a fixed settings distribution σ_s(XY), we can absorb the settings distribution into F by writing F′(abxy) := F(abxy)σ_s(xy) for all choices of (a, b, x, y) when the expectation functional E[·] is written out in full. Summing the eight constraints of (84) corresponding to the eight local deterministic distributions appearing in Table 1 (a set we denote LD₁) then implies

∑_{a,b,x,y} F′(abxy) ∑_{µ ∈ LD₁} µ(ab|xy) ⩽ 8.

The inner sum above is always 3 or 1 (this corresponds to the number of 1s appearing in each column of Table 1, with the result given in Table 8). Writing M for the sum of F′(abxy) over those (a, b, x, y) at which the inner sum is 3 (the points in the support of µ_PR,1) and N for the sum over those at which it is 1, the displayed inequality reads 3M + N ⩽ 8. Since M and N are both non-negative, we can drop N to find that 3M + N ⩽ 8 implies M ⩽ 8/3 = 2^{1+log₂(4/3)}, which in turn implies M ⩽ 2^{1+β} whenever β ⩾ log₂(4/3).
Since E_{µ_PR,1}[F(ABXY) µ_PR,1(AB|XY)^β] is equal to M(1/2)^{1+β} (see Table 7), the bound M ⩽ 2^{1+β} gives the constraint E_{µ_PR,1}[· · ·] ⩽ 1, as required.
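The closing arithmetic of this proof can be checked numerically (an illustration only; the helper name `pr_constraint_value` is ours):

```python
import math

# Sanity check of the final step: 3M + N <= 8 with M, N >= 0 forces
# M <= 8/3, and 8/3 = 2^(1 + log2(4/3)), so M <= 2^(1+beta) whenever
# beta >= log2(4/3); the PR-box constraint then evaluates to at most 1.
beta_threshold = math.log2(4 / 3)
assert abs(8 / 3 - 2 ** (1 + beta_threshold)) < 1e-12

def pr_constraint_value(M, beta):
    # E_{mu_PR,1}[F(ABXY) mu_PR,1(AB|XY)^beta] = M * (1/2)^(1+beta)
    return M * 0.5 ** (1 + beta)

# The worst case permitted by 3M + N <= 8 is M = 8/3 (N = 0):
for beta in [beta_threshold, 0.5, 1.0, 2.0]:
    assert pr_constraint_value(8 / 3, beta) <= 1 + 1e-12
```

At β = log₂(4/3) the constraint is saturated exactly (M(1/2)^{1+β} = 1), confirming that this is the threshold below which the PR-box constraints are no longer implied by the local deterministic ones.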