Bounding the Plausibility of Physical Theories in a Device-Independent Setting via Hypothesis Testing

The device-independent approach to physics is one where conclusions about physical systems (and hence of Nature) are drawn directly and solely from the observed correlations between measurement outcomes. This operational approach to physics arose as a byproduct of Bell’s seminal work to distinguish, via a Bell test, quantum correlations from the set of correlations allowed by local-hidden-variable theories. In practice, since one can only perform a finite number of experimental trials, deciding whether an empirical observation is compatible with some class of physical theories will have to be carried out via the task of hypothesis testing. In this paper, we show that the prediction-based-ratio method—initially developed for performing a hypothesis test of local-hidden-variable theories—can equally well be applied to test many other classes of physical theories, such as those constrained only by the nonsignaling principle, and those that are constrained to produce any of the outer approximation to the quantum set of correlations due to Navascués-Pironio-Acín. We numerically simulate Bell tests using hypothetical nonlocal sources of correlations to illustrate the applicability of the method in both the independent and identically distributed (i.i.d.) scenario and the non-i.i.d. scenario. As a further application, we demonstrate how this method allows us to unveil an apparent violation of the nonsignaling conditions in certain experimental data collected in a Bell test. This, in turn, highlights the importance of the randomization of measurement settings, as well as a consistency check of the nonsignaling conditions in a Bell test.


Preliminaries
For a complete description of the prediction-based-ratio method and a comparison of its strength against the martingale-based method [44], we refer the reader to Ref. [43]. Here, we merely recall the necessary ingredients of the prediction-based-ratio method and show how it can be used to achieve the purpose of bounding the plausibility of physical theories based on the data collected in a Bell test, with minimal assumptions. Making this possibility evident and demonstrating how well it works in practice are the main contributions of the present work.
For simplicity, the following discussions are based on a Bell test that involves two parties (Alice and Bob) who are each allowed to perform one of two measurements randomly selected at each trial, each produces one of two possible outcomes. Generalization to other Bell scenarios will be evident. To this end, let us denote the measurement choice (input) of Alice (Bob) by x (y) and the corresponding measurement outcome (output) by a (b), where a, b, x, y ∈ {0, 1}. The extent to which the distant measurement outcomes are correlated is then succinctly summarized by the collection of joint conditional probability distributions P = {P(a, b|x, y)} a,b,x,y .
In an LHV theory, the outcome probability distributions can be produced with the help of some LHV λ (distributed according to q λ ) via the local response functions satisfying 0 ≤ P A λ (a|x), P B λ (b|y) ≤ 1 and ∑ a P A λ (a|x) = ∑ b P B λ (b|y) = 1 such that [2]: Hereafter, we refer to any P that can be decomposed in the above manner as a (Bell-) local correlation and denote the set of such correlations as L.
In contrast, if Alice and Bob conduct the experiment by performing local measurements on some shared quantum state ρ, quantum theory predicts setting-dependent outcome distributions for all a, b, x, y of the form: P(a, b|x, y) = tr(ρ M A a|x ⊗ M B b|y ), (2) where M A a|x and M B b|y denote, respectively, the local positive-operator-value-measure element associated with the a-th outcome of Alice's x-th measurement and the b-th outcome of Bob's y-th measurement. Accordingly, we refer to any P that can be written in the form of Equation (2) as a quantum correlation and the set of such correlations as Q.
Importantly, both local and quantum correlations satisfy the nonsignaling conditions [29]: P A (a|x, y) = P A (a|x, y ) := P A (a|x) ∀ a, x, y, y , where P A (a|x, y) := ∑ b P(a, b|x, y) and P B (b|x, y) := ∑ a P(a, b|x, y) are marginal probability distributions of P(a, b|x, y). Should (any of) these conditions be violated in a way that is independent of spatial separation, Alice and Bob would be able to communicate faster-than-light [28] via the choice of measurement x, y. We shall denote the set of P satisfying Equation (3) as N S. It is known that L, Q, and N S are convex sets and that they satisfy the strict inclusion relations L ⊂ Q ⊂ N S (see, e.g., Ref. [19] and references therein).
A few other convex sets of correlations are worth mentioning for the purpose of subsequent discussions. To this end, note that the problem of deciding if a given P is in Q is generally a difficult problem. However, the characterization of Q can, in principle, be achieved by solving a converging hierarchy of semidefinite programs [58] due to Nacascués, Pironio, and Acín (NPA) [59,60] (see also Ref. [61,62]). The lowest level outer approximation of Q in this hierarchy, often denoted by Q 1 ⊃ Q, happens to be exactly the set of correlations that is characterized by the physical principle of macroscopic locality [51]. A finer outer approximation of Q corresponding to the lowest-level hierarchy of Ref. [62], which we denote byQ, is known in the literature as the almost-quantum set [50], as it appears to satisfy all the physical principles that have been proposed to characterize Q. In Section 3, we useQ and N S as examples to illustrate how the prediction-based-ratio method can be adapted to test physical theories that are constrained to produce correlations from these sets.

Finite Statistics and the Prediction-Based-Ratio Method
Coming back to an actual Bell test, let N total be the total number of experimental trials carried out during the course of the experiment. During each experimental trial, x and y are to be chosen randomly according to some fixed probability distribution P xy (This distribution may be varied from one trial to another but for simplicity of discussion, we consider in this work only the case where this is fixed once and for all before the experiment begins). From the data collected in a Bell test, a naïve (but very commonly-adopted) way to estimate the correlation P between measurement outcomes is to compute the relative frequencies f that each combination of outcomes (a, b) occurs given the choice of measurement (x, y), i.e., where N a,b,x,y is the total number of trials the events corresponding to (a, b, x, y) are registered and N x,y = ∑ a,b N a,b,x,y is the number of times the particular combination of measurement settings (x, y) is chosen. By definition, N total = ∑ x,y N x,y . If the experimental trials are independent and identically distributed (i.i.d.) corresponding to a fixed state ρ with fixed measurement strategies {M A a|x } a,x , {M B b|y } b,y , then in the asymptotic limit, lim N total →∞ f (a, b|x, y) = P(a, b|x, y) where P here would satisfy Equation (2). In this limit, the amount of statistical evidence in the data against a particular hypothesis H can be quantified by the Kullback-Leibler (KL) divergence [63] (also known as the relative entropy) from P to L, see Refs. [64,65] for a detailed explanation with quantum experiments. We remark that the KL divergence is directly related with the Fisher information metric and so it measures the distinguishability of a distribution from its neighborhood. This provides a motivation for using the KL divergence as a measure of statistical evidence.
In the (original) prediction-based-ratio method of Ref. [43] (see also Ref. [66]), the hypothesis of interest is that the experimental data can be produced using an LHV theory, in other words, that the underlying correlation P ∈ L. For convenience, we shall refer to this hypothesis as L. In this case, given f and P xy , the relevant KL divergence from f to L reads as D KL f ||L := min P∈L ∑ a,b,x,y P xy f (a, b|x, y) log f (a, b|x, y) P(a, b|x, y) As the objective function in Equation (5) is strictly convex in P and the feasible set L is convex, the minimizer of the above optimization problem-which we shall denote by P L, * KL -is unique (see, e.g., Ref. [27]). It follows from the results presented in Ref. [43] that this unique minimizer P L, * KL can be used to construct a Bell inequality: ∑ a,b,x,y R(a, b, x, y)P xy P(a, b|x, y) where the non-negative coefficients of the Bell inequality are defined via the ratios R(a, b, x, y) := f (a, b|x, y) This Bell inequality is the key ingredient of the prediction-based-ratio method and is ideally suited for performing a hypothesis test of L.
To understand the method, we introduce the random variables X and Y to denote the random inputs and the variables A and B to denote the random outputs of Alice and Bob at a trial. The ability to select measurement settings randomly, in particular, is an indispensable prerequisite of the prediction-based-ratio method, or more generally, a proper Bell test (see, e.g., Ref. [20]). We further denote the possible values of inputs and outputs by the respective lower-case letters. Then we can think of the ratio R in Equation (6) as a non-negative function of the inputs X, Y and outputs A, B at each experimental trial such that its expectation according to an arbitrary P ∈ L with the fixed input distribution P xy satisfies Equation (7) is an alternative way of expressing the Bell inequality of Equation (6). A real experiment necessarily involves only a finite number N total = (N est + N test ) of experimental trials in time order. Here, we have split the experimental data into two sets: the data from the first N est trials as the training data and the data from the remaining N test trials as the hypothesis-testing data. In practice, we first construct the function R using the training data and then perform a hypothesis test with the test data. Since the ratio R is determined before the hypothesis test based on the prediction according to the training data, R is called a prediction-based ratio.
Given a prediction-based ratio and a finite number N test of test data, we can quantify the evidence against the hypothesis L by a p-value. For concreteness, suppose that the actual measurements chosen at the i-th test trial are x i , y i and the corresponding measurement outcomes observed are a i , b i . Then the value of the prediction-based ratio at the i-th test trial is R(a i , b i , x i , y i ), abbreviated as r i . We introduce a test static T as the product of the possible values of the prediction-based ratio at all test trials, so the observed value of the test statistic is t = ∏ N test i=1 r i . If we denote by N a,b,x,y the total number of counts registered for the input-output combination (a, b, x, y) in the test data, then t can be expressed also as According to Ref. [43], the p-value, which is defined as the maximum probability according to the hypothesis L of obtaining a value of T at least as high as t actually observed in the experiment, is bounded by The smaller the p-value, the stronger the evidence against the hypothesis L is, in other words, the less plausible LHV theories are. It is worth noting that the p-value bound computed in this manner remains valid even if the experimental trials are not i.i.d., while when the experimental trials are i.i.d., the p-value bound is asymptotically optimal (or tight) [43].

Generalization for Hypothesis Testing Beyond LHV Theories
The following two simple observations, which allow one to apply the prediction-based-ratio method to test physical theories beyond those described by LHV, are where our novel contribution enters. Firstly, we make the observation that in the above arguments leading to the p-value bound of Equation (9), the actual hypothesis L only enters at Equation (6) via the set of correlations L compatible with the hypothesis L. In particular, if we are to consider the hypothesis H that the data observed is produced by a physical theory H (e.g., a nonsignaling theory), then we merely have to replace L by the (convex) set of correlations H (e.g., N S) associated with H in the optimization problem of Equation (5). The method then allows us to bound the plausibility of the hypothesis H via the p-value bound in Equation (9) with the possible values of the prediction-based ratio given by where P H, * KL is the unique minimizer of the optimization problem: Although Equation (8), Equation (9) and Equation (10) together provide us, in principle, a recipe to test the plausibility of a general physical theory H, its implementation depends on the nature of the set of correlations associated with the hypothesis. Indeed, a crucial part of the procedure is to solve the optimization problem of Equation (11) for the convex set of correlations H compatible with H, which is generally far from trivial. If H is a convex polytope, such as L and N S, or the set of correlations associated with the models considered in Refs. [67,68]), it is known [43] that Equation (11) can indeed be solved numerically.
Our second observation is that for the convex sets of correlations that are amenable to a semidefinite programming characterization, such as those considered in Refs. [59,62,69,70], Equation (11) is an instance of a conic program [58] that can be efficiently solved using a freely available solver, such as PENLAB [71]. To see this, one first notes that, apart from the constant factor P xy , the optimization of Equation (11) is essentially the same as that considered in Ref. [27]. A straightforward adaptation of the argument presented in Appendix D 2 of Ref. [27] would then allow us to complete the aforementioned observation. The data observed in a Bell test can thus be used to test not only L, but also N and even the hypothesis Q that the observation is compatible with Born's rule, cf. Equation (2), via outer approximations of Q (such as Q 1 andQ).
A remark is now in order. In order to avoid so-called p-value hacking, it is essential that the test data used in the computation of the test statistic T is not used to determine f , and hence the values of the prediction-based ratio R in Equation (10). In this work, for simplicity we use the first N est trials of an experiment as the training data for estimating f and further constructing a prediction-based ratio R that is applied for all test trials. In principle, we can use different training data for different test trials. For example, we can define the training data for a test trial as the data from all trials performed before this test trial, and then we can adapt the construction of the prediction-based ratio for each individual test trial. We refer to Ref. [43] for more details on the adaptability of the prediction-based ratio.

Results
To illustrate how well the prediction-based-ratio method works in identifying data that are not even explicable by some nonlocal physical theories, such as quantum theory, we now consider a few examples of applications of the method. As above, we restrict our attention to a bipartite Bell test, where each party performs two binary-outcome measurements randomly selected at each trial. Throughout this section, we assume that the input distribution is uniform, specifically P xy = 1 4 for all combinations of x, y ∈ {0, 1}. In Sections 3.2 and 3.3 we study the behaviour of numerically simulated Bell tests based on hypothetical sources of correlations described in Section 3.1, while in Section 3.4, we analyze the real experimental data reported in Ref. [72].

Modeling a Bell Test
For our numerical simulations, we consider a P that resembles a nonlocal source targeted at in various actual Bell experiments [35][36][37]72,73]: where v ∈ [0, 1], P PR is the Popescu-Rohrlich (PR) correlation [28] P PR (a, b|x, y) = 1 2 δ a⊕b,xy with a, b, x, y ∈ {0, 1}, and P I (a, b|x, y) = 1 4 for all a, b, x, y is the white-noise distribution. In Equation (12), the real parameter v can be seen as the weight associated with P PR in the convex mixture. Importantly, the nonlocal source represented by such a mixture can (in principle) be produced by performing appropriate local measurements on a maximally entangled two-qubit state if and only if v ≤ v c := 1 √ 2 ≈ 0.71 (see, e.g., Refs. [27,57]). In particular, when v = v c -corresponding to an ideal nonlocal source-the mixture gives rise to the maximal quantum violation of the CHSH [45] Bell inequality.
To mimic an experimental scenario with noise (something unavoidable in practice), we shall introduce a slight perturbation to the ideal source P(v) of Equation (12). Specifically, we require the measurement outcomes observed at each trial in the simulated Bell test to be governed by the nonlocal source (1 − ) P(v) + P noise , where 1 is the weight associated with the noise term P noise . Moreover, for the purpose of illustrating the effectiveness of the method in identifying non-quantum-compatible data, we set v > v c . In our simulations, we set = 0.01 and v = 0.72 > v c . However, as long as the given mixture lies outsideQ (and hence also outside Q), the actual choices of 1 and v ∈ (v c , 1] are irrelevant. The only impact that these choices may have is the number of trials N total needed to falsify the hypothesis "The observed data is compatible with a physical theory that is constrained to produce only the almost-quantum set of correlations." with the same level of confidence. Inspired by the experiments of Ref. [72] where N total = 10 5 ∼10 6 , we set in our simulations N total = 10 6 . Note also that instead ofQ, we can equally well choose another set of correlations that admits a semidefinite programming characterization, such as those described in Refs. [59,62]. Since we are interested to model a nonlocal source that obeys the nonsignaling conditions of Equation (3), there is no loss in generality by considering P noise ∈ N S. To this end, let P Ext j be the j-th extreme point of the nonsignaling polytope [29], then we may write P noise = ∑ j p j P Ext j where p j is the weight associated with P Ext j in the convex decomposition of P noise . We may thus write the nonlocal source of interest as: Finally, to simulate the raw data {(a i , b i , x i , y i )} N i=1 obtained in an N-trial Bell test for any given input distribution P xy and correlation P, we make use of the MATLAB toolbox Lightspeed developed by Minka [74].

Simulations of Bell Tests with an i.i.d. Nonlocal Source
Let us begin with the case of i.i.d. trials, corresponding to a source of correlation that remains unchanged throughout the experiment, and where the inputs at each trial are independent of the inputs of the previous trials. To this end, we first sample the weights {p j } j uniformly from the interval [0, 1] and renormalize them such that ∑ j p j = 1. With our choice of v = 0.72 and = 0.01, it is easy to find such a randomly generated correlation P(v, , {p j }) that lies outsideQ. (Verifying that any given P is (not) iñ Q can be carried out by solving a semidefinite program. Specifically, for any given correlation P, if the maximal white-noise visibility ν such that ν P + (1 − ν) P I ∈Q is smaller than 1, then P ∈Q ⊃ Q, and hence outside Q, otherwise P ∈Q.) For convenience, we denote by P the specific set of {p j } j employed in our simulation of 500 Bell tests, each with N total = 10 6 trials. In Figure 1, we summarize the steps involved in our analysis of the numerically simulated data using the prediction-based-ratio method. The resulting p-value upper bounds are summarized in Table 1.
of a single Bell test. In the first step, we separate the data into two sets, with the data collected from the first N est trials serving as the training data while the rest is used for the actual hypothesis testing. Specifically, the training data is used to compute the relative frequencies f and to minimize the KL divergence D KL ( f ||H) with respect to the set of correlations H ∈ {N S,Q} associated, respectively, with the hypothesis of N andQ. The correlation P H, * KL ∈ H that minimizes D KL ( f ||H) gives rise to a Bell-like inequality with coefficients {R(A = a, B = b, X = x, Y = y)} x,y,a,b . The remaining data is then used to compute t = ∏ i>N est r i where r i := R(a i , b i , x i , y i ). Finally, a p-value bound according to the hypothesis is obtained by computing min{ 1 t , 1}.
As expected, despite statistical fluctuations, the data does not suggest any obvious evidence against the nonsignaling hypothesis. In fact, among the 500 p-value bounds obtained, 97% of them are trivial (i.e., equal to unity), while the smallest non-trivial p-value bound obtained is approximately 0.14. On the contrary, for the hypothesis test of the almost-quantum set of correlations, more than half of the simulated Bell tests give a p-value upper bound that is less than 10 −10 . Although there are also 5.8% of these simulated Bell tests that give a trivial p-value bound according to the almost-quantum hypothesis, we see that the method generally works very well in falsifying this hypothesis. In fact, a separate calculation (not shown in the table) shows that when we increase N total to 10 7 , all the 500 p-value upper bounds obtained according to the almost-quantum hypothesis are less than or equal to 10 −10 . Table 1. Summary of frequency distributions of the p-value upper bounds obtained from 500 numerically simulated Bell tests, each consists of N est = 10 6 trials and assumes the same i.i.d. nonlocal source P(v, , {p j }) of Equation (13) that lies outsideQ. The second and third row give, respectively, the frequency distributions according to the hypothesis associated with N S (nonsignaling) andQ (almost-quantum). For these hypotheses, the smallest p-value upper bound found among these 500 Bell tests are, respectively, 0.14 and 5.7 × 10 −20 . The second to the fifth column give, respectively, the fraction of simulated Bell tests having a p-value upper bound (for each hypothesis) that satisfies the given (increasing) threshold (e.g., 10 −10 for the second column). Similarly, in the last column, we give the fraction of instances where the p-value upper bound obtained is trivial, i.e., exactly equals to 1. The smaller the p-value upper bound, the less likely it is that a physical theory associated with the hypothesis produces the observed data. Thus, the larger the value in the second (to the fourth) column, the less likely it is that the assumed physical theory holds true. In contrast, the larger the value in the rightmost column, the weaker the empirical evidence against the assumed theory is.

Simulations of Bell tests with a non-i.i.d. Nonlocal Source
In a real experiment, the assumption that the experimental trials are i.i.d is often far from justifiable, as that would require, for example, that the experimental setup remain as it is over the entire course of the experiment. As a result, we also consider here the case where the source that generates the data actually varies from one trial to another. To this end, for the i-th trial of the Bell test, we simulate according to the conditional outcome distributions: where n i = 1, 2, . . . , 24 labels the single nonsignaling extreme point used to mix with P(v) at this trial, cf. Equation (13) with p j = 1 if j = n i but vanishes otherwise. Moreover, to facilitate a comparison with the i.i.d. case, before the i-th trial, we randomly pick n i according to the probability P(n i = j) = p j where p j ∈ P is exactly the probability employed in the simulation of Section 3.2. With this choice, the outcome distributions governed by the nonlocal source of Equation (14) (for the i-th trial) averages to that of Equation (13) when the number of trials N total → ∞. Again, we follow the steps summarized in Figure 1 to compute the relevant p-value upper bounds using the prediction-based-ratio method. The resulting p-value upper bounds are summarized in Table 2. Table 2. Summary of frequency distributions of the p-value upper bounds obtained from 500 numerically simulated Bell tests. Each of these Bell tests involves N est = 10 6 trials and each trial assumes a varying source P i (v, , n i ) of Equation (14). For the hypothesis of N andQ, associated with N S (second row) andQ (third row), respectively, the smallest p-value upper bound found among these 500 instances are 0.21 and 1.3 × 10 −15 . The significance of each column follows that described in the caption of Table 1.
As with the i.i.d. case, for these 500 simulated Bell tests, our application of the prediction-based-ratio method does not lead to any obvious evidence against the nonsignaling hypothesis N. However, for the hypothesis associated with the almost-quantum setQ, our results (last row of Table 2) give more than half of the p-value upper bounds that are less than 10 −4 (accordingly, 17% if we set the cutoff at 10 −10 ). Although there are 24% of these instances where the returned p-value upper bound for the same hypothesis is trivial, we see that, as with the i.i.d. case, the method remains very effective in showing that the observed data cannot be entirely accounted for using a theory that is constrained to produce only almost-quantum correlations. In addition, as with the i.i.d. case, our separate calculation shows that the effectiveness of this method can be substantially improved when we increase N total to 10 7 : all the 500 p-value upper bounds obtained according to the almost-quantum hypothesis become less than or equal to 10 −10 .

Application to Some Real Experimental Data
Armed with the experience gained in the above analyses, let us now analyze the experimental results presented in Figure 3 of Ref. [72] using the prediction-based-ratio method. One of the goals of Ref. [72] was to experimentally approach the boundary of the quantum set of correlations in the two-dimensional subspace spanned by the two Bell parameters: where E xy := ∑ 1 a,b=0 (−1) a+b P(a, b|x, y) is the correlator. To this end, the Bell parameter S CHSH cos θ + S CHSH sin θ for 180 uniformly-spaced values of θ ∈ {θ 1 , θ 2 , . . . , θ 180 } ⊂ [0, 2π) were estimated by performing the measurements presented in Appendix A of Ref. [72] on a two-qubit maximally entangled state.
Unfortunately, only the total counts for each combination of input-output N a,b,x,y (rather than the time sequences of raw data) given the value of θ are available [75]. Therefore, in analogy with the analyses presented above, we use the relative frequencies obtained for θ k as the training data to derive a prediction-based ratio (which corresponds to a Bell-like inequality) for the hypothesis test using the data associated with θ k+1 (for the case of k = 180, the hypothesis test uses the data associated with θ 1 ). The analysis therefore essentially follows the steps outlined in Figure 1, but with the computation of t carried out using Equation (8) instead, since we do not have the time sequences of raw data. Moreover, to apply the prediction-based-ratio method, we assume, as with the numerical experiments reported earlier that the input distributions are uniform, i.e., P xy = 1 4 for all combinations of x, y ∈ {0, 1}. A summary of the p-value upper bounds obtained from these 180 Bell tests is given in Table 3.
For both hypotheses, approximately half of the p-value upper bounds obtained are trivial. At the same time, about the same fraction of the p-value bounds obtained are less than 10 −2 (with the majority of them being less than 10 −4 ). In fact, the smallest of the p-value upper bounds are remarkably small: 3.2 × 10 −55 for the hypothesis of nonsignaling N and 2.7 × 10 −55 for the hypothesis of almost-quantumQ. These results strongly suggest that under the assumption that the measurement settings were randomly chosen according to a uniform input distribution, it is extremely unlikely that a physical theory associated with each of these hypotheses can produce the observed relative frequencies.
These conclusions that the observed data are incompatible with the fundamental principle of nonsignaling or with quantum theory (via the almost-quantum hypothesis), however, turn out to be flawed, as it was brought to our attention [75] that during the course of the experiment, the measurement bases were not at all randomized-the measurements were carried out in blocks using the same combination of (x, y) before moving to another. Why should this pose a problem? In the extreme scenario, if the measurement settings were fully correlated to some local hidden variable, it is known that the the resulting correlation between measurement outcomes can violate the nonsignaling conditions of Equation (3), see, e.g., Ref. [76]. Consequently, it is not surprising that in the prediction-based-ratio method (as well as any other methods employed for the statistical analysis of a Bell test), the measurement inputs (x i , y i ) during the i-th trial, as discussed in Section 2, ought to be randomly chosen. Table 3. Summary of frequency distributions of the p-value upper bounds obtained from the 180 Bell tests of Ref. [72] according to the hypothesis of N andQ (associated, respectively, with N S, the second row, andQ, the third row) under the assumption that the measurement settings were randomly chosen according to a uniform distribution. The significance of each column follows that described in the caption of Table 1.

Discussion
As discussed in the last section, the conclusion that "the experimental data of Ref. [72] show a violation of the nonsignaling principle" based on an erroneous application of the prediction-based-ratio method is unfounded. The results are nonetheless thought-provoking. For example, suppose for now that we had access to the raw data for all trials. Since the analysis was flawed because of the nonrandomnization of measurement settings, one can imagine that-under the assumption that the trials are exchangeable-we first artificially randomize the hypothesis-testing trials to simulate the randomization of measurement settings in the experiment. Should we then expect to obtain p-value bounds with fundamentally different features? The answer is negative. The reason is that in our crude application of the method, only the number of counts N a,b,x,y for each input-output combination matters, see Equation (8). In particular, the actual trials in which a particular combination of (a, b, x, y) appears are irrelevant in such an analysis.
So, if one holds the view that the nonsignaling principle cannot be flawed, then one must come to the conclusion that "should the measurement choices be randomized, it would be impossible to register the same number of counts N a,b,x,y for each input-output combination". A plausible cause for this is that the experimental setup suffered from some systematic drift during the course of the experiment, which is exactly a manifestation that the experimental trials are not i.i.d. It might then appear that a hypothesis test of the nonsignaling principle is hopeless in such a scenario. However, as mentioned above, the prediction-based-ratio method is applicable even for non-i.i.d. experimental trials. Indeed, as we illustrate in Section 3.3 (see, specifically Table 2), such fluctuations have not lead to any false positive in the sense of giving very small p-value upper bound according to the nonsignaling hypothesis.
More generally, as the above example of Section 3.4 illustrates, an unexpectedly small p-value upper bound according to the nonsignaling hypothesis may be a consequence that certain premises needed to perform a sensible Bell test are violated. In other words, an apparent violation as such does not necessarily pose a problem to any physical principle, such as the nonsignaling principle that is rooted in the theory of relativity. However, as nonlocal correlations also find applications in device-independent quantum information processing [18,19], it is important to carry out such consistency checks alongside the violation of a Bell inequality before one applies the estimated nonlocal correlation in any such protocols.
Of course, an unexpectedly small p-value upper bound according to the nonsignaling hypothesis could also be a consequence of mere statistical fluctuation. Indeed, our results in Sections 3.2 and 3.3 show that when a null hypothesis indeed holds true, it can still happen that one obtains a relatively small p-value upper bound (of the order of 10 −1 ) even after a large number of trials (N total = 10 6 ). However, as explained in Appendix 1 of Ref. [43], if a null hypothesis is correct, the probability of obtaining a p-value upper bound smaller than q with the prediction-based-ratio method is no larger than q. Indeed, in each of these instances, p-value upper bounds that are less than 10 −1 occur way less than 50 times among the 500 simulated experiments. In any case, this means that even though the prediction-based-ratio method already gets rids of the often unjustifiable i.i.d. assumption involved in such an analysis, the interpretation of the significance of a small p-value upper bound must still be carried out with care, as advised, for example, in Refs. [77][78][79].

Conclusion
In this work, we revisited the prediction-based-ratio method developed [43]-in the context of a Bell test-for performing hypothesis tests of LHV theories. We showed that with the two observations presented in Section 2.3, the method can equally well be applied to perform hypothesis tests of other physical theories, specifically those that are constrained to produce correlations amenable to a semidefinite programming characterization. Prime examples of such theories include those that obey the principle of nonsignaling [28], those that satisfy the principle of macroscopic locality [51], the so-called v-causal models [67], as well as physical theories that are constrained to produce the almost-quantum set [50] or any other outer approximations [59,62,69] of the quantum set of correlations.
To illustrate the effectiveness of the method, we first numerically simulated 500 Bell tests using a hypothetical source of correlations that lies somewhat outside the almost-quantum set of correlations. We then applied the method to obtain a p-value upper bound according to both the almost-quantum hypothesis and the nonsignaling hypothesis for the simulated data obtained in each of these Bell tests. In the majority (> 90%) of these 500 instances, the p-value upper bound according to the almost-quantum hypothesis is less than 10 −2 . Since a p-value upper bound quantifies the evidence against the assumed (almost-quantum) theory given the observed data, these results show that in most of these simulated Bell tests, the data is unlikely to be explicable by the assumed theory. In a similar manner, we numerically simulated another 500 Bell tests using a hypothetical source that varies from one trial to another. Again, the method remained very effective (giving a p-value upper bound that is less than 10 −2 for 69% of the instances) in identifying the incompatibility between the observed data and the assumed (almost-quantum) theory in such a non-i.i.d. scenario.
Finally, we applied the prediction-based-ratio method to the experimental data of Ref. [72]. To this end, we assumed that the measurement settings were randomly chosen with uniform distributions. An application of the method under this assumption again led to very small p-value upper bounds (10 −4 ) for more than 40% of the 180 Bell tests analyzed-not only for the almost-quantum hypothesis, but also for the nonsignaling hypothesis. Such a violation of the nonsignaling conditions, however, is apparent, as we learned after the analysis that the measurement settings were not randomized during the course of the experiments, thereby invalidating one of the basic assumptions needed in the application of the prediction-based-ratio method. Nonetheless, as we remarked in the Discussion section, the analysis nevertheless unveils that the possibility of using the prediction-based-ratio method to identify a situation where a certain premise is needed to perform a proper Bell test, such as the randomization of settings, is invalidated.
Note added: While preparing this manuscript, we became aware of the work of Smania et al. [80], which also discussed, among others, the implication of not randomizing the settings in a Bell test, and its relevance in quantitative applications.