Assessing, Testing and Estimating the Amount of Fine-Tuning by Means of Active Information

A general framework is introduced to estimate how much external information has been infused into a search algorithm, the so-called active information. This is rephrased as a test of fine-tuning, where tuning corresponds to the amount of pre-specified knowledge that the algorithm makes use of in order to reach a certain target. A function f quantifies specificity for each possible outcome x of a search, so that the target of the algorithm is a set of highly specified states, whereas fine-tuning occurs if it is much more likely for the algorithm to reach the target as intended than by chance. The distribution of a random outcome X of the algorithm involves a parameter θ that quantifies how much background information has been infused. A simple choice of this parameter is to use θf in order to exponentially tilt the distribution of the outcome of the search algorithm under the null distribution of no tuning, so that an exponential family of distributions is obtained. Such algorithms are obtained by iterating a Metropolis–Hastings type of Markov chain, which makes it possible to compute their active information both in and out of equilibrium of the Markov chain, with or without stopping when the targeted set of fine-tuned states has been reached. Other choices of tuning parameters θ are discussed as well. Nonparametric and parametric estimators of active information and tests of fine-tuning are developed when repeated and independent outcomes of the algorithm are available. The theory is illustrated with examples from cosmology, student learning, reinforcement learning, a Moran type model of population genetics, and evolutionary programming.


Introduction
When Gödel published his incompleteness theorems [1], there was a commotion in the mathematical world from which it has not yet recovered, nor fully assimilated the consequences [2]. Hilbert's program to base mathematics on a finite set of axioms had earlier been pursued by Alfred North Whitehead and Bertrand Russell [3]. But this approach turned out to be impossible when Gödel proved that no finite set of axioms in a formal system can prove all its true statements, including its own consistency. On a similar but smaller scale, when David Wolpert and William Macready published their No Free Lunch Theorems (NFLTs, [4,5]), there was disquiet in the community, because these results imply that there is no one-size-fits-all algorithm that does well in all searches [6], so that a "theory of everything" is not possible in machine learning. Wolpert and Macready concluded that it was necessary to incorporate "problem-specific knowledge into the behavior of the algorithm" [5]. Thus active information (actinfo) was introduced in order to measure the amount of information carried by such problem-specific knowledge [7,8]. More specifically, the NFLTs say that no search works better on average than a blind search, i.e., a search according to a uniform distribution. Accordingly, actinfo is defined as

I^+ = log( P(A) / P_0(A) ) = I_{f0} − I_f,    (1)

where A ⊂ Ω is the non-empty target of the search algorithm, a subset of the finite sample space Ω, and P_0 is a uniform probability measure (P_0(A) = |A|/|Ω|). P must be seen here as the probability measure induced by the problem-specific knowledge of the researcher, whereas P_0 is the underlying distribution assumed in the NFLTs. It corresponds to absence of problem-specific knowledge, in accordance with Bernoulli's Principle of Insufficient Reason (PoIR). An equivalent characterization of actinfo is the reduction of functional information between algorithms that do not and do make use of background knowledge. The name functional information was introduced by
Szostak and collaborators [9,10]. It refers to applications where A corresponds to all outcomes of an algorithm that are functional according to some criterion. Then I_{f0} and I_f are the self-information (measured in nats) of the event that an algorithm X produces a functional outcome, given that it was generated under P_0 and P, respectively. Suppose we do not know whether problem-specific knowledge has been used or not when the random search X ∈ Ω was generated. This corresponds to a hypothesis testing problem where data is generated from distributions P_0 and P under the null and alternative hypotheses H_0 and H_1, respectively. It follows from (1) that I^+ is the log likelihood ratio when testing H_0 against H_1, if data is censored so that only X ∈ A is known. When the sample space Ω is finite or a bounded subset of a Euclidean space, the PoIR can be motivated by the fact that the uniform distribution maximizes Shannon entropy, and thereby maximizes ignorance about the outcome of X. However, the uniform distribution is not a feasible choice of P_0 for unbounded sample spaces. For this reason actinfo has been generalized to deal with unbounded spaces [11], by choosing P_0 to maximize Shannon entropy under side constraints ξ, such as existence of various moments. This gives rise to a family of null distributions P_0 = P_{0ξ}, with ξ a nuisance parameter that has to be estimated or controlled for in order to estimate or bound the active information.
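To make definition (1) concrete: active information is simply the log ratio of the target probabilities under the informed and the blind search. The following minimal Python sketch (the numbers are illustrative, not taken from the paper) computes it:

```python
import math

def active_info(p_A: float, p0_A: float) -> float:
    """Active information I+ = log(P(A)/P0(A)), in nats, as in eq. (1)."""
    return math.log(p_A / p0_A)

# Blind search: uniform P0 over 1000 states, 10 of which belong to the target A.
p0_A = 10 / 1000
# Informed search that reaches the target half of the time.
p_A = 0.5
I_plus = active_info(p_A, p0_A)  # log(50) ≈ 3.91 nats of infused information
```

A search that does no better than blind search (p_A = p0_A) has zero actinfo, in line with the NFLT baseline.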
Actinfo has also been used for mode detection in unsupervised learning, among other applications [12,13]. Based on previous work by Montañez [14,15], actinfo has been used in the past for hypothesis testing [16]. More specifically, [16] regards P as a random measure, so that I^+ is random as well, and finds expressions for the tail probability of I^+.

Fine-tuning
Fine-tuning (FT) was introduced by Carter in physics and cosmology [17]. According to FT, the constants in the laws of nature and/or the boundary conditions in the standard models of physics must belong to intervals of low probability in order for life to exist. Since its inception, FT has generated a great deal of fascination, as seen in multiple popular science books (e.g., [18-21]) and scientific articles (e.g., [22-25]). For a given constant of nature X, the connection between FT and active information can be described in three steps: (i) Establish the life-permitting interval (LPI) A that allows the existence of life for the constant, with Ω = (0, ∞) = R_+ or Ω = R the range of values this constant could possibly take, including those that do not permit life. (ii) Determine the probability P_0(A) of the LPI. If P_0 = P_{0ξ} contains unknown parameters ξ, find an upper bound of P_0(A). (iii) Suppose H_1 corresponds to an agent who uses background knowledge of what is required for life to exist in order to bring about a constant of nature X that permits life with certainty (P(A) = 1). The active information I^+ = I_{f0} = −log P_0(A) is then a measure of how much background knowledge this agent infused. Following [26,27], we conclude that X is finely tuned when the lower bound −log P_0max(A) of I^+ = I_{f0} is large enough. That is, FT corresponds to infusing a high degree of background knowledge into a problem.
Fine-tuning has also been used in biology. Dingjan and Futerman explored it for cell membranes [28,29], whereas Thorvaldsen and Hössjer [30] formalized it for a large class of biological models. According to [30], a system is fine-tuned if it (a) has an independent specification, and (b) is very unlikely to occur by chance.

The present article
In this article actinfo will not only be used in the algorithmic sense. It will also be employed for testing the presence of, and estimating the degree of, fine-tuning (FT) of a search algorithm or agent that brings about X. To this end, we will introduce a specificity function f, which quantifies, in terms of f(x), how specified an outcome x ∈ Ω is. The target A, on the other hand, is the set of highly specified states, that is, all states with a degree of specificity that exceeds a given threshold. Then I^+ in (1) is a test statistic for testing whether an algorithm has a much larger probability of reaching the set of highly specified states or not, compared to a random search. This is a test of FT, since reaching the target corresponds to specificity (a), whereas reaching it with much higher probability than expected by chance corresponds to (b).
To calculate I^+, the distributions P_0 and P of the random search algorithm under H_0 and H_1, respectively, need to be defined. As mentioned above, the null distribution P_0 is typically chosen according to some criterion, such as maximization of entropy, possibly with extra constraints on moments for unbounded Ω, which was the strategy implemented in [26,27]. Another possibility is to choose P_0 as the equilibrium distribution of a Markov chain that models the dynamics of the system under the null hypothesis, for instance an evolutionary process with no external input. In general P_0 = P_{0ξt} involves a number of nuisance parameters ξ, and sometimes also the time point t at which an algorithm that does not make use of external information stops. The choice of P = P_{θξt} is problem specific, and it possibly involves the nuisance parameters ξ of the null distribution, the time point t when the algorithm stops, as well as tuning parameters θ that correspond to infusing background knowledge into the search problem. Therefore, in its most general form, the actinfo (1) is a function I^+ = I^+(θ, ξ, t) of the tuning parameters θ, the nuisance parameters ξ, and the time point t.
This general framework has many applications, based on different choices of f, A, P_0, and P. For some models, f is a binary function that quantifies functionality, so that A is the set of objects of a certain type (e.g., proteins, protein complexes, or cellular networks) that are functional among the set Ω of all such objects. Another possibility is to choose A as the set of populations whose (expected) fitness exceeds a given threshold. In this setting P_{0ξt}(A) corresponds to the probability that a randomly chosen object or population reaches the target A of high fitness at time t, given that no background knowledge of the specificity function f is used to generate X. The functional information I_{f0} = −log P_{0ξt}(A) corresponds to the amount of external information that an algorithm infuses, given that it brings about X so that A happens with certainty (P(A) = 1) within time t. In this case the object or population is finely tuned when I_{f0} is large enough. More generally, we say that an evolutionary algorithm that generates X ∼ P = P_{θξt} after t time steps is finely tuned when I^+(θ, ξ, t) is large enough. Typically, θ involves selection parameters that determine to which extent a population evolves towards higher fitness.
The unified treatment of search problems and FT in this paper is organized as follows: Section 2 introduces the specificity function f and the set A of highly specified states. Section 3 introduces a class of probability distributions P = P_θ for which the specificity function f is used to exponentially tilt the null distribution P_0, so that outcomes with high specificity are more likely to occur, with a scalar tuning parameter θ of P_θ that corresponds to the amount of exponential tilting. A proof is presented that it is possible to obtain a Metropolis–Hastings type Markov chain in discrete time t = 0, 1, 2, ..., whose outcome X = X_t at time t has the aforementioned exponentially tilted distribution under equilibrium, that is, when t is large. The corresponding actinfo I^+(θ, t) is shown to increase monotonically with t towards an equilibrium limit. The actinfo of a search algorithm X = X_{t∧T} that stops at time T, when the targeted set A of highly specified states has been reached, is shown to increase even more rapidly. Section 4 introduces various nonparametric and parametric estimators of actinfo, and corresponding tests of FT, when n repeated and independent outputs of the search algorithm are available. In particular, large deviations theory is used to prove that the significance levels of these tests, i.e. the probability of detecting FT under H_0, go to zero at an exponential rate when the sample size n increases. Section 5 presents a number of examples from cosmology, student learning, reinforcement learning, and population genetics that illustrate our approach. A discussion follows in Section 6, whereas proofs and further details about the models are presented in Section 7.

Specificity and target
Consider a function f : Ω → R, and assume that the objective of the search algorithm, or of the agent that brings about X, is to find regions in Ω where f is large. The rationale for this is an independent specification, where a more specified state x ∈ Ω corresponds to a larger f(x).
It is further assumed that the target set in (1) has the form

A = {x ∈ Ω : f(x) ≥ f(x_0)}    (5)

for some x_0 ∈ Ω. This implies that the purpose of the search algorithm or the agent is to bring about an X that is highly specified. We will refer to f as a specificity function of the agent or an objective function of the search algorithm. For instance, in cosmological FT, x is the value of a particular constant of nature and the specificity function equals

f(x) = 1_{x permits life},    (6)

where 1_{•} is the indicator function. That is, f has a binary range, with f(x) = 1 and 0 corresponding to whether x permits a universe with life or not. From this, A is the LPI of this constant if f(x_0) = 1. Moreover, X is the value of this constant of nature for a randomly generated universe, with a distribution that either incorporates external information (H_1) or not (H_0). In the context of proteins, x is taken to be an amino acid sequence, whereas f(x) in (6) quantifies whether the protein that the amino acid sequence corresponds to is functional (1) or not (0). For instance, X could be the outcome of a random evolutionary process, the goal of which is to generate a functioning protein, and this process either makes use of external information (H_1) or not (H_0). Several other applications are given in Section 5, including a more refined biological example, where x corresponds to a protein complex or a molecular machine.
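The target is thus a superlevel set of the specificity function. A small Python sketch (the state space and specificity function are toy choices, not from the paper) makes this concrete:

```python
def target_set(omega, f, x0):
    """Target A = {x in Omega : f(x) >= f(x0)} of highly specified states."""
    return {x for x in omega if f(x) >= f(x0)}

omega = range(10)
f = lambda x: x * (9 - x)      # toy specificity, maximized at x = 4 and x = 5
A = target_set(omega, f, 4)    # the two most specified states
```

Choosing a reference state x0 with larger f(x0) shrinks A, making the target harder to reach by chance.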

Interpretation of target
There are at least two ways of interpreting x_0, and hence also the target set A. According to the first interpretation, x_0 is the outcome of a random variable X′ ∈ Ω; that is, the outcome of a first search. Suppose X is another random variable that represents a second (possibly future) search, independent of X′. Then, if we condition on the outcome x_0 of the first search, the actinfo I^+ in (1) is the log likelihood ratio for the event that the second search variable X is at least as specified as the observed value f(x_0) of the first search.
There is no need to associate x_0 in (5) with a first search variable X′, though. Instead, some a priori information may be used to define which values of f represent a high amount of specificity. This gives rise to the second interpretation of x_0, according to which x_0 is used for defining outcomes with a high and low degree of specificity, using f_0 = f(x_0) as a cutoff. According to this interpretation, the set A in (5) and its complement represent a dichotomization of specificity, so that A and A^c consist of all states with high and low specificity, respectively. With this interpretation of x_0, I^+ is the log likelihood ratio for testing FT based on the search variable X. In particular, suppose that the specificity function f is bounded, i.e.

f_max = max_{x∈Ω} f(x) < ∞.    (7)

Then the most stringent definition of high specificity regards only outcomes with a maximal value of f as highly specified, so that

A = {x ∈ Ω : f(x) = f_max}.    (9)

Note that (6) is a special case of (9).

Active information for exponentially tilted systems
Throughout Section 3, ξ is assumed to be known and the null distribution does not involve any time index t. Therefore, P_0 is known, whereas P = P_{θt} involves the tuning parameters θ and the time index t. It will further be assumed in Sections 3.1-3.2 that the system is in equilibrium, so that the time index t can be dropped also under H_1 (P = P_θ).

Exponential tilting
Let P_θ be an exponentially tilted version of P_0, for some scalar tuning parameter θ > 0 that will also be called the tilting parameter. Exponential tilting is often used for rare events simulation [31,32]. Here f is used to define the tilted version of P_0 as

P_θ(x) = P_0(x) e^{θ f(x)} / M(θ),    (10)

with a normalizing constant

M(θ) = Σ_{x∈Ω} P_0(x) e^{θ f(x)}    (11)

assuring that P_θ is a probability measure. For finite sample spaces Ω, we interpret P_0(x) and P_θ(x) as probability masses, whereas for continuous sample spaces they are probability densities, and the sum in (11) is replaced by an integral. The larger the tilting parameter θ > 0, the more the probability mass of P_θ concentrates on regions of large f. In particular, P_∞, the weak limit of P_θ as θ → ∞, is supported on (9) whenever (7) holds.
The parametric family of distributions

P = {P_θ ; θ ≥ 0}    (12)

is an exponential family [33, Section 1.5], and each P_θ ∈ P gives rise to a separate version of actinfo. This is summarized in the following proposition:

Proposition 1. Suppose the target set A is defined as in (5) for some x_0 ∈ Ω such that P_0(A) > 0. Then the actinfo

I^+(θ) = log( P_θ(A) / P_0(A) )    (13)

is a non-decreasing function of the tilting parameter θ, with I^+(0) = 0.
The intuitive interpretation of Proposition 1 is that the larger θ is, the more problem-specific knowledge is infused into P_θ, in terms of shifting probability mass towards regions in Ω where the specificity function f is large.
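For a finite Ω, the tilted family (10)-(11) is a few lines of Python. The sketch below (uniform null and f(x) = x are toy choices, not from the paper) shows how probability mass shifts toward large f as θ grows:

```python
import math

def tilt(p0, f, theta):
    """Exponentially tilted distribution P_theta(x) = P0(x) exp(theta f(x)) / M(theta)."""
    w = {x: p * math.exp(theta * f(x)) for x, p in p0.items()}
    M = sum(w.values())                       # normalizing constant M(theta)
    return {x: wx / M for x, wx in w.items()}

p0 = {x: 0.25 for x in range(4)}              # uniform null distribution on 4 states
p_tilted = tilt(p0, lambda x: float(x), 2.0)  # mass piles up on the state with largest f
```

At θ = 0 the tilt is the identity, recovering P_0, while θ → ∞ concentrates all mass on the maximizers of f, matching the weak limit P_∞ described above.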

Metropolis-Hastings systems with exponential tilting equilibrium
Inspired by Markov chain Monte Carlo methods [34], consider a Markov chain X_0, X_1, ... ∈ Ω for which P_θ is the equilibrium distribution. Consequently, if P = P_θ (that is, under the alternative hypothesis H_1 in (2) when θ > 0), X = X_t may be interpreted as the outcome of an algorithm after t iterations, provided t is so large that equilibrium has been reached. The assumption is made that this algorithm knows f and tries to explore the whole state space Ω. If the Markov chain has the equilibrium distribution (10), this corresponds to an algorithm that favors jumps towards regions of large f when θ > 0, the more so the higher the value of θ. In more detail, the transition kernel of the chain is an instance of the well-known Metropolis–Hastings (MH) algorithm [35,36], which is closely related to simulated annealing [37]. This kernel has a probability or density

π_θ(x, y) = q(x, y) α_θ(x, y) + r_θ(x) δ(x, y)    (14)

for jumps from x to y, where δ(x, •) is a point mass at x ∈ Ω, q(x, •) is a proposal distribution of jumps from a current position x of the Markov chain,

α_θ(x, y) = min{ 1, [P_θ(y) q(y, x)] / [P_θ(x) q(x, y)] }    (15)

is the probability of accepting a proposed move from x to y, whereas

r_θ(x) = Σ_y q(x, y) (1 − α_θ(x, y))    (16)

is the probability that the Markov chain rejects a proposed move away from x (for continuous sample spaces q(x, •) is a probability density, and the sum in (16) is replaced by an integral).
The transition of the Markov chain from X_t = x to the next state X_{t+1} is described in two steps as follows. First a candidate Y ∼ q(x, •) is proposed. In the second step this candidate is either accepted with probability α_θ(x, Y), so that X_{t+1} = Y, or rejected with probability 1 − α_θ(x, Y), so that X_{t+1} = X_t. It is well known that P_θ is the equilibrium distribution of this Markov chain whenever it is irreducible; that is, provided the proposal distribution q is defined in such a way that it is possible to move between any pair of states in Ω in a finite number of steps [38, pp. 243-245].
In particular, if q is symmetric and P_0 is uniform, then a proposed upward move, with f(Y) > f(x) and hence P_θ(Y) > P_θ(x), is always accepted, whereas a proposed downward move, with f(Y) < f(x), is accepted with probability P_θ(Y)/P_θ(x). The Markov chain only makes local jumps if q(x, •) puts all its probability mass in a small neighborhood of x, for each x ∈ Ω. At the other extreme is a chain with the global proposal distribution q(x, •) ∼ P_θ for all x ∈ Ω; all proposed jumps of this chain are then accepted (α_θ(x, y) = 1), and {X_t}_{t=1}^∞ is a sequence of independent and identically distributed (i.i.d.) random variables with X_t ∼ P_θ.
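A minimal sketch of this sampler, for the special case just described (uniform P_0 and a symmetric nearest-neighbour proposal on the toy state space Ω = {0, ..., n_states−1}, all illustrative choices), where the acceptance probability reduces to min{1, e^{θ(f(y)−f(x))}}:

```python
import math
import random

def mh_chain(f, theta, n_states, t, seed=0):
    """Metropolis-Hastings chain targeting P_theta ∝ e^{theta f(x)} on {0,...,n_states-1},
    with uniform null P0 and a symmetric +/-1 proposal clipped at the boundary."""
    rng = random.Random(seed)
    x = 0
    path = [x]
    for _ in range(t):
        y = min(max(x + rng.choice([-1, 1]), 0), n_states - 1)
        # symmetric proposal and uniform P0: accept with min(1, e^{theta (f(y)-f(x))})
        if rng.random() < min(1.0, math.exp(theta * (f(y) - f(x)))):
            x = y
        path.append(x)
    return path

path = mh_chain(lambda x: float(x), 1.0, 10, 5000, seed=1)
```

With θ > 0 the chain drifts toward, and then fluctuates around, the states of largest f; with θ = 0 it reduces to a symmetric random walk, i.e. a blind search.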

Active information for Metropolis-Hastings systems in non-equilibrium
Suppose for simplicity that the sample space Ω is finite, and that the states in Ω are listed in some order. Let

P_0 = (P_0(x); x ∈ Ω)    (17)

be a row vector of length |Ω| with all the null distribution probabilities, and let

Π_θ = (π_θ(x, y); x, y ∈ Ω)    (18)

be a square matrix of order |Ω| that defines the transition kernel of the Markov chain {X_t}_{t=0}^∞ of Section 3.2. If X_0 ∼ P_0, then by the Chapman–Kolmogorov equation X_t ∼ P_{θt}, where

P_{θt} = P_0 Π_θ^t.    (19)

Hence, if P = P_{θt}, then X = X_t corresponds to observing the Markov chain at time t, under the alternative hypothesis H_1 in (3). Some basic properties of the corresponding actinfo are summarized in the following proposition:

Proposition 2. Suppose X = X_t is obtained by iterating t times a Markov chain with initial distribution (17) and transition kernel (18). The actinfo then equals

I^+(θ, t) = log( P_0 Π_θ^t v / P_0 v ),    (20)

where v is a column vector of length |Ω| with ones in positions x ∈ A and zeros in positions x ∈ A^c. In particular, I^+(θ, 0) = 0, and I^+(θ, t) increases monotonically with t towards the equilibrium limit I^+(θ). Therefore, I^+(θ, t) > 0 corresponds to knowledge of f being used to generate t jumps of the Markov chain, under the alternative hypothesis H_1 in (3).
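Proposition 2's formula is directly computable from matrix powers. A pure-Python sketch with a toy 3-state chain biased toward the single target state (the kernel below is illustrative, not an MH kernel from the paper):

```python
import math

def step(p, Pi):
    """One Chapman-Kolmogorov step: row vector p times transition kernel Pi."""
    n = len(p)
    return [sum(p[i] * Pi[i][j] for i in range(n)) for j in range(n)]

def actinfo_t(P0, Pi, v, t):
    """I+(theta, t) = log( P0 Pi^t v / P0 v ), cf. Proposition 2."""
    p = list(P0)
    for _ in range(t):
        p = step(p, Pi)
    num = sum(pj * vj for pj, vj in zip(p, v))
    den = sum(pj * vj for pj, vj in zip(P0, v))
    return math.log(num / den)

P0 = [1/3, 1/3, 1/3]          # uniform start, cf. (17)
Pi = [[0.2, 0.8, 0.0],        # toy kernel drifting toward state 2
      [0.1, 0.2, 0.7],
      [0.0, 0.1, 0.9]]
v = [0, 0, 1]                 # indicator vector of the target A = {2}
```

At t = 0 no information has been used and the actinfo is zero; it then grows toward its equilibrium value as the chain mixes.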

Active information for Metropolis-Hastings systems with stopping
In Section 3.3, P = P_{θt} was obtained by starting a random search with null distribution P_0, and then iterating the Markov chain of Section 3.2 t times. However, knowledge of f can be utilized even more, by stopping the Markov chain if the target A in (5) is reached before time t. This can be formalized by introducing the stopping time

T = min{t ≥ 0 : X_t ∈ A}    (22)

and letting

P_{θts}(x) = P(X_{t∧T} = x)    (23)

be the probability distribution of the stopped Markov chain X_{t∧T}, with the last index s in (23) being an acronym for stopping. In particular,

P_{θts}(A) = P(T ≤ t)    (24)

is the probability of reaching the target A for the first time after t iterations or earlier. The theory of phase-type distributions can then be used to compute the target probability P_{θts}(A) in (23) [39,40]. To this end, clump all states x ∈ A into one absorbing state, and decompose the transition kernel in (18) according to

Π_θ = [ Π_θ^na , Π_θ^{na,a} ; 0 , 1 ],    (25)

where Π_θ^na is a square matrix of order |A^c| containing the transition probabilities between all non-absorbing states in A^c, whereas Π_θ^{na,a} is a column vector of length |A^c| with transition probabilities π_θ(x, A) from all non-absorbing states x ∈ A^c into the absorbing state A. Moreover, P_0^na = (P_0(x); x ∈ A^c) is a row vector of length |A^c| that is the restriction of the start distribution P_0 in (17) to all non-absorbing states. Then

P_{θts}(A) = 1 − P_0^na (Π_θ^na)^t 1,    (26)

where 1 is a column vector of |A^c| ones. The actinfo I_s^+ of a search procedure with stopping is thus defined:

Proposition 3. Suppose X = X_{t∧T} is obtained by iterating a Markov chain with initial distribution (17) and transition kernel (18) (for some θ ≥ 0) at most t times, and stopping whenever the set A is reached.
Then the actinfo is given by

I_s^+(θ, t) = log( P_{θts}(A) / P_0(A) ) = log( [1 − P_0^na (Π_θ^na)^t 1] / (P_0 v) ),    (27)

with P_0 and v as in Proposition 2, whereas P_0^na, Π_θ^na, and 1 are defined below (25) and (26). This actinfo satisfies

I_s^+(θ, t) ≥ I^+(θ, t),    (28)

and I_s^+(θ, t) is a non-decreasing function of t such that

lim_{t→∞} I_s^+(θ, t) = −log P_0(A) = I_{f0}    (29)

and

Σ_{t=0}^∞ [1 − P_0(A) e^{I_s^+(θ,t)}] = E(T).    (30)

Inequality (28) states that, for a search procedure with t iterations, knowledge about f that is used for stopping the Markov chain in (18) will increase the actinfo, regardless of whether knowledge about f was used (θ > 0) or not (θ = 0) when iterating the Markov chain. Equation (29) is a consequence of the fact that the target A is eventually reached with probability 1, so that the actinfo of a search procedure with stopping equals the functional information I_{f0} = −log P_0(A) after many iterations of the Markov chain. Moreover, equation (30) tells us that the rate at which P_0(A) e^{I_s^+(θ,t)} approaches 1 is determined by the expected waiting time E(T) of reaching the target.
From Proposition 3, actinfo for a system with stopping is closely related to the phase-type distribution of the waiting time T until the target is reached. This has been studied in [41], in the context of gene expression of a number of genes, with x the collection of regulatory regions of all these genes.
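The hitting probability (26) only needs powers of the sub-stochastic block Π_θ^na. A pure-Python sketch with a toy two-state transient block (all numbers illustrative):

```python
def hit_prob(P0_na, Pi_na, t):
    """P(T <= t) = 1 - P0^na (Pi^na)^t 1, cf. eq. (26): the probability that the
    stopped chain has reached the absorbing target A by time t."""
    p = list(P0_na)                   # restriction of P0 to the transient states A^c
    n = len(p)
    for _ in range(t):
        p = [sum(p[i] * Pi_na[i][j] for i in range(n)) for j in range(n)]
    return 1.0 - sum(p)

P0_na = [0.5, 0.4]      # P0 puts mass 0.1 on A itself, so P(T <= 0) = 0.1
Pi_na = [[0.5, 0.3],    # sub-stochastic block: each row leaks some mass into A
         [0.2, 0.4]]
```

Because the row sums of Π_θ^na are strictly below 1, the retained mass P_0^na (Π_θ^na)^t 1 decays geometrically, so P(T ≤ t) → 1 and the actinfo with stopping approaches −log P_0(A), as in (29).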

Estimating active information and testing fine-tuning
Suppose the random search algorithm is repeated independently, under the same conditions, n times. For instance, suppose {X_{it}}_{t=0}^∞ correspond to independent realizations i = 1, ..., n of a search algorithm. The outcomes of these independent searches are either X_i = X_{it} or X_i = X_{i,t∧T_i}, for i = 1, ..., n, depending on whether the search algorithm is stopped at a fixed time point t or at random time points {T_i}_{i=1}^n. In either case, a sample

X_1, ..., X_n ∼ Q    (31)

of i.i.d. random variables is obtained. These repeated outcomes of the search algorithm will be used to test for and estimate the degree of fine-tuning. The methodology depends on whether the null distribution P_0 is known or involves unknown nuisance parameters.

Null distribution known
Suppose the null distribution P_0 is known. The sample in (31) is then used for testing between the two hypotheses

H_0 : Q = P_0   versus   H_1 : Q ∈ P_1,    (32)

with

P_1 = {Q ; Q(A) ≥ p_min = P_0(A) e^{I_min}}    (33)

the set of distributions that correspond to fine-tuning. Suppose an estimate Q̂(A) of the probability Q(A) that X ∈ A is computed from data (31), with an associated empirical actinfo

Î^+ = log( Q̂(A) / P_0(A) ).    (34)

If Q̂(A) is a consistent estimator of Q(A), then for large sample sizes Î^+ will be close to

I_Q^+ = log( Q(A) / P_0(A) ),    (35)

which equals 0 under H_0 and I^+ = I_P^+ under H_1, for some particular P ∈ P_1. To test H_0 against H_1, reject H_0 when

Î^+ ≥ I_min,    (36)

where I_min is a pre-specified lower bound on the range of values of the actinfo that correspond to FT.

Nonparametric estimator and test
From Section 3, P = P_θ, P = P_{θt}, or P = P_{θts} involves the tilting parameter θ, and possibly also the number of iterations t of the algorithm and a stopping time T. In this section, no other assumption than P ∈ P_1 is made on P, and a nonparametric version of the empirical actinfo is used. The fraction

Q̂(A) = (1/n) Σ_{i=1}^n 1_{X_i ∈ A}    (37)

of random searches that fall into A is used as an estimate of Q(A). Therefore, (37) only requires knowledge of the set A, not of the function f. The following result establishes asymptotic normality of the nonparametric version of the estimator Î^+ in (34). Moreover, large deviations theory [42] is used to show that the significance level of the nonparametric version of the FT test (36) goes to zero exponentially fast with n:

Proposition 4. Suppose the empirical actinfo Î_n^+ in (34) is computed nonparametrically, using (37) as an estimate of the target probability Q(A). Then Î_n^+ is an asymptotically normal estimator of I_Q^+ in (35), in the sense that

√n ( Î_n^+ − I_Q^+ ) →^L N(0, V)    (38)

as n → ∞, where →^L refers to convergence in distribution, and

V = (1 − Q(A)) / Q(A)    (39)

is the variance of the limiting normal distribution. The significance level of the test (36) for fine-tuning, with threshold I_min, satisfies

lim_{n→∞} n^{−1} log P_{H_0}( Î_n^+ ≥ I_min ) = −C,    (40)

where

C = D( p_min ‖ P_0(A) )    (41)

is the Kullback–Leibler divergence between Bernoulli distributions with success probabilities p_min = P_0(A) exp(I_min) and P_0(A), respectively.
Remark 1. The conclusion of Proposition 4 is that the probability of observing an actinfo that corresponds to fine-tuning by chance decays at rate e^{−Cn} when the sample size n gets large.
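Both the nonparametric estimate (34)/(37) and the Bernoulli Kullback–Leibler rate of Proposition 4 take only a few lines of Python (the sample and numbers below are illustrative):

```python
import math

def actinfo_nonparam(samples, A, p0_A):
    """Nonparametric empirical actinfo: log of the hit fraction over P0(A)."""
    q_hat = sum(x in A for x in samples) / len(samples)
    return math.log(q_hat / p0_A)

def ld_rate(p0_A, I_min):
    """Large-deviation rate C = D(p_min || P0(A)), the Bernoulli Kullback-Leibler
    divergence with p_min = P0(A) e^{I_min}."""
    p, q = p0_A * math.exp(I_min), p0_A
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

A = {1}
samples = [1] * 30 + [0] * 70     # 30 of 100 searches hit the target
I_hat = actinfo_nonparam(samples, A, p0_A=0.05)   # log(0.30/0.05) = log 6
```

A larger FT threshold I_min gives a larger rate C, i.e. a faster exponential decay of the chance of a false FT detection.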

Parametric estimator and test
Suppose there is a priori knowledge that P is close to the parametric exponential family P of distributions in (10)-(12), for some value θ > 0 of the tilting parameter. A parametric test of actinfo is then naturally defined. For this, first compute the maximum likelihood estimate

θ̂ = argmax_{θ≥0} Σ_{i=1}^n log P_θ(X_i)    (42)

of θ, and use it to define a parametric estimate

Q̂(A) = P_θ̂(A)    (43)

of the target probability Q(A), which is inserted into (34) to define a parametric version of the empirical actinfo Î^+. As opposed to (37), the estimate (43) requires full knowledge of f. To analyze the properties of the estimator (34) and the test (36), introduce

θ* = argmin_{θ≥0} D( Q ‖ P_θ ),    (44)

where

D( Q ‖ P_θ ) = Σ_{x∈Ω} Q(x) log( Q(x) / P_θ(x) )    (45)

is the Kullback–Leibler divergence between Q and P_θ. From (44), P_θ* is the distribution in P that best approximates Q. In particular, θ* = θ if Q ∈ P and Q = P_θ for some θ ≥ 0.
The following proposition shows that Î^+ is an asymptotically normal estimator of I^+(θ*) in (13), which differs from I_Q^+ in (35) whenever Q ∉ P. Moreover, the proposition also provides large sample properties of the significance level of the test for actinfo:

Proposition 5. Suppose the empirical actinfo Î_n^+ in (34) is computed parametrically, using the estimate (43) of the target probability Q(A). Then Î_n^+ is an asymptotically normal estimator of I^+(θ*), in the sense that

√n ( Î_n^+ − I^+(θ*) ) →^L N(0, V)    (46)

as n → ∞, where the variance V of the limiting normal distribution is given by (47). Moreover, the significance level of the parametric test for fine-tuning, based on (36) and (43), satisfies

lim_{n→∞} n^{−1} log P_{H_0}( Î_n^+ ≥ I_min ) = −C    (48)

for

C = D( P_min ‖ P_0 ) = θ_min E_{P_min}(f) − log M(θ_min),    (49)

where P_min = P_{θ_min}, θ_min < θ* is the solution of P_{θ_min}(A) = p_min = P_0(A) exp(I_min), M(φ) is given by (11), and p_min is defined in (33).
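Because the family (10)-(12) has a one-dimensional parameter, the maximum likelihood estimate of θ can be found on a finite Ω by a simple grid search over the log likelihood; a sketch (the uniform null, f(x) = x, sample, and grid are all toy choices, not from the paper):

```python
import math

def mle_theta(samples, omega, p0, f, grid):
    """Grid-search MLE of the tilting parameter theta for the family
    P_theta(x) = P0(x) e^{theta f(x)} / M(theta) on a finite Omega."""
    def loglik(theta):
        M = sum(p0[x] * math.exp(theta * f(x)) for x in omega)  # M(theta)
        return sum(theta * f(x) - math.log(M) for x in samples)
    return max(grid, key=loglik)

omega = range(5)
p0 = {x: 0.2 for x in omega}            # uniform null
f = lambda x: float(x)
samples = [4] * 8 + [3] * 2             # outcomes concentrated on large f
grid = [i / 10 for i in range(51)]      # theta in {0.0, 0.1, ..., 5.0}
theta_hat = mle_theta(samples, omega, p0, f, grid)
```

The fitted P_θ̂(A) then replaces the raw hit fraction in (34); this pools information from all observations through f, which is the advantage of the parametric estimate when few searches land in A.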

Comparison between nonparametric and parametric estimates of actinfo
The two versions of empirical actinfo are complementary. The nonparametric version is preferable in the sense that it makes fewer assumptions about the distribution P of the random algorithm under H_1, and in particular it is a consistent estimator of I_Q^+ in (35). The parametric version of Î^+, on the other hand, is preferable when nQ(A) is small, since it makes use of all data in order to estimate Q(A), although it is not a consistent estimator of I_Q^+ when Q ∉ P. The asymptotic variances in (39) and (47), as well as the rates of exponential decrease of the significance level in (41) and (49), agree when Q = P_θ* and f(x) = f_0 1_{x∈A}, which is a special case of (8).

Null distribution unknown
Suppose the null distribution P_0 = P_{0ξ} involves an unknown nuisance parameter ξ ∈ Ξ. The objective is then to test the two hypotheses

H_0 : Q ∈ P_0   versus   H_1 : Q ∈ P_1,    (50)

where the sets of distributions under the null and alternative hypotheses equal

P_0 = {P_{0ξ} ; ξ ∈ Ξ}    (51)

and (33), respectively.

One sample available
The actinfo cannot be consistently estimated if only one sample (31) is available. The best that can be done is to estimate a lower bound of the actinfo

I_Q^+ = log( Q(A) / P_{0ξ}(A) ),    (52)

by means of

Î_n^+ = log( Q̂(A) / P_0max(A) ),    (53)

with P_0max(A) defined in (4) and Q̂(A) an estimate of Q(A). This estimator will have an asymptotic bias

B = log( P_{0ξ}(A) / P_{0ξ*}(A) ) ≤ 0,    (54)

with ξ* the nuisance parameter that maximizes P_{0ξ}(A) [43]. For the numerator of (53), either the nonparametric estimate of Q(A) in (37) can be used, or a parametric class of distributions can be used that involves a tuning parameter vector θ and a vector of nuisance parameters ξ. If Q is thought to be close to P, the parametric estimate

Q̂(A) = P_{θ̂ξ̂}(A)    (55)

of Q(A) is used, which generalizes (43), with

(θ̂, ξ̂) = argmax_{θ,ξ} Σ_{i=1}^n log P_{θξ}(X_i)    (56)

the maximum likelihood estimate of (θ, ξ). When the sample size n tends to infinity, the estimator (56) will converge to

(θ*, ξ*) = argmin_{θ,ξ} D( Q ‖ P_{θξ} ).    (57)

The following result is an extension of Propositions 4-5, when nuisance parameters ξ are added and a general type of tuning parameter θ (not necessarily a scalar tilting parameter) is used:

Proposition 6. Suppose the null distribution P_0 = P_{0ξ} involves an unknown parameter ξ, and the actinfo I_Q^+ in (52) is estimated by Î_n^+ in (53), using an estimator Q̂(A) of the target probability Q(A) that is either nonparametric (37) or parametric (55). Given these assumptions, Î_n^+ is an asymptotically normal estimator, in the sense that

√n ( Î_n^+ − I_Q^+ − B ) →^L N(0, V)    (58)

as n → ∞. The asymptotic bias B in (58) is defined in (54), whereas the asymptotic variance V is defined in (39) for the nonparametric estimator of I_Q^+, and by (59) for the parametric estimator of I_Q^+, with ψ_{θξ}(x) = d log P_{θξ}(x)/d(θ, ξ), (θ*, ξ*) defined as in (57), and T referring to matrix transposition. Moreover, the significance level of the test (36) of FT, with threshold I_min, satisfies

lim_{n→∞} n^{−1} log P_{H_0}( Î_n^+ ≥ I_min ) = −C    (60)

for the nonparametric version of the test, with

C = D( p_min e^{−B} ‖ P_{0ξ}(A) )    (61)

and p_min = P_{0ξ}(A) exp(I_min). For the parametric version of the FT test, in the special case when θ is a scalar exponential tilting parameter, C is given by (49), with P_min = P_{θ_min ξ}, and θ_min the solution of P_{θ_min ξ}(A) = p_min e^{−B}.
Remark 2. The negative bias term B makes the test of FT in Proposition 6 more conservative than the tests in Propositions 4-5. This can be seen, for instance, by comparing the two large deviation rates C in (41) and (61). The rate in (61) is larger, since p_min is multiplied by a term e^{−B} ≥ 1. This corresponds to the fact that it is more difficult to falsely reject H_0 in Proposition 6.

Two samples available
In addition to the first sample (31), suppose a second sample

X_{01}, ..., X_{0n_0} ∼ P_{0ξ}    (62)

of n_0 i.i.d. observations under the null distribution is available. A consistent estimator

Î_{nn_0}^+ = log( Q̂(A) / P_{0ξ̂}(A) )    (63)

of the actinfo is then obtained, with ξ̂ an estimate of ξ computed from the null sample (62). The following result provides asymptotic properties of the estimator (63) of actinfo, and of the corresponding test (36) of FT with threshold I_min:

Proposition 7. Suppose the null distribution P_0 = P_{0ξ} involves an unknown nuisance parameter ξ, and that the active information I_Q^+ in (52) is estimated by Î_{nn_0}^+ in (63), making use of the two samples (31) and (62), of sizes n and n_0, from Q and P_{0ξ} respectively. Assume further that the estimator Q̂(A) of Q(A) is either nonparametric (37) or parametric (55). If n, n_0 → ∞ in such a way that their ratio stabilizes (64), then Î_{nn_0}^+ is an asymptotically normal estimator of I_Q^+, with a limiting variance (65)-(66) that involves V_1 and ψ_ξ(x) = d log P_{0ξ}(x)/dξ. If the nonparametric estimator of Q(A) is used, then V_1 equals V in (39), whereas if the parametric estimator Q̂(A) is used, then V_1 equals V in (59). The significance level of the test (36) of FT, with threshold I_min, satisfies the same type of large deviation result (60) as in Proposition 6, for the nonparametric and parametric versions of the test (in the latter case assuming that θ is a scalar tilting parameter), but in the definitions of the nonparametric and parametric large deviation rates C, the bias term is B = 0.
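With a second, null sample available, both probabilities in the actinfo ratio can be estimated empirically. A nonparametric sketch in the spirit of (63) (the counts are illustrative, and the plug-in of a raw null hit fraction is a simplification of the parametric P_{0ξ̂}(A)):

```python
import math

def actinfo_two_sample(samples, samples0, A):
    """Two-sample actinfo estimate: log of the hit fraction in the alternative
    sample over the hit fraction in the null sample (nonparametric analogue of (63))."""
    q_hat = sum(x in A for x in samples) / len(samples)
    p0_hat = sum(x in A for x in samples0) / len(samples0)
    return math.log(q_hat / p0_hat)

A = {1}
samples = [1] * 40 + [0] * 60      # informed searches: 40% reach the target
samples0 = [1] * 5 + [0] * 95      # null searches: 5% reach the target
I_hat = actinfo_two_sample(samples, samples0, A)   # log(0.40/0.05) = log 8
```

Unlike the one-sample case, no maximization over the nuisance parameter is needed, which removes the conservative bias term B of Proposition 6.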

Examples
Example 1 (Cosmology [26,27]). Suppose there is a positive constant of nature $X \in \Omega = \mathbb{R}^+$, a life-permitting interval $A \subset \Omega$, and a specificity function (6) that equals 1 inside $A = (a, b)$ and zero elsewhere. The maximum entropy distribution under a first moment constraint $\xi = E(X)$ is exponential with expected value $\xi$. Consequently,
$$P_{0\xi}(A) = e^{-a/\xi} - e^{-b/\xi}.$$
The null and alternative hypotheses for the fine-tuning test are given in (50), where under $H_1$ the agent brings about a life-permitting value of $X$ with probability 1 ($P(A) = 1$). Only one universe is observed, with a value $X = X_1$ of the constant. Therefore, there is a sample (31) of size $n = 1$, whereas no null sample (62) is available. Since the observed universe is life-permitting, $\hat Q(A) = 1$, and the estimate (53) of actinfo then simplifies to
$$\hat I^+ = -\log P_{0\max}(A) = -\log \max_{\xi > 0} \left( e^{-a/\xi} - e^{-b/\xi} \right). \qquad (68)$$
Let $x = (a + b)/2$ be the midpoint of the LPI and suppose that half of its relative size, $\epsilon = (b - a)/(2x)$, is small. The probability in (68) is then approximated by $P_{0\max}(A) \approx 2\epsilon e^{-1}$, so that from (68) the estimated actinfo $\hat I^+ \approx -\log(2\epsilon e^{-1})$ is a monotone decreasing function of $\epsilon$.
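A numerical sketch of this computation, with hypothetical values $x = 1$ and $\epsilon = 0.01$; the grid search over $\xi$ is a crude stand-in for the exact maximization of the null target probability:

```python
import math

def p0_interval(xi, a, b):
    """P_{0,xi}(A) = exp(-a/xi) - exp(-b/xi) for an exponential null
    with mean xi and life-permitting interval A = (a, b)."""
    return math.exp(-a / xi) - math.exp(-b / xi)

# Life-permitting interval centred at x with half-relative-size eps.
x, eps = 1.0, 0.01
a, b = x * (1 - eps), x * (1 + eps)

# Maximize P_{0,xi}(A) over the nuisance parameter xi by grid search;
# the maximizer lies near the midpoint x for small eps.
p0_max = max(p0_interval(xi, a, b) for xi in [0.01 * k for k in range(1, 1001)])
i_plus_hat = -math.log(p0_max)          # estimated actinfo, since Q_hat(A) = 1
approx = -math.log(2 * eps / math.e)    # small-eps approximation
```

The exact and approximate values agree closely for small $\epsilon$, and both grow without bound as the interval shrinks.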
Example 2 (Evaluation of student test scores [44]). Suppose a number of students perform a test.
for a student with covariate vector $z$ who prepared for the test during a period of length $t$. The nuisance parameter vector $\xi = (\xi_0, \ldots, \xi_{d-1}, \sigma^2)$ involves the error variance and the regression parameters for students who did not train for the test, whereas the tuning parameter vector $\theta = (\theta_0, \ldots, \theta_{d-1})$ involves the regression parameters that correspond to the effect of preparing for the test. The unconditional distribution of the response is normal, $Y \sim N(\mu, V)$, with mean $\mu$ and variance $V$ given in (69). Therefore, the probability that a randomly chosen student, who studied for the test during a period of length $t$, passes is $1 - \Phi\big((f_0 - \mu)/\sqrt{V}\big)$, where $\Phi$ is the cumulative distribution function of a standard normal distribution. The null distribution $P_0 = P_{0\xi}$ corresponds to putting $t = 0$ in (69). Thus the actinfo quantifies how much learning, during a period of length $t$, increases the probability of passing the test. To compute an estimate $\hat I^+$ of $I^+$ in (70), estimates $\hat\xi$ and $\hat\theta$ of $\xi$ and $\theta$ are needed. This can be done by collecting two training samples, as in (63). Another option is to compute least squares estimates of the nuisance and tuning parameters $(\xi, \theta)$ jointly, without bias, from one single data set $\{(t_i, z_i, y_i)\}_{i=1}^n$, provided that the time periods $t_i$ vary, so that all parameters are identifiable.
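As a numerical sketch with hypothetical parameter values (a scalar simplification in which training shifts the mean of $Y$ from 50 to 65, with variance 100 and pass mark $f_0 = 60$), the actinfo implied by the two passing probabilities is:

```python
import math

def norm_cdf(z):
    """Standard normal CDF built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def actinfo_learning(mu0, mu_t, v, f0):
    """Actinfo of preparing for the test: log ratio of the passing
    probabilities P(Y >= f0) under N(mu_t, v) versus N(mu0, v)."""
    p_pass_tuned = 1.0 - norm_cdf((f0 - mu_t) / math.sqrt(v))
    p_pass_null = 1.0 - norm_cdf((f0 - mu0) / math.sqrt(v))
    return math.log(p_pass_tuned / p_pass_null)

# Hypothetical values: untrained mean 50, trained mean 65, variance 100.
i_plus = actinfo_learning(mu0=50.0, mu_t=65.0, v=100.0, f0=60.0)
```

A positive value indicates that the training period carries active information about passing; the scalar means here stand in for the full regression structure of (69).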
Example 3 (Reinforcement learning (RL) [45]). Consider an agent whose purpose is to maximize the reward $f(x)$ of a trajectory $x$ that it to some extent is able to control, over a time period of length $t$. At each time point $u$ there are $m$ possible environments $S = \{s_1, \ldots, s_m\}$ and $q$ possible actions $A = \{a_1, \ldots, a_q\}$ to take. The state space $X = A^t \times S^{t+1}$ consists of all possible trajectories $x = (a_0, \ldots, a_{t-1}, s_0, \ldots, s_t)$ of environments and actions, where $s_u$ is the environment and $a_u$ the action taken at time $u$. A corresponding random trajectory is denoted with capital letters $X = (A_0, \ldots, A_{t-1}, S_0, \ldots, S_t)$.
If the environment of the system is $S_u = s$ at time $u$, and action $A_u = a$ is taken, the probability of moving to environment $s'$ is $P_a(s, s') = P(S_{u+1} = s' \mid S_u = s, A_u = a)$, with an instantaneous reward of $R_a(s, s')$. If future rewards are discounted by a factor $\gamma$, the total reward over a time horizon of length $t$ is
$$f(x) = \sum_{u=0}^{t-1} \gamma^u R_{a_u}(s_u, s_{u+1}).$$
Let $f_0$ be a lower bound for a trajectory's total discounted reward to be acceptable, so that $A$ in (5) is the set of all acceptable trajectories. The agent takes actions according to some policy in order to make the expected total reward of a trajectory as large as possible. To this end, consider stationary policies, where the action $A_u$ taken by the agent at each time point $u$ is determined only by the current environment $s_u$, according to some matrix $\Pi = (\pi(s, a);\, s \in S, a \in A)$ of transition probabilities $\pi(s, a) = P(A_u = a \mid S_u = s)$. For a completely random policy $\pi(s, a) = \xi_a$, $a = 1, \ldots, q$, the action is not influenced by the current environment, and it is completely specified by the vector $\xi = (\xi_1, \ldots, \xi_q)$ of nuisance parameters. Thus $P_0(A) = P_{0\xi t}(f(X) \ge f_0)$ is the probability that an ignorant agent, with policy determined by $\xi$, will have an acceptable trajectory. An agent who knows the reward function $R_a$ and the dynamics $P_a$ of the environment will try to take this knowledge into account to formulate a policy that makes the reward as large as possible. A deterministic policy $\theta: S \to A$ is a function that assigns a unique action to each environment, so that $\pi(s, a) = 1_{\{a = \theta(s)\}}$.
Thus $P(A) = P_{\theta t}(f(X) \ge f_0)$ is the probability that an agent with deterministic policy $\theta$ obtains an acceptable trajectory. The active information quantifies, on a logarithmic scale, how much more likely it is for an agent with policy $\theta$ to obtain an acceptable trajectory, compared to an ignorant agent with policy $\xi$. The values of $\xi$ and $\theta$ are varied during the exploration phase of RL, but they are assumed to be known during the exploitation phase. Suppose we want to compute the actinfo (71) during the exploitation phase. Since $P_0(A)$ and $P(A)$ are typically unknown, they have to be estimated by Monte Carlo.
To this end, assume that two samples (31) and (62) of $n$ and $n_0$ trajectories are available, from $Q = P_{\theta t}$ and $P_0 = P_{0\xi t}$ respectively. Then $\hat I^+$ in (63) can be used to estimate the actinfo (71).
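The Monte Carlo recipe can be sketched on a hypothetical toy environment (two states, two actions, invented transition probabilities and rewards; none of these numbers come from the paper): estimate the acceptable-trajectory probability under a deterministic and under a random policy, and take the log-ratio.

```python
import math
import random

def run_episode(policy, t=5, gamma=0.9, p_move=0.9):
    """Simulate one trajectory of a toy 2-state environment. Action 1
    tries to move towards state 1, action 0 stays put; the instantaneous
    reward is the next state (1 when the move ends in state 1)."""
    s, total = 0, 0.0
    for u in range(t):
        a = policy(s)
        if a == 1:
            s_next = 1 if random.random() < p_move else 0
        else:
            s_next = s
        total += gamma ** u * s_next    # discounted reward stream
        s = s_next
    return total

def estimate_hit_prob(policy, f0, n):
    """Fraction of n simulated trajectories with total reward >= f0."""
    return sum(run_episode(policy) >= f0 for _ in range(n)) / n

random.seed(1)
f0, n = 2.5, 2000
p_tuned = estimate_hit_prob(lambda s: 1, f0, n)                    # informed, deterministic policy
p_null = estimate_hit_prob(lambda s: random.randint(0, 1), f0, n)  # ignorant, random policy
i_plus = math.log(p_tuned / p_null)
```

The estimate $\hat I^+ > 0$ reflects that the deterministic policy exploits knowledge of the dynamics, exactly the kind of problem-specific knowledge the actinfo is meant to measure.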
Example 4 (Molecular machines and Moran models [15,30,41]). Suppose $\Omega$ consists of all $2^d$ binary sequences $x = (x_1, \ldots, x_d)$ of length $d$, with a null distribution $P_0(x)$ that will be chosen below. The specificity function $f$ is defined as
$$f(x) = \begin{cases} 1, & |x| = d, \\ a|x|, & |x| < d, \end{cases}$$
where $|x| = \sum_{i=1}^d x_i$ and $a \le 1/d$ is a fixed parameter. We regard $x$ as a molecular machine with $d$ parts, with $x_i = 1$ or 0 depending on whether part $i$ functions or not. The specificity $f(x)$ quantifies how well the machine works, for instance its ability to regulate activity in vitro or in vivo in a living cell. It is assumed that $f(x)$ is determined by the number $|x|$ of functioning parts, with maximal value $f_{\max} = f(1, \ldots, 1) = 1$. Using (8), the most stringent definition of high specificity, it follows that $A = \{(1, \ldots, 1)\}$ contains only one element: a molecular machine for which all parts are in shape. The parameter $a$ is crucial. If $0 < a \le 1/d$, a molecular machine works better the more of its parts are in shape. On the other hand, if $a < 0$, then a molecular machine with some but not all parts in shape functions worse the more of its parts are in shape: all units must work in order for the whole machine to function, and there is a cost $-a$ associated with carrying each part that is in shape as long as the whole system does not function.
Each state $x$ is interpreted as a population of $N$ subjects, all having the same variant $x$ of the molecular machine. With this interpretation, $X = X_t$ is the outcome of a random evolutionary process in which all subjects of the population, at any time point $t$, have the same state. This state may however vary over time, when all subjects of the population simultaneously experience the same change. The question of interest is whether this process can modify the population so that all its members have a functioning molecular machine. A transition of this process from $x$ is caused by a mutation with distribution $q(x, \cdot)$, where $q(x, x) = 0$. Suppose a mutation from $x$ to $y$ is possible, i.e., $q(x, y) > 0$. A mutation from $x$ to $y$ first occurs in one individual, and then it either (momentarily) dies out, with probability $1 - \alpha_\theta(x, y)$, or it (momentarily) spreads to the whole population (gets fixed), with probability
$$\alpha_\theta(x, y) = C \cdot \sqrt{\frac{e^{\theta f(y)} P_0(y) q(y, x)}{e^{\theta f(x)} P_0(x) q(x, y)}}, \qquad (73)$$
where
$$C = \left[ \max \sqrt{\frac{e^{\theta f(y)} P_0(y) q(y, x)}{e^{\theta f(x)} P_0(x) q(x, y)}} \right]^{-1} \qquad (74)$$
is a constant assuring that (73) never exceeds 1, and the maximum is taken over all $x, y$ such that $x \ne y$ and both $q(x, y)$ and $q(y, x)$ are positive. The Markov chain with transition probabilities (14) and acceptance probability (73) represents the dynamics of the evolutionary process.
As shown in Section 7, the equilibrium distribution of this Markov chain is given by $P_\theta$ in (10). In particular, Propositions 2-3 remain valid when the Markov chain (14) with acceptance probabilities (73) is used, rather than (15). We will interpret
$$s(x) = e^{\theta f(x)/N} \qquad (75)$$
as the selection coefficient or fitness of individuals with a molecular machine of type $x$; that is, $s(x)$ is proportional to the fertility rate of individuals of type $x$.
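As a sketch of the equilibrium actinfo in this example, assuming for concreteness the uniform null of the case $b = 1$ and the specificity form $f(x) = a|x|$ for $|x| < d$ with $f = 1$ at the fully functioning machine (an assumed concrete form, consistent with the description above):

```python
import math
from itertools import product

def actinfo_equilibrium(d, a, theta):
    """Equilibrium actinfo I+(theta) = log(P_theta(A)/P_0(A)) for the
    exponentially tilted distribution P_theta(x) ∝ P_0(x) e^{theta f(x)}
    on binary strings, with uniform null P_0 (the case b = 1) and
    target A = {(1,...,1)}."""
    def f(x):                            # assumed: f = a|x| if |x| < d, else 1
        k = sum(x)
        return 1.0 if k == d else a * k
    weights = {x: math.exp(theta * f(x)) for x in product((0, 1), repeat=d)}
    m = sum(weights.values())            # normalizing constant (times 2^d)
    p_theta_A = weights[(1,) * d] / m
    p0_A = 2.0 ** -d
    return math.log(p_theta_A / p0_A)

# The actinfo vanishes at theta = 0 and grows with the tilting strength:
vals = [actinfo_equilibrium(d=5, a=0.1, theta=th) for th in (0.0, 1.0, 2.0)]
```

This reproduces, in miniature, the behavior described below: $I^+(\theta)$ is zero without tilting and increases in $\theta$.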
The MH-type Markov chain with acceptance probability (73)-(74) represents an evolutionary process that closely resembles a Moran model with selection [46-48], which is frequently used for describing evolutionary processes (see Section 7). The Moran model is a continuous-time Markov chain for a population with overlapping generations, where individuals die at the same rate and are replaced by offspring of individuals in the population, proportionally to their selection coefficients $s(x)$. New types arise when an offspring of a parent of type $x$ mutates, with probability $\mu(x)$. If the mutation rate is small ($\mu(x) \ll N^{-1}$ for all $x \in \Omega$), then to a good approximation the whole population has the same type at any point in time, a so-called fixed-state assumption.
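A quick numerical check of this correspondence, using the classical fixation probability $(1 - s^{-1})/(1 - s^{-N})$ of a single mutant with relative fitness $s$, and $s = e^{\theta[f(y)-f(x)]/N}$ as in (110): for small $\theta[f(y) - f(x)]$ the fixation probability is close to $N^{-1} e^{\theta[f(y)-f(x)]/2}$, the approximation that links the Moran model to the MH-type acceptance probability.

```python
import math

def moran_fixation(delta, n_pop):
    """Fixation probability of a single mutant with relative fitness
    s = exp(delta / n_pop) in a Moran population of size n_pop."""
    if delta == 0.0:
        return 1.0 / n_pop                  # neutral limit
    s = math.exp(delta / n_pop)
    return (1.0 - 1.0 / s) / (1.0 - s ** (-n_pop))

# Relative error of the small-delta approximation exp(delta/2)/N for a
# few values of delta = theta*[f(y) - f(x)]:
n_pop = 100
rel_errors = [
    abs(moran_fixation(d, n_pop) - math.exp(d / 2.0) / n_pop)
    / moran_fixation(d, n_pop)
    for d in (0.05, 0.1, 0.2)
]
```

The relative errors stay well below one percent for these values, consistent with the claim that the two processes agree up to a time rescaling.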
Even though the Moran model is specified in continuous time, time can be discretized as $t = 0, 1, 2, \ldots$ by only recording the population when individuals die. If each individual dies at rate 1, deaths occur in the population at rate $N$, so that time is counted in units of $N^{-1}$ generations. The fixed-state assumption is motivated by assuming that a newborn offspring with a new mutation either dies out or spreads to the whole population (gets fixed) right after birth. In this context, $q$ corresponds to the way in which mutations change the type of the individual, whereas $\alpha_\theta = \alpha_{\theta N}$ is the probability of fixation. If $q(x, y)$ is the conditional probability that an offspring of a type-$x$ parent mutates to $y$, given that a mutation occurs, then the proposal kernel of the Moran model is given by (76). As shown in Section 7, the acceptance (or fixation) probability (77) of the Moran model approximates $N^{-1} e^{\theta[f(y)-f(x)]/2}$ when $\theta[f(y) - f(x)]$ is small. From (76)-(77), the Moran model approximates the Metropolis-Hastings kernel with acceptance probabilities (73)-(74) with good accuracy when i) $\mu(x) \equiv \mu$, ii) $P_0$ is uniform and iii) the proposal kernel $q$ is symmetric (i.e. $q(x, y) = q(y, x)$), although the time scales of the two processes differ. More specifically, if i)-iii) hold, a time-shifted version of the Moran model approximates the MH-type model with acceptance probabilities (73)-(74), so that each time step of the MH-type Markov chain corresponds to $C/\mu$ generations of the Moran model. However, even under assumptions i)-iii), the stationary distribution of the Moran model differs slightly from $P_\theta$. The proposal kernel $q(x, y)$ is assumed to be local, satisfying (78), where $e_j = (0, \ldots, 0, 1, 0, \ldots, 0)$ is a row vector of length $d$ with a 1 in position $j \in \{1, \ldots, d\}$ and zeros elsewhere, whereas $x + e_j$ refers to component-wise addition modulo 2, corresponding to a switch of component $j$ of $x$. A change of component $j$ from 0 to 1 is caused by a beneficial mutation, whereas a change from 1 to 0 corresponds to a deleterious mutation. Consequently, $b > 0$ is the ratio between the rates at which beneficial and deleterious mutations occur. The kernel $q$ in (78) is symmetric only when beneficial and deleterious mutations have the same rate ($b = 1$). The more general case of asymmetric $q$ is handled differently by the MH-type algorithm and the Moran model. Whereas the MH-type algorithm elevates the acceptance probability (73) of seldom-proposed states $y$ (those $y$ for which $q(x, y)$ is small for many $x$), this is not the case for the acceptance probability (77) of the Moran model. To avoid these states $y$ being reached too often by the MH-type algorithm, the null distribution $P_0$ of no selection has to be chosen so that $P_0(y)$ is small for rarely proposed states (whereas the Moran model needs no such correction). Therefore $P_0$ in (73) is chosen as the stationary distribution of a transition kernel (14) for which $\theta = 0$ and all candidates are accepted ($\alpha_0(x, y) = 1$). That is, if $\Pi_0$ refers to the transition matrix of such a Markov chain, the initial distribution $P_0$ in (17) is chosen as the solution of (79). The null distribution $P_0 = P_{0b}$ in (79) involves one single nuisance parameter $\xi = b$. In the special case when beneficial and deleterious mutations have the same rate ($b = 1$), this procedure generates a uniform distribution $P_0(x) \equiv 2^{-d}$. On the other hand, states $x$ with many functioning parts are harder to reach by the Markov process $\Pi_0$ when beneficial mutations occur less frequently than deleterious ones ($0 < b < 1$), resulting in smaller values of $P_0(x)$. The distribution under the alternative hypothesis, $P = P_{\theta b t}$, involves the nuisance parameter $b$, the time point $t$ at which the state of the population is recorded, and θ = (a, θ),
the two parameters that determine how much background information the MH-type evolutionary algorithm makes use of. For simplicity, $a$ and $b$ are here regarded as constants, and only $\theta$ and $t$ are included in the notation. This gives rise to an active information $I^+(\theta, t)$, given in (80). The MH-type algorithm is studied for $d = 5$, and illustrated in Figures 1-3. Note that the functional information $I_{f_0}$ is a decreasing function of $b$, since it is more surprising to find a working molecular machine by chance when the rate $b$ of beneficial mutations is small. Moreover, the active information $I^+(\theta) = \lim_{t\to\infty} I^+(\theta, t)$ for the equilibrium distribution of the Markov chain, as well as the active informations $I^+(\theta, t)$ and $I^+_s(\theta, t)$ for a system in non-equilibrium, without and with stopping, are increasing functions of $\theta$ and decreasing functions of $a$ and $b$. The smaller $a$ or $b$ is, the more external information can be infused to increase the probability of reaching the fine-tuned state $(1, \ldots, 1)$ of a working molecular machine. When $a$ is small, it becomes more difficult to leave this state once it is reached, and consequently $I^+_s(\theta, t)$ is only marginally larger than $I^+(\theta, t)$.

Example 5 (Evolutionary programming algorithms). Suppose $\Omega = \Omega_{\mathrm{ind}}^N$ is a set of genetic variants from some genomic region, $x = (x_1, \ldots, x_N)$, for the members of a population of size $N$. That is, $x_k \in \Omega_{\mathrm{ind}}$ is the variant of this genomic region for individual $k$. If, for instance, the region codes for the molecular machine of Example 4, we let $x_k = (x_{k1}, \ldots, x_{kd}) \in \{0, 1\}^d = \Omega_{\mathrm{ind}}$, with $x_{kj} = 1$ or 0 depending on whether component $j$ of this machine works or not for individual $k$. Let $g(x_k)$ be the biological fitness, or expected number of offspring, of $k$. In the context of molecular machines, the logarithm of $g(x_k)$ could be a function of the number of functioning parts of a machine of type $x_k$. The specificity function of a population in state $x$ is the average fitness of its individuals. The targeted set $A$ in (5) corresponds to all genetic profiles with an average fitness of at least $f_0$. This type of model is frequently used in genetic programming, as well as in other types of evolutionary programming algorithms, to mimic the evolution of $N$ individuals over time [49,50]. Typically, the output $X = X_t$ of the evolutionary algorithm is the last step of a simulation $\{X_s = (X_{s1}, \ldots, X_{sN})\}_{s=0}^t$ of the population over $t$ generations. Once the distributions $P_0 = P_{0\xi t}$ and $P = P_{\theta\xi t}$ of $X$ are found under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, the actinfo $I^+$ can be computed according to (1). This actinfo quantifies, on a logarithmic scale, how much more likely it is for the average fitness of the population to exceed $f_0$ at time $t$, for a population with externally infused information ($H_1$) compared to an evolutionary process where no such external information is used ($H_0$). For instance, if a molecular machine needs all its parts in order to function ($g(x_k) = 1(|x_k| = d)$), then the actinfo at time $t$ equals (81). Since the state space $\Omega$ is very large, it is often complicated to find explicit, analytical expressions for the actinfo $I^+$ in (81). Suppose the nuisance parameters $\xi$ of the null distribution $P_0 = P_{0\xi}$ are known. This makes the framework of Section 4.1 applicable, running the evolutionary algorithm $n$ times. That is, $n$ i.i.d. copies $\{X_{is}\}_{s=0}^t$ of the population trajectory are generated up to time $t$, for $i = 1, \ldots, n$. Then $X_i = X_{it} = (X_{it1}, \ldots, X_{itN})$, $i = 1, \ldots, n$, are used for computing an estimate $\hat I^+_n$ of the actinfo, and to test for fine-tuning, according to Section 4.1. Recall the fixed-state assumption of Example 4, whereby all individuals of the population, at any time point, have the same state. Such an assumption is only realistic when $N\mu \ll 1$, that is, when the mutation rate $\mu$ and/or the population size $N$ is small. It corresponds to a scenario where $P_0$ and $P$ put all their probability mass along the diagonal $\{(x, \ldots, x);\, x \in \Omega_{\mathrm{ind}}\}$ (82) of $\Omega$. Since (82) is equivalent to the reduced state space $\Omega_{\mathrm{ind}}$, the fixed-state assumption greatly simplifies the analysis. For instance, it often makes it possible to find analytical expressions for the actinfo $I^+$, rather than having to estimate it.
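A toy Monte Carlo sketch of this estimation scheme (everything here — population size, mutation rate, the selection weight $e^{2|x|}$, and the threshold — is an invented illustration, not the model of the paper): run the evolutionary algorithm repeatedly with and without externally infused selection, and take the log-ratio of the frequencies with which the average fitness exceeds $f_0$.

```python
import math
import random

def evolve(pop_size, d, t, mu, select, rng):
    """Simulate t generations of a population of d-bit genomes starting
    from all-zero genomes. With select=True, parents are sampled with
    probability proportional to exp(2*|x|) (externally infused
    information); with select=False there is mutation-driven drift only.
    Returns the final average fitness |x|/d over the population."""
    pop = [[0] * d for _ in range(pop_size)]
    for _ in range(t):
        if select:
            w = [math.exp(2 * sum(x)) for x in pop]
            pop = [list(p) for p in rng.choices(pop, weights=w, k=pop_size)]
        pop = [[b ^ (rng.random() < mu) for b in x] for x in pop]  # mutate bits
    return sum(sum(x) for x in pop) / (pop_size * d)

def hit_prob(select, f0, n_runs, rng):
    """Fraction of n_runs replicate simulations whose final average
    fitness reaches the threshold f0."""
    return sum(
        evolve(20, 4, 40, 0.05, select, rng) >= f0 for _ in range(n_runs)
    ) / n_runs

rng = random.Random(7)
f0, n_runs = 0.55, 600
p_tuned = hit_prob(True, f0, n_runs, rng)
p_null = hit_prob(False, f0, n_runs, rng)
i_plus_hat = math.log(p_tuned / p_null)
```

The replicate runs play the role of the n i.i.d. copies above, and the nonparametric log-ratio is the estimate $\hat I^+_n$ of Section 4.1.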

Discussion
In this article, a general statistical framework is provided for using active information to quantify the amount of pre-specified external knowledge an algorithm makes use of or, equivalently, how tuned the algorithm is. The theory is based on quantifying, for each state x, how specified it is, by means of a real-valued function f(x). An algorithm with external information either makes use of knowledge of f directly, or at least incorporates knowledge that tends to move the output of the algorithm towards more specified regions. The Metropolis-Hastings Markov chain incorporates knowledge of f directly, in terms of the acceptance probability of proposed moves. The learning ability of this algorithm was analyzed by studying its active information, with or without stopping when the targeted set of highly specified states is reached. When independent outcomes of an algorithm are available, nonparametric and parametric estimators of the actinfo of the algorithm were also developed, as well as nonparametric and parametric tests of FT.
This work can be extended in different ways. A first extension is to find conditions under which the actinfo $I^+(\theta, t)$ of a stochastic algorithm, based on a random start (according to the null distribution of a non-guided algorithm) followed by t iterations of the Metropolis-Hastings Markov chain (without stopping), is a non-decreasing function of t. We conjecture that this typically is the case, but we have not obtained any general conditions on the distribution q of proposed candidates for this result to hold.
A second extension is to widen the notion of specificity, so that not only the functionality $f(x)$ but also the rarity $P_0(x)$ of the outcome $x$ under the null distribution is taken into account. A class $g_\theta(x)$ of such specificity functions is given in (83), where $\theta > 0$ is a parameter that controls the tradeoff between scenarios where either functionality or rarity under the null is the most important determinant of specificity. The case $\theta = 0$ in (83) corresponds to functionality having no impact, so that $g_0(x)$ reduces to Shannon's self-information of $x$. The case $g_1(x)$ was proposed in [15], whereas $g_\theta(x)$ is solely determined by $f(x)$ in the limit of large $\theta$. A third extension is to generalize the notion of actinfo so that it includes not only the probability of reaching a targeted set $A$ of highly specified states under $H_0$ and $H_1$, but also accounts for the conditional distribution of the states within $A$, given that $A$ has been reached. This is related to the way in which functional sequence complexity generalizes functional information [51-54].
The functional sequence complexity $FSC_0$ is the reduction in entropy, under the null hypothesis $H_0$, of the highly specified states in $A$, compared to the entropy under $H_0$ of all states in $\Omega$. $FSC_0$ reduces to the functional information $I_{f_0}$ when $P_0$ is uniform over $\Omega$. In a similar vein, the active uncertainty reduction $UR^+$ is introduced. Then $UR^+ = I^+$ when $P_{0A}$ and $P_A$ are uniformly distributed on $A$. This happens, for instance, when $P_0$ has a uniform distribution on $\Omega$ and $P = P_\theta$ for some $\theta > 0$, and if (8) holds. The properties of $UR^+$ deserve to be analyzed in more detail, for instance by investigating how it differs from the actinfo $I^+$.
A fourth extension would be to apply the concept of actinfo to other genetic models. Example 4 is, to our knowledge, the first time that actinfo has been applied to the Moran model. Previously, actinfo was used in population genetics to study fixation times for the Wright-Fisher model, a model in which time is discrete and generations do not overlap [55].

Proofs
Proof of Proposition 1. Introduce
$$J(\theta) = \sum_{x \in A^c} P_0(x) e^{\theta(f(x) - f_0)}, \qquad K(\theta) = \sum_{x \in A} P_0(x) e^{\theta(f(x) - f_0)} \qquad (84)$$
when $\Omega$ is finite, and replace the sums in (84) by integrals when $\Omega$ is continuous. Then
$$P_\theta(A) = \frac{K(\theta)}{J(\theta) + K(\theta)}. \qquad (85)$$
Since $P_0(A) < 1$, it follows that $J(\theta)$ is a strictly decreasing function of $\theta \ge 0$, whereas $K(\theta)$ is a non-decreasing function of $\theta$. From this it follows that $P_\theta(A)$ is a strictly increasing function of $\theta$, and consequently $I^+(\theta) = \log[P_\theta(A)/P_0(A)]$ is a strictly increasing function of $\theta$ as well.
Moreover, $K(\theta) \ge P_0(A) > 0$ for all $\theta \ge 0$, and $J(\theta) \to 0$ as $\theta \to \infty$ follows by dominated convergence. In conjunction with (85), this implies $P_\theta(A) \to 1$ and $I^+(\theta) \to -\log P_0(A)$ as $\theta \to \infty$.

Proof of Proposition 2. Equation (20) follows from (17), (19) and the fact that $P_{\theta t}(A) = P_{\theta t} v$, since $v$ is a column vector of length $|\Omega|$ with ones in positions $x \in A$ and zeros in positions $x \in A^c$. Equation (21) is equivalent to proving that $P_{\theta t}(A) \to P_\theta(A)$ as $t \to \infty$. But this follows from the fact that $P_\theta$ is the equilibrium distribution of the Markov chain with transition kernel (18). That is, letting $t \to \infty$ in (19), we find that $P_{\theta t}$ converges to $P_\theta$, and therefore $P_{\theta t}(A) \to P_\theta(A)$.

Proof of Proposition 3. Equation (28) follows from the definitions of $I^+(\theta, t)$ and $I^+_s(\theta, t)$ in (20) and (27), and the fact that $P_{\theta t s}(A) = P(T \le t) \ge P_{\theta t}(A)$, where the inequality is a consequence of the definition of $T$ in (22). Since $P_{\theta t s}(A) = P(T \le t) \le P(T \le t+1) = P_{\theta, t+1, s}(A)$, we have proved that $I^+_s(\theta, t)$ is non-decreasing in $t$. Equation (29) follows from the definition of $I^+_s(\theta, t)$ and the fact that
$$\lim_{t \to \infty} P_{\theta t s}(A) = P(T < \infty) = 1. \qquad (86)$$
The last equality of (86) is a consequence of the fact that the Markov chain with transition kernel $\Pi_\theta$ is irreducible, so that any state $x \in \Omega$ will be reached with probability 1; in particular, the targeted set $A$ will be reached with probability 1. In order to verify (30), we first deduce (87)-(89).

Proof of Propositions 4-5. It is a standard result from the asymptotic theory of maximum likelihood estimation and M-estimation (see, e.g., Chapter 6 of [33]) that (90) holds, with asymptotic variance $U$ given by (91)-(92). To simplify (92), notice that the score function can be written as
$$\psi_\theta(x) = \frac{d \log P_\theta(x)}{d\theta} = f(x) - E_\theta[f(X)] \qquad (93)$$
for the exponential family of tilted distributions (10)-(11). From this it follows that
$$-\frac{d^2 \log P_\theta(x)}{d\theta^2} = \mathrm{Var}_\theta[f(X)] \qquad (94)$$
is a constant, not depending on $x$. Inserting the last two displayed equations into (92), the formula in (91) for the asymptotic variance of $\hat\theta$ is obtained. As a next step we notice that (95) holds, where (96) follows from the definition of $P_\theta(x)$ in (10). Differentiating (96) with respect to $\theta$, we find (97), and it follows from the right-hand side of (97) that (98) holds. Then we combine (95) and (97) and obtain (99). Finally, we use the Delta Method to conclude that
$\hat I^+$ is an asymptotically normal estimator (38) of $I^+(\theta^*)$, with asymptotic variance $V = g'(\theta^*)^2 U$ which, in view of (91) and (99), agrees with (47).
In order to prove the large deviation result (48) for the parametric test of FT, let $\theta_{\min}$ be the value of the tilting parameter that satisfies $P_{\theta_{\min}}(A) = p_{\min} = P_0(A)\exp(I_{\min})$. Then notice that the rejection probability can be rewritten as in (100), where in the third step we utilized that $\hat\theta \ge \theta_{\min}$ is equivalent to the derivative of the log-likelihood of the data being non-negative at $\theta_{\min}$, and in the fourth step we made use of (93) and introduced $P_{\min} = P_{\theta_{\min}}$. But this last line is a large-deviations probability. It then follows from large-deviations theory that (48) holds, with $C$ the Legendre-Fenchel transform in (49).
Proof of Proposition 6. Since the bias-corrected empirical actinfo behaves like (34), with $P_0 = P_{0\xi}$, the asymptotic normality result for the nonparametric version of the estimator of $I^+_Q$ follows from Proposition 4.
Since $p_{\min} = P_{0\xi}(A)\exp(I_{\min})$, we have (105). From this and (100) it follows that the nonparametric test of FT behaves as the corresponding nonparametric test of Proposition 4, with the null probability $P_0(A)$ replaced by $P_{0\xi}(A)$, and $p_{\min}$ replaced by $p_{\min} e^{-B}$. Therefore, the large deviation result (61) follows from (41). In a similar way, the large deviation result for the parametric version of the FT-test (in the special case when $\theta$ is a scalar exponential tilting parameter) follows from (100), (105) and Proposition 5.
Proof of Proposition 7. Because of (52) and (63), we have the decomposition (106), where (107) and (108) describe the asymptotic distributions of $\sqrt{n}\,\log \hat Q(A)$ and $\sqrt{n_0}\,\log P_{0\hat\xi}(A)$ respectively. It follows from the proofs of Propositions 4-5 that the asymptotic variance $V_1$ in (107) is the same as $V$ in (39) and (59), for the nonparametric and parametric versions of $\hat Q(A)$ respectively. The asymptotic variance $V_2$ in (108) is given by (67). This is proved using the delta method (similarly as for Proposition 6), making use of the fact that $\hat\xi$ is the maximum likelihood estimator of $\xi$, with an asymptotic variance that is the inverse $E[\psi^T_\xi(X)\psi_\xi(X)]^{-1}$ of the Fisher information matrix. The asymptotic normality result (66) then follows from (106)-(108), the fact that $n/n_0 \to \lambda$, and the independence of the two samples.
The large deviations results are proved in a similar way as in Proposition 6, replacing $P_{0\max}(A)$ by $P_{0\hat\xi}(A)$. Using the statistical consistency $\hat\xi \xrightarrow{p} \xi$ as $n_0 \to \infty$, it follows that the large deviation rates $C$ of Proposition 7, for the nonparametric and parametric versions of the FT tests, are the same as in Proposition 6, with bias term $B = 0$.
Details from Example 4. In order to prove that the Metropolis-Hastings type Markov chain (14) with acceptance probabilities (73) has equilibrium distribution $P_\theta$, we first notice that for any pair of states $x \ne y$, the flow of probability mass
$$P_\theta(x)\pi_\theta(x, y) = P_\theta(x)\, q(x, y)\, \alpha_\theta(x, y) = \frac{P_0(x) e^{\theta f(x)}}{M(\theta)}\, q(x, y) \cdot C \sqrt{\frac{e^{\theta f(y)} P_0(y) q(y, x)}{e^{\theta f(x)} P_0(x) q(x, y)}} = \frac{C}{M(\theta)} \sqrt{e^{\theta f(x)} P_0(x) q(x, y)\; e^{\theta f(y)} P_0(y) q(y, x)} \qquad (109)$$
from $x$ to $y$ is symmetric with respect to $x$ and $y$. Therefore, the flow $P_\theta(y)\pi_\theta(y, x)$ of probability mass in the opposite direction, from $y$ to $x$, is the same as in (109). A Markov chain with this property is called reversible [57, pp. 11-12], and it is well known that $P_\theta$ is a stationary distribution if the Markov chain is reversible with reversible measure $P_\theta$ [58, p. 238]. If, additionally, the proposal distribution $q$ is such that it is possible to move between any pair of states in a finite number of steps, the Markov chain is irreducible, and hence $P_\theta$ is its unique stationary distribution, which is also the equilibrium distribution of the Markov chain [58, p. 232]. Next we motivate formula (77) for the acceptance probability of the Moran model. Assume that the population evolves over time as a Moran model, and that all individuals have type $x$.
If one individual mutates from $x$ to $y$ then, because of (75), the relative fitness between the $N - 1$ individuals of type $x$ and the newly mutated individual of type $y$ is
$$s = \frac{e^{\theta f(y)/N}}{e^{\theta f(x)/N}} = e^{\theta[f(y) - f(x)]/N}. \qquad (110)$$
From the theory of Moran models (e.g., [41,59]), it is well known that the fixation probability of the newly mutated individual is
$$\beta = \frac{1 - s^{-1}}{1 - s^{-N}}. \qquad (111)$$
Inserting (110) into (111), we find (when $s \ne 1$, or equivalently when $\Delta = \theta[f(y) - f(x)] \ne 0$) that
$$\beta = \frac{1 - e^{-\Delta/N}}{1 - e^{-\Delta}} \approx N^{-1} e^{\Delta/2} \qquad (112)$$
for small $\Delta$, which is equivalent to (77).

$$C = p_{\min} e^{-B} \log \frac{p_{\min} e^{-B}}{P_{0\xi}(A)} + (1 - p_{\min} e^{-B}) \log \frac{1 - p_{\min} e^{-B}}{1 - P_{0\xi}(A)}$$

Let $x = (z, y) = (z_1, \ldots, z_{d-1}, y) \in \mathbb{R}^d$ summarize the characteristics of a student, with covariates $z$ that are used to predict the outcome $y$ of the test. The specificity function $f(x) = x_d = y$ equals the student's test score, and (5) corresponds to the set of students that pass the test, with a minimally allowed score of $f_0$. The population of students follows a $(d-1)$-dimensional multivariate normal distribution $Z \sim N(m, \Sigma)$, where $m = (m_1, \ldots, m_{d-1})$ and $\Sigma = (\sigma_{jk})_{j,k=1}^{d-1}$ are known. The conditional distribution of the response follows a multiple linear regression model