Article

Assessing, Testing and Estimating the Amount of Fine-Tuning by Means of Active Information

by Daniel Andrés Díaz-Pachón 1 and Ola Hössjer 2,*

1 Division of Biostatistics, University of Miami, Miami, FL 33136, USA
2 Department of Mathematics, Stockholm University, 114 19 Stockholm, Sweden
* Author to whom correspondence should be addressed.
Entropy 2022, 24(10), 1323; https://doi.org/10.3390/e24101323
Submission received: 23 August 2022 / Revised: 15 September 2022 / Accepted: 19 September 2022 / Published: 21 September 2022
(This article belongs to the Special Issue Recent Advances in Statistical Theory and Applications)

Abstract:
A general framework is introduced to estimate how much external information has been infused into a search algorithm, the so-called active information. This is rephrased as a test of fine-tuning, where tuning corresponds to the amount of pre-specified knowledge that the algorithm makes use of in order to reach a certain target. A function f quantifies specificity for each possible outcome x of a search, so that the target of the algorithm is a set of highly specified states, whereas fine-tuning occurs if it is much more likely for the algorithm to reach the target as intended than by chance. The distribution of a random outcome X of the algorithm involves a parameter θ that quantifies how much background information has been infused. A simple choice of this parameter is to use θ f in order to exponentially tilt the distribution of the outcome of the search algorithm under the null distribution of no tuning, so that an exponential family of distributions is obtained. Such algorithms are obtained by iterating a Metropolis–Hastings type of Markov chain, which makes it possible to compute their active information under the equilibrium and non-equilibrium of the Markov chain, with or without stopping when the targeted set of fine-tuned states has been reached. Other choices of tuning parameters θ are discussed as well. Nonparametric and parametric estimators of active information and tests of fine-tuning are developed when repeated and independent outcomes of the algorithm are available. The theory is illustrated with examples from cosmology, student learning, reinforcement learning, a Moran type model of population genetics, and evolutionary programming.

1. Introduction

When Gödel published his incompleteness theorems [1], the mathematical world was shaken to such an extent that, to date, it has neither recovered nor fully assimilated the consequences [2]. Hilbert’s program to base mathematics on a finite set of axioms had previously been pursued by Alfred North Whitehead and Bertrand Russell [3]. However, this approach turned out to be impossible when Gödel proved that no finite set of axioms in a formal system can prove all its true statements, including its own consistency. At a similar but lesser scale, when David Wolpert and William MacReady published their No Free Lunch Theorems (NFLTs, [4,5]), there was disquiet in the community because these results imply that there is no one-size-fits-all algorithm that can do well in all searches [6], and thus that a “theory of everything” is not possible in machine learning. Wolpert and MacReady concluded that it was necessary to incorporate “problem-specific knowledge into the behavior of the algorithm” [5]. Active information (actinfo) was thus introduced in order to measure the amount of information carried by such problem-specific knowledge [7,8]. More specifically, the NFLTs say that no search works better on average than a blind search, i.e., a search according to a uniform distribution. Accordingly, actinfo is defined as
$$I^+ = \log \frac{P(A)}{P_0(A)},$$
where A ⊆ Ω is the non-empty target of the search algorithm, a subset of the finite sample space Ω, and P0 is the uniform probability measure (P0(A) = |A|/|Ω|). Here, P must be seen as the probability measure induced by the problem-specific knowledge of the researcher, whereas P0 is the underlying distribution assumed in the NFLTs. The latter corresponds to the absence of problem-specific knowledge, in accordance with Bernoulli’s Principle of Insufficient Reason (PoIR). An equivalent characterization of actinfo is the reduction in functional information
$$I^+ = I_{f_0} - I_f = -\log P_0(A) - (-\log P(A))$$
between algorithms that do not and do make use of background knowledge. The name functional information was introduced by Szostak and collaborators [9,10]. It refers to applications wherein A corresponds to all outcomes of an algorithm that are functional according to some criterion. Then, I_f0 and I_f are the self-information (measured in nats) of the event that an algorithm X produces a functional outcome, given that it was generated under P0 and P, respectively.
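To make the definitions concrete, the following minimal sketch computes the actinfo (1) and the functional informations for a hypothetical finite search problem, with a uniform null distribution and an assumed target probability under P; all numbers are illustrative, not taken from the paper.

```python
import numpy as np

# Toy illustration of active information (1): a sample space of 100 states,
# a target A of 4 states, a uniform null P0, and a hypothetical alternative P
# that an informed search is assumed to induce.
omega_size = 100
target = {3, 17, 42, 99}                 # target set A
P0_A = len(target) / omega_size          # blind search: P0(A) = |A| / |Omega|
P_A = 0.40                               # assumed informed-search probability P(A)

I_f0 = -np.log(P0_A)                     # functional information of the blind search
I_f = -np.log(P_A)                       # self-information under P
I_plus = np.log(P_A / P0_A)              # active information (1)

print(f"I_f0 = {I_f0:.3f} nats, I_f = {I_f:.3f} nats")
print(f"I+   = {I_plus:.3f} nats (= I_f0 - I_f = {I_f0 - I_f:.3f})")
```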
Suppose we do not know whether problem-specific knowledge has been used or not when the random search X ∈ Ω was generated. This corresponds to the hypothesis testing problem
$$H_0: X \sim P_0, \qquad H_1: X \sim P,$$
where data are generated from distributions P0 and P under the null and alternative hypotheses H0 and H1, respectively. It follows from (1) that I+ is the log likelihood ratio when testing H0 against H1, if data are censored so that only the event X ∈ A is observed.
When the sample space Ω is finite, or a continuous and bounded subset of a Euclidean space, the PoIR can be motivated by the fact that the uniform distribution maximizes the Shannon entropy, and thereby maximizes ignorance about the outcome of X. However, the uniform distribution is not a feasible choice of P0 for countably infinite sample spaces, nor for continuous, unbounded sample spaces. For this reason, actinfo was generalized to deal with unbounded spaces [11], by choosing P0 to maximize the Shannon entropy under side constraints ξ, such as the existence of various moments. This gives rise to a family of null distributions P0 = P0ξ, with ξ a nuisance parameter that has to be estimated or controlled for in order to estimate or bound the active information.
Actinfo has also been used for mode detection in unsupervised learning, among other applications [12,13]. Based on previous work by Montañez [14,15], actinfo has also been used for testing hypotheses [16]. More specifically, P is regarded as a random measure in [16], so that I+ is random as well and expressions for the tail probability of I+ can be found.

1.1. Fine-Tuning

Fine-tuning (FT) was introduced by Carter in physics and cosmology [17]. According to FT, the constants in the laws of nature and/or the boundary conditions in the standard models of physics must belong to intervals of low probability in order for life to exist. Since its inception, FT has generated a great deal of fascination, as seen in multiple popular science books (e.g., [18,19,20,21]) and scientific articles (e.g., [22,23,24,25]). For a given constant of nature X, the connection between FT and active information can be described in three steps:
(i)
Establishing the life-permitting interval (LPI) A that allows the existence of life for the constant, with Ω = (0, ∞) = ℝ+ or Ω = ℝ denoting the range of values that this constant could possibly take, including those that do not permit life.
(ii)
Determining the probability P0(A) of such an LPI. If P0 = P0ξ contains unknown parameters ξ, find an upper bound
$$P_0^{\max}(A) = \max_{\xi} P_{0\xi}(A)$$
of P0(A).
(iii)
Suppose that H1 corresponds to an agent who uses background knowledge of what is required for life to exist in order to bring about a constant of nature X that definitely permits life (P(A) = 1). The active information I+ = I_f0 = −log P0(A) is then a measure of how much background knowledge this agent has infused. Following [26,27], we conclude that X is finely tuned when the lower bound −log P0^max(A) of I+ = I_f0 is large enough. That is, FT corresponds to infusing a high degree of background knowledge into a problem.
Fine-tuning has also been used in biology. Dingjan and Futerman explored it for cell membranes [28,29], whereas Thorvaldsen and Hössjer [30] formalized it for a large class of biological models. According to [30], a system is fine-tuned if it satisfies the two following requirements:
(a)
It has an independent specification;
(b)
It is very unlikely to occur by chance.

1.2. The Present Article

In this article, actinfo will not only be used in the algorithmic sense. It will also be employed for testing the presence of, and estimating the degree of, fine-tuning (FT) of a search algorithm or agent who brings about X. Our definition of FT relies on (a) and (b), and in order to formalize these two concepts, we introduce a specificity function f, which quantifies, in terms of f(x), how specified an outcome x ∈ Ω is. The target A, on the other hand, is a set of highly specified states, that is, all states with a degree of specificity that exceeds a given threshold f0. Then, I+ in (1) is a test statistic for testing whether an algorithm has a much larger probability of reaching the set of highly specified states compared to a random search. This is a test of FT, since reaching the target corresponds to specificity (a), whereas reaching it with a much higher probability than expected by chance corresponds to (b).
To calculate I+, the distributions P0 and P of the random search algorithm under H0 and H1, respectively, need to be defined. As mentioned above, the null distribution P0 is typically chosen according to some criterion, such as maximizing entropy, possibly with some extra constraints on moments for unbounded Ω, which was the strategy implemented in [26,27]. Another possibility is to choose P0 as the equilibrium distribution of a Markov chain that models the dynamics of the system under the null hypothesis of no external input. In general, P0 = P0ξt involves a number of nuisance parameters ξ and, sometimes, also the time point t at which an algorithm that does not make use of external information stops. The choice of P = Pθξt is problem specific, and it possibly involves the nuisance parameters ξ of the null distribution, the time point t when the algorithm stops, as well as the tuning parameters θ that correspond to infusing background knowledge into the search problem. Therefore, in its most general form, the actinfo (1) is a function I+ = I+(θ, ξ, t) of the tuning parameters θ, the nuisance parameters ξ, and the time point t.
This general framework has many applications based on different choices of f, A, P 0 , and P. For some models, f is a binary function that quantifies functionality, so that A is the set of objects of a certain type (e.g., universes, proteins, protein complexes, or cellular networks) that are functional or permit life, among the set Ω of all such objects.
Another possibility is to choose A as the set of populations x whose (expected) fitness f ( x ) exceeds a given threshold. In this setting, P 0 ξ t ( A ) corresponds to the probability that a randomly chosen population would evolve and reach target A of high fitness at time t, given that no background knowledge of the specificity function f is used to generate X, so that natural selection does not occur. The functional information I f 0 = log P 0 ξ t ( A ) corresponds to the amount of external information that an evolutionary algorithm infuses under H 1 , given that it brings about X so that A happens with certainty ( P ( A ) = 1 ) within time t. In this case, the population is finely tuned when I f 0 is large enough. More generally, we say that an evolutionary algorithm that generates X P = P θ ξ t after t time steps is finely tuned when I + ( θ , ξ , t ) is large enough. Typically, θ involves the selection parameters that determine to which extent a population evolves towards higher fitness.
A third possibility is to choose f(X) = X as the test score of a randomly chosen student, whereas A = [f0, ∞) is the set of results of those students who pass the test with a score of at least f0. Assume that f(X) ∼ N(ξ, 1) for a randomly chosen student who did not prepare for the test (H0), whereas f(X) ∼ N(ξ + θt, 1) for a randomly chosen student who prepared for the test for a period of length t (H1). Then, P0(A) = P0ξ(A) = 1 − Φ(f0 − ξ), whereas P(A) = Pθξt(A) = 1 − Φ(f0 − ξ − θt), where Φ is the cumulative distribution function of a standard normal distribution. In particular, the tuning parameter θ > 0 corresponds to the amount of knowledge that a student is expected to generate per unit time of study.
The unified treatment of search problems and FT of this paper is organized as follows: Section 2 introduces the specification function f and the set A of highly specified states. Section 3 introduces a class of probability distributions P = Pθ for which the specificity function f is used to exponentially tilt the null distribution P0, so that outcomes with high specificity are more likely to occur, and with a scalar tuning parameter θ of Pθ that corresponds to the amount of exponential tilting. A proof is presented that it is possible to obtain a Metropolis–Hastings type Markov chain in discrete time t = 0, 1, 2, …, whose outcome X = X_t at time t has the aforementioned exponentially tilted distribution under equilibrium, that is, when t is large. The corresponding actinfo I+(θ, t) is shown to increase monotonically with t towards an equilibrium limit. The actinfo of a search algorithm X = X_{t∧T} that stops at time T, when the targeted set A of highly specified states has been reached, is also shown to increase more rapidly. Section 4 introduces various nonparametric and parametric estimators of actinfo, and corresponding tests of FT, when n repeated and independent outputs of the search algorithm are available. In particular, large deviations theory is used to prove that the significance level of these tests, i.e., the probability of detecting FT under H0, goes to zero at an exponential rate as the sample size n increases. Section 5 presents a number of examples from cosmology, student learning, reinforcement learning, and population genetics that illustrate our approach. A discussion in Section 6 follows, whereas the proofs and further details about the models are presented in Section 7.

2. Specificity and Target

Consider a function f : Ω → ℝ and assume that the objective of the search algorithm, or the agent that brings about X, is to find regions in Ω where f is large. The rationale for this is an independent specification, where a more specified state x ∈ Ω corresponds to a larger f(x). It is further assumed that the target set in (1) is given by
$$A = \{x \in \Omega;\; f(x) \ge f(x_0)\}$$
for some x0 ∈ Ω. This implies that the purpose of the search algorithm or the agent is to bring about an X that is at least as specified as x0. We will refer to f as a specificity function of the agent or an objective function of the search algorithm.
Several examples of specificity functions are provided in Section 5. For instance, Example 2 deals with student learning. For a special case of this model, f ( x ) = x represents the test score of a student, whereas x 0 is a reference value that corresponds to the minimum score needed to pass the test.
For cosmological FT (Example 1), x is the value of a particular constant of nature and the specificity function equals
$$f(x) = 1\{x \in A\},$$
where 1{·} is the indicator function. That is, f has a binary range, with f(x) = 1 and f(x) = 0 corresponding to whether or not x permits a universe with life; in particular, x0 is a universe that permits life. It follows that A is the LPI of this constant. Moreover, X is the value of this constant of nature for a randomly generated universe, with a distribution that either incorporates external information (H1) or not (H0).
In the context of proteins, x is taken to be an amino acid sequence, whereas f(x) in (6) quantifies whether the protein that the sequence corresponds to is functional (1) or not (0). For instance, X could be the outcome of a random evolutionary process, the goal of which is to generate a functioning protein, and this process either makes use of external information (H1) or not (H0). In a more refined example (Example 4), x is a molecular machine that consists of a possibly large number of proteins (or parts), and f(x) is (a monotone function of) the fitness of x.

Interpretation of Target

There are at least two ways of interpreting x0, and hence also the target set A. According to the first interpretation, x0 is the outcome of a random variable X′ ∈ Ω; that is, the outcome of a first search. Suppose that X is another random variable that represents a second (possibly future) search, independent of X′. Then, if we condition on the outcome x0 of the first search, the actinfo I+ in (1) is the log likelihood ratio for the event that the second search variable X is at least as specified as the observed value f(x0) of the first search.
There is, however, no need to associate x0 in (5) with a first search variable X′. Instead, some a priori information may be used to define which values of f represent a high amount of specificity. This gives rise to the second interpretation of x0, according to which x0 is used for defining outcomes with a high and low degree of specificity, using f0 = f(x0) as a cutoff. According to this interpretation, the two sets A in (5) and its complement
$$A^c = \Omega \setminus A = \{x;\; f(x) < f(x_0)\}$$
represent a dichotomization of specificity, so that A and A^c consist of all states with high and low specificity, respectively. With this interpretation of x0, I+ is the log likelihood ratio for testing FT based on the search variable X. In particular, suppose that the specificity function f is bounded, i.e.,
$$f_{\max} = \max_{x \in \Omega} f(x) < \infty.$$
Then, the most stringent definition of high specificity,
$$f_0 = f_{\max},$$
only regards outcomes with a maximal value of f as highly specified, so that
$$A = \Omega_{\max} = \{x \in \Omega;\; f(x) = f_{\max}\}.$$
Note that (6) is a special case of (9).

3. Active Information for Exponentially Tilted Systems

Throughout Section 3, ξ is assumed to be known and the null distribution does not involve any time index t. Therefore, P 0 is known, whereas P = P θ t involves the tuning parameters θ and the time index t. It will be further assumed in Section 3.1 and Section 3.2 that the system is in equilibrium, or that the time index t is fixed, so that t can also be dropped under H 1 ( P = P θ ).

3.1. Exponential Tilting

Let Pθ be an exponentially tilted version of P0 for some scalar tuning parameter θ > 0, which will also be called a tilting parameter. Exponential tilting is often used for rare-event simulation [31,32]. Here, f is used to define the tilted version of P0 as
$$P_\theta(x) = \frac{e^{\theta f(x)}}{M(\theta)} P_0(x),$$
with
$$M(\theta) = \sum_{x \in \Omega} e^{\theta f(x)} P_0(x)$$
a normalizing constant assuring that Pθ is a probability measure. For countable sample spaces Ω, we interpret P0(x) and Pθ(x) as probability masses, whereas for continuous sample spaces, they are probability densities and the sum in (11) is replaced by an integral. The larger the tilting parameter θ > 0 is, the more the probability mass of Pθ concentrates on regions of large f. In particular, P∞, the weak limit of Pθ as θ → ∞, is supported on (9) whenever (7) holds.
The parametric family
$$\mathcal{P} = \{P_\theta;\; \theta \ge 0\}$$
of distributions is an exponential family [33] (Section 1.5), and each Pθ ∈ 𝒫 gives rise to a separate version of actinfo. This is summarized in the following proposition (cf. Section 7 for a proof):
Proposition 1.
Suppose the target set A is defined as in (5) for some x0 ∈ Ω such that P0(A) > 0. Then, Pθ(A) is a strictly increasing function of θ ≥ 0 with P∞(A) = 1. Consequently, the actinfo
$$I^+(\theta) = \log \frac{P_\theta(A)}{P_0(A)}$$
is a strictly increasing function of θ ≥ 0, with I+(0) = 0 and I+(∞) = I_f0 = −log P0(A).
The intuitive interpretation of Proposition 1 is that the larger θ is, the more problem-specific knowledge is infused into P θ in terms of shifting probability mass towards regions in Ω where f, the specificity function, is large.
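The following minimal sketch illustrates Proposition 1 numerically, using a randomly generated specificity function on a small finite sample space (an illustrative choice, not a model from the paper):

```python
import numpy as np

# Exponential tilting (10)-(11) on a finite sample space: check numerically
# that P_theta(A) and I+(theta) in (13) increase with theta (Proposition 1).
rng = np.random.default_rng(1)
f = rng.uniform(size=50)            # specificity of each of 50 states (illustrative)
P0 = np.full(50, 1 / 50)            # uniform null distribution
A = f >= np.quantile(f, 0.9)        # target A: the most specified states

def tilted(theta):
    w = np.exp(theta * f) * P0      # e^{theta f(x)} P0(x)
    return w / w.sum()              # divide by M(theta)

for theta in [0.0, 1.0, 5.0, 20.0]:
    P_theta_A = tilted(theta)[A].sum()
    I_plus = np.log(P_theta_A / P0[A].sum())
    print(f"theta = {theta:5.1f}: P_theta(A) = {P_theta_A:.3f}, I+ = {I_plus:.3f}")
```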
A simple instance of exponential tilting is the student learning example of Section 1.2. Recall that f(x) = x is the test score of a student, with X ∼ N(ξ, 1) for a randomly chosen student who did not prepare for the test (H0), whereas X ∼ N(ξ + θ, 1) is the test score of a randomly chosen student who prepared for the test during t = 1 units of time (H1). It is clear that
$$P_0(x) = e^{-(x-\xi)^2/2}/\sqrt{2\pi}, \qquad P_\theta(x) = e^{-(x-\xi-\theta)^2/2}/\sqrt{2\pi} = P_0(x)\, e^{\theta x}/M(\theta).$$

3.2. Metropolis–Hastings Systems with Exponential Tilting Equilibrium

Inspired by Markov chain Monte Carlo methods [34], consider a Markov chain X0, X1, … ∈ Ω for which Pθ is the equilibrium distribution. Consequently, if P = Pθ (that is, under the alternative hypothesis H1 in (3) when θ > 0), X = X_t may be interpreted as the outcome of an algorithm after t iterations, provided that t is so large that equilibrium has been reached. The assumption is made that this algorithm knows f and tries to explore the whole state space Ω. If the Markov chain has the equilibrium distribution (10), this corresponds to an algorithm that favors jumps towards regions of large f when θ > 0, an effect that is accentuated the larger θ is. In further detail, the transition kernel of the chain is an instance of the well-known Metropolis–Hastings (MH) algorithm [35,36], which is closely related to simulated annealing [37]. This kernel has a probability or density
$$\pi_\theta(x, y) = r_\theta(x)\,\delta(x, y) + \alpha_\theta(x, y)\, q(x, y)$$
for jumps from x to y, where δ(x, ·) is a point mass at x ∈ Ω, q(x, ·) is a proposal distribution of jumps from a current position x of the Markov chain,
$$\alpha_\theta(x, y) = \min\left\{1,\; \frac{e^{\theta f(y)} P_0(y)\, q(y, x)}{e^{\theta f(x)} P_0(x)\, q(x, y)}\right\}$$
is the probability of accepting a proposed move from x to y, whereas
$$r_\theta(x) = 1 - \sum_{y \in \Omega} \alpha_\theta(x, y)\, q(x, y)$$
is the probability that the Markov chain rejects a proposed move away from x (for continuous sample spaces, q(x, ·) is a probability density and the sum in (16) is replaced by an integral). The transition of the Markov chain from X_t = x to the next state X_{t+1} is described in two steps as follows. First, a candidate Y ∼ q(x, ·) is proposed. Then, in the second step, this candidate is either accepted with probability αθ(x, Y), so that X_{t+1} = Y, or it is rejected with probability 1 − αθ(x, Y), so that X_{t+1} = X_t. It is well known that Pθ is the equilibrium distribution of this Markov chain whenever it is irreducible; that is, provided the proposal distribution q is defined in such a way that moving between any pair of states in Ω in a finite number of steps is possible [38] (pp. 243–245).
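A minimal simulation sketch of this kernel is given below, under the additional assumptions of a uniform null distribution and a symmetric nearest-neighbour proposal on a ring of states (both choices made here only for illustration; the paper leaves q problem specific). The empirical distribution of the chain should then approach the tilted distribution (10).

```python
import numpy as np

# Minimal Metropolis-Hastings sketch of the kernel (14)-(16) on a finite
# state space with uniform P0 and a symmetric proposal, so that P0 and q
# cancel in the acceptance probability (15).
rng = np.random.default_rng(2)
K, theta = 30, 4.0
f = np.linspace(0.0, 1.0, K)             # specificity, increasing in the state

def mh_step(x):
    y = (x + rng.choice([-1, 1])) % K    # proposal Y ~ q(x, .)
    alpha = min(1.0, np.exp(theta * (f[y] - f[x])))   # acceptance probability (15)
    return y if rng.uniform() < alpha else x

x, counts = 0, np.zeros(K)
for _ in range(200_000):
    x = mh_step(x)
    counts[x] += 1

P_theta = np.exp(theta * f) / np.exp(theta * f).sum()  # tilted target (10)
diff = np.abs(counts / counts.sum() - P_theta).max()
print(f"max |empirical - P_theta| = {diff:.4f}")
```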
In particular, if q is symmetric and P0 is uniform, then a proposed upward move with f(Y) > f(x) and Pθ(Y) > Pθ(x) is always accepted, whereas a proposed downward move with f(Y) < f(x) is accepted with probability Pθ(Y)/Pθ(x). The Markov chain only makes local jumps if q(x, ·) puts all its probability mass in a small neighborhood of x, for any x ∈ Ω. At the other extreme is a chain with the global proposal distribution q(x, ·) = Pθ for any x ∈ Ω; all proposed jumps of this chain are then accepted (αθ(x, y) = 1), and {X_t}_{t=1}^∞ is a sequence of independent and identically distributed (i.i.d.) random variables with X_t ∼ Pθ.
The choice of proposal distribution q is problem specific. In this section, we defined q for Metropolis–Hastings type algorithms that require knowledge of the specificity function f, since the acceptance probability (15) is a function of f. Proposed moves also occur for evolutionary algorithms (Examples 4 and 5 of Section 5). The outcomes of these algorithms are typically the result of many small changes, with specificity corresponding to functionality or fitness. The proposed moves are local mutations that either survive (are accepted) or do not. Other algorithms (such as reinforcement learning in Example 3 of Section 5) only make use of estimates of the specificity function. However, it is still meaningful for these algorithms to talk about proposed moves that are initially large (exploration phase), followed by a subsequent period of small or no moves (exploitation phase). In the context of Metropolis–Hastings algorithms, this is the strategy of simulated annealing, where large moves are initially proposed (corresponding to high temperatures), followed by subsequent small proposed moves (corresponding to low temperatures).

3.3. Active Information for Metropolis–Hastings Systems in Non-Equilibrium

Suppose, for simplicity, that the sample space Ω is finite, and that the states in Ω are listed in some order. Let
$$P_0 = (P_0(x);\; x \in \Omega)$$
be a row vector of length | Ω | with all the null distribution probabilities, and let
$$\Pi_\theta = \left(\pi_\theta(x, y);\; x, y \in \Omega\right)$$
be a square matrix of order |Ω| that defines the transition kernel of the Markov chain {X_t}_{t=0}^∞ of Section 3.2. If X_0 ∼ P0, then by the Chapman–Kolmogorov equation, X_t ∼ Pθt, where
$$(P_{\theta t}(x);\; x \in \Omega) = P_{\theta t} = P_0 \Pi_\theta^t.$$
Hence, if P = P θ t , then X = X t corresponds to observing the Markov chain at time t, under the alternative hypothesis H 1 in (3). Some basic properties of the corresponding actinfo are summarized in the following proposition, which is proved in Section 7:
Proposition 2.
Suppose that X = X t is obtained by iterating t times a Markov chain with initial distribution (17) and transition kernel (18). The actinfo then equals
$$I^+(\theta, t) = \log \frac{P_{\theta t}(A)}{P_0(A)} = \log \frac{P_0 \Pi_\theta^t\, v}{P_0\, v},$$
where v is a column vector of length |Ω| with ones in positions x ∈ A and zeros in positions x ∈ A^c. In particular, I+(θ, 0) = 0 and
$$\lim_{t \to \infty} I^+(\theta, t) = I^+(\theta).$$
Therefore, I + ( θ , t ) > 0 corresponds to knowledge of f being used to generate t jumps of the Markov chain, under the alternative hypothesis H 1 in (3).
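Since Ω is finite, (20) can be evaluated directly by matrix multiplication. The sketch below does this for a small illustrative chain (states on a ring, uniform P0, symmetric nearest-neighbour proposal — the same assumptions as in the sketch of Section 3.2), showing how I+(θ, t) grows from 0 towards its equilibrium value:

```python
import numpy as np

# Non-equilibrium actinfo (20): build the full MH transition matrix Pi_theta
# and track I+(theta, t) over time; by Proposition 2 it starts at 0 and
# increases towards I+(theta).
K, theta = 10, 3.0
f = np.arange(K) / (K - 1)
P0 = np.full(K, 1.0 / K)
v = (f >= 0.8).astype(float)               # indicator column vector of A

Pi = np.zeros((K, K))
for x in range(K):
    for y in ((x - 1) % K, (x + 1) % K):   # symmetric nearest-neighbour proposal
        Pi[x, y] += 0.5 * min(1.0, np.exp(theta * (f[y] - f[x])))  # alpha * q
    Pi[x, x] += 1.0 - Pi[x].sum()          # rejection mass r_theta(x)

p = P0.copy()
for t in range(51):
    if t % 10 == 0:
        print(f"t = {t:2d}: I+(theta, t) = {np.log(p @ v / (P0 @ v)):.3f}")
    p = p @ Pi                             # Chapman-Kolmogorov step
```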

3.4. Active Information for Metropolis–Hastings Systems with Stopping

In Section 3.3, P = Pθt was obtained by starting a random search with null distribution P0, and then iterating the Markov chain of Section 3.2 t times. However, knowledge of f can be utilized even more, by stopping the Markov chain if the target A in (5) is reached before time t. This can be formalized by introducing the stopping time
$$T = \min\{t \ge 0;\; X_t \in A\}$$
and letting
$$P^s_{\theta t}(x) = P(X_{t \wedge T} = x)$$
be the probability distribution of the stopped Markov chain X_{t∧T}, with the superscript s in (23) referring to stopping. In particular,
$$P^s_{\theta t}(A) = \sum_{x \in A} P^s_{\theta t}(x) = P(T \le t)$$
is the probability of reaching the target A for the first time after t iterations or earlier. The theory of phase-type distributions can then be used to compute the target probability P^s_θt(A) in (24) [39,40]. To this end, clump all states x ∈ A into one absorbing state, and decompose the transition kernel in (18) according to
$$\Pi_\theta = \begin{pmatrix} \Pi_\theta^{\mathrm{na}} & \Pi_\theta^{\mathrm{na,a}} \\ 0 & 1 \end{pmatrix},$$
where Π_θ^na is a square matrix of order |A^c| containing the transition probabilities between all non-absorbing states in A^c, whereas Π_θ^na,a is a column vector of length |A^c| with transition probabilities π(x, A) from all the non-absorbing states x ∈ A^c into the absorbing state A. Moreover, P_0^na = (P0(x); x ∈ A^c) is a row vector of length |A^c| that is the restriction of the starting distribution P0 in (17) to all non-absorbing states. Then,
$$P^s_{\theta t}(A) = 1 - P_0^{\mathrm{na}}\, (\Pi_\theta^{\mathrm{na}})^t\, 1,$$
where 1 is a column vector of | A c | ones.
The actinfo I s + of a search procedure with stopping is thus defined:
Proposition 3.
Suppose that X = X_t is obtained by iterating a Markov chain with initial distribution (17) and transition kernel (18) (for some θ ≥ 0) at most t times, stopping whenever the set A is reached. Then, the actinfo is given by
$$I_s^+(\theta, t) = \log \frac{P^s_{\theta t}(A)}{P_0(A)} = \log \frac{1 - P_0^{\mathrm{na}}\, (\Pi_\theta^{\mathrm{na}})^t\, 1}{P_0\, v},$$
with P 0 and v as in Proposition 2, whereas P 0 na , Π θ na , and 1 are defined below (25) and (26). This actinfo satisfies
$$I_s^+(\theta, t) \ge I^+(\theta, t)$$
and I s + ( θ , t ) is a non-decreasing function of t such that
$$\lim_{t \to \infty} I_s^+(\theta, t) = I_{f_0}$$
and
$$\sum_{t=0}^{\infty} \left(1 - P_0(A)\, e^{I_s^+(\theta, t)}\right) = E(T).$$
Proposition 3 is proven in Section 7. Inequality (28) states that, for a search procedure with t iterations, knowledge about f that is used for stopping the Markov chain in (18) will increase the actinfo, regardless of whether knowledge about f was used (θ > 0) or not (θ = 0) when iterating the Markov chain. Equation (29) is a consequence of the fact that the target A is eventually reached with probability 1, so that the actinfo of a search procedure with stopping equals the functional information I_f0 = −log P0(A) after many iterations of the Markov chain. Moreover, Equation (30) tells us that the rate at which P0(A) e^{I_s^+(θ,t)} approaches 1 is determined by the expected waiting time E(T) of reaching the target.
From Proposition 3, actinfo for a system with stopping is closely related to the phase-type distribution of the waiting time T until the target is reached. This has been studied in [41], in the context of the expression of a number of genes, with x being the collection of the regulatory regions of all these genes.
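For a small finite chain, (26), (27) and (30) can be computed directly from the phase-type decomposition, as in the following sketch (the chain construction repeats the illustrative example used after Proposition 2):

```python
import numpy as np

# Proposition 3 in practice: phase-type decomposition (25) of a small MH
# chain, the stopped target probability (26), the actinfo with stopping (27),
# and E(T) consistent with (30).
K, theta = 10, 3.0
f = np.arange(K) / (K - 1)
P0 = np.full(K, 1.0 / K)
A = f >= 0.8

Pi = np.zeros((K, K))
for x in range(K):
    for y in ((x - 1) % K, (x + 1) % K):
        Pi[x, y] += 0.5 * min(1.0, np.exp(theta * (f[y] - f[x])))
    Pi[x, x] += 1.0 - Pi[x].sum()

na = ~A                                    # non-absorbing states A^c
Pi_na = Pi[np.ix_(na, na)]                 # Pi_theta^na in (25)
P0_na, ones = P0[na], np.ones(na.sum())
P0_A = P0[A].sum()

for t in [1, 5, 10, 20]:
    P_st = 1.0 - P0_na @ np.linalg.matrix_power(Pi_na, t) @ ones   # Eq. (26)
    print(f"t = {t:2d}: I_s+(theta, t) = {np.log(P_st / P0_A):.3f}")

# E(T) = sum_t P(T > t) = P0^na (I - Pi^na)^{-1} 1, consistent with (30)
ET = P0_na @ np.linalg.solve(np.eye(na.sum()) - Pi_na, ones)
print(f"E(T) = {ET:.2f}")
```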

4. Estimating Active Information and Testing Fine-Tuning

In Section 3, we gave explicit expressions for the actinfo of Metropolis–Hastings algorithms with a scalar tuning parameter θ. In general, however, it might be infeasible to calculate I+, either because the sample space is very large, or because the nuisance parameters ξ and/or the tuning parameters θ are unknown. It is then of interest to consider ways of estimating I+ from data, for instance through Monte Carlo-based methods. To this end, we will assume that the random search algorithm is repeated independently, under the same conditions, n times. For instance, suppose that {X_{it}}_{t=0}^∞ corresponds to independent realizations i = 1, …, n of a search algorithm. If these independent realizations are recorded or stopped at one single time point, the outcome is either X_i = X_{it} for i = 1, …, n, or X_i = X_{i,t∧T_i} for i = 1, …, n, depending on whether the search algorithm is stopped at a fixed time point t or at random time points {T_i}_{i=1}^n. In either case, a sample of i.i.d. random variables
$$X_1, \ldots, X_n \sim Q$$
is obtained. These repeated outcomes of the search algorithm will be used to test for and estimate the degree of fine-tuning. The methodology depends on whether the null distribution P0 is known or involves unknown nuisance parameters.

4.1. Null Distribution Known

Suppose the null distribution P 0 is known. The sample in (31) is then used for testing between the two hypotheses
$$H_0: Q = P_0, \qquad H_1: Q \in \mathcal{P}_1,$$
with
$$\mathcal{P}_1 = \{P;\; P(A) \ge p_{\min}\}$$
the set of distributions that correspond to fine-tuning. Suppose an estimate Q̂(A) of the probability that X ∈ A is computed from data (31), with an associated empirical actinfo
$$\hat{I}^+ = \hat{I}_n^+ = \log \frac{\hat{Q}(A)}{P_0(A)}.$$
If Q ^ ( A ) is a consistent estimator of Q ( A ) , then for large sample sizes, I ^ + will be close to
$$I_Q^+ = \log \frac{Q(A)}{P_0(A)},$$
which equals 0 under H0 and I+ = I_P+ under H1, for some particular P ∈ 𝒫_1. To test H0 against H1,
$$\text{reject } H_0 \text{ when } \hat{I}^+ \ge I_{\min},$$
where I_min is a pre-specified lower bound on the range of values of the actinfo that corresponds to FT.

4.1.1. Nonparametric Estimator and Test

In Section 3, P = Pθ, P = Pθt, or P = P^s_θt were used for distributions that make use of pre-specified knowledge. These distributions involve the tilting parameter θ, and possibly also the number of iterations t of the algorithm and a stopping time T. In this section, however, no assumption other than P ∈ 𝒫_1 is made on P, and a nonparametric version of the empirical actinfo is used. The fraction
$$\hat{Q}(A) = \frac{1}{n} \sum_{i=1}^{n} 1\{X_i \in A\}$$
of random searches that fall into A is used as an estimate of Q(A). Therefore, (37) only requires knowledge of the set A, not of the function f.
The following result establishes the asymptotic normality of the nonparametric version of the estimator I ^ + in (34). Moreover, large deviations [42] are used to show that the significance level of the nonparametric version of the FT test (36) goes to zero exponentially fast with n (see Section 7 for more details of the proof).
Proposition 4.
Suppose the empirical actinfo I ^ n + in (34) is computed nonparametrically, using (37) as an estimate of the target probability Q ( A ) . Then, I ^ n + is an asymptotically normal estimator of I Q + in (35), in the sense that
$$\sqrt{n}\,(\hat{I}_n^+ - I_Q^+) \xrightarrow{\;\mathcal{L}\;} N(0, V) \quad \text{as } n \to \infty,$$
where →ℒ refers to convergence in distribution, and
$$V = \frac{1 - Q(A)}{Q(A)}$$
is the variance of the limiting normal distribution. The significance level of the test (36) for fine-tuning, with threshold I min , satisfies
$$\lim_{n \to \infty} \frac{\log P_{H_0}(\hat{I}_n^+ \ge I_{\min})}{n} = -C,$$
where
$$C = p_{\min} \log \frac{p_{\min}}{P_0(A)} + (1 - p_{\min}) \log \frac{1 - p_{\min}}{1 - P_0(A)}$$
is the Kullback–Leibler divergence between Bernoulli distributions with success probabilities p min = P 0 ( A ) exp ( I min ) and P 0 ( A ) , respectively.
Remark 1.
The conclusion of Proposition 4 is that the probability of observing an actinfo that corresponds to fine-tuning by chance decays at rate e^{−Cn} when the sample size n becomes large.
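The sketch below illustrates Proposition 4 by simulating the nonparametric test under H0 for a hypothetical choice of P0(A) and I_min; note that the agreement with e^{−Cn} is only in the exponential rate (41), so the match is rough for moderate n.

```python
import numpy as np

# Nonparametric test (36)-(37) under H0: the number of hits on A is
# Binomial(n, P0(A)), so the significance level can be simulated directly.
# P0(A) and I_min are illustrative choices.
rng = np.random.default_rng(3)
P0_A, I_min = 0.05, np.log(2.0)
p_min = P0_A * np.exp(I_min)                 # = 0.10
C = p_min * np.log(p_min / P0_A) + (1 - p_min) * np.log((1 - p_min) / (1 - P0_A))

for n in [25, 50, 100, 200]:
    Q_hat = rng.binomial(n, P0_A, size=100_000) / n       # Q_hat(A) under H0
    with np.errstate(divide="ignore"):
        I_hat = np.log(Q_hat / P0_A)                      # empirical actinfo (34)
    level = np.mean(I_hat >= I_min)                       # P_H0(I_hat >= I_min)
    print(f"n = {n:3d}: level = {level:.4f}, exp(-C n) = {np.exp(-C * n):.4f}")
```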

4.1.2. Parametric Estimator and Test

Suppose that there is a priori knowledge that P is close to the parametric exponential family 𝒫 of distributions in (10)–(12) for some value θ > 0 of the tilting parameter. A parametric test of actinfo is then naturally defined. For this, first compute the maximum likelihood estimate
$$\hat{\theta} = \hat{\theta}_n = \arg\max_{\theta \ge 0} \sum_{i=1}^{n} \log P_\theta(X_i)$$
of θ , and use it to define a parametric estimate
$$\hat{Q}(A) = P_{\hat{\theta}}(A)$$
of the target probability Q(A), which is inserted into (34) to define a parametric version of the empirical actinfo Î+. As opposed to (37), the estimate (43) requires full knowledge of f.
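As an illustration, in the Gaussian student example of Section 3.1 the MLE (42) has the closed form θ̂ = max(0, X̄ − ξ), so the parametric and nonparametric estimates of Q(A) can be compared directly (all sample numbers below are illustrative):

```python
import numpy as np
from math import erf, sqrt

# Parametric estimate (42)-(43) in the Gaussian student example, where
# P_theta is N(xi + theta, 1); Phi is the standard normal distribution function.
def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(4)
xi, theta_true, f0, n = 0.0, 0.8, 2.0, 200
X = rng.normal(xi + theta_true, 1.0, size=n)   # n searches, assumed from P_theta

theta_hat = max(0.0, X.mean() - xi)            # maximum likelihood estimate (42)
Q_par = 1.0 - Phi(f0 - xi - theta_hat)         # parametric estimate (43) of Q(A)
Q_nonpar = np.mean(X >= f0)                    # nonparametric estimate (37)
P0_A = 1.0 - Phi(f0 - xi)

print(f"theta_hat = {theta_hat:.3f}")
print(f"parametric    I+ = {np.log(Q_par / P0_A):.3f}")
print(f"nonparametric I+ = {np.log(Q_nonpar / P0_A):.3f}")
```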
To analyze the properties of the estimator (34) and test (36), introduce
$$\theta^* = \arg\min_{\theta \ge 0} D_{KL}(Q \,\|\, P_\theta),$$
where
$$D_{KL}(Q \,\|\, P_\theta) = \sum_{x \in \Omega} Q(x) \log \frac{Q(x)}{P_\theta(x)}$$
is the Kullback–Leibler divergence between Q and Pθ. From (44), P_θ* is the distribution in 𝒫 that best approximates Q. In particular, θ* = θ if Q ∈ 𝒫 and Q = Pθ for some θ ≥ 0.
The following proposition shows that Î+ is an asymptotically normal estimator of I+(θ*) in (13), which differs from I_Q+ in (35) whenever Q ∉ 𝒫. Moreover, the proposition also provides large sample properties of the significance level of the test for actinfo (cf. Section 7 for details of the proof):
Proposition 5.
Suppose the empirical actinfo I ^ n + in (34) is computed parametrically, using an estimate (43) of the target probability Q ( A ) . Then, I ^ n + is an asymptotically normal estimator of I + ( θ * ) , in the sense that
$$\sqrt{n}\left(\hat{I}_n^+ - I^+(\theta^*)\right) \xrightarrow{\;\mathcal{L}\;} N(0, V) \quad \text{as } n \to \infty,$$
where the variance of the limiting normal distribution is given by
$$V = \frac{\mathrm{Cov}^2_{P_{\theta^*}}\!\big(f(X),\, I(f(X) \ge f_0)\big)\, \mathrm{Var}_Q[f(X)]}{P^2_{\theta^*}(A)\, \mathrm{Var}^2_{P_{\theta^*}}[f(X)]}.$$
Moreover, the significance level of the parametric test for fine-tuning, based on (36) and (43), satisfies
$$\lim_{n \to \infty} \frac{\log P_{H_0}\left(\hat{I}_n^+ \ge I_{\min}\right)}{n} = -C,$$
for
$$C = \sup_{\phi > 0}\left\{\phi\, E_{P_{\min}}[f(X)] - \log M(\phi)\right\},$$
where P_min = P_θmin, with θmin < θ* the solution of P_θmin(A) = p_min = P0(A) exp(I_min), M(φ) is given by (11), and p_min is defined in (33).

4.1.3. Comparison between Nonparametric and Parametric Estimates of Actinfo

The two versions of empirical actinfo are complementary. The nonparametric version is preferable in the sense that it makes fewer assumptions about the distribution P of the random algorithm under H1, and in particular, it is a consistent estimator of I_Q+ in (35). The parametric version of Î+, on the other hand, is preferable when nQ(A) is small, since it makes use of all data in order to estimate Q(A), although it is not a consistent estimator of I_Q+ when Q ∉ 𝒫. The asymptotic variances in (39) and (47), as well as the rates of exponential significance level decrease in (41) and (49), agree when Q = P_θ* and f(x) = f0 · 1{x ∈ A}, which is a special case of (8).

4.2. Null Distribution Unknown

Suppose that the null distribution P0 = P0ξ involves an unknown nuisance parameter ξ ∈ Ξ. The objective is then to test the two hypotheses
$$H_0: Q \in \mathcal{P}_0, \qquad H_1: Q \in \mathcal{P}_1,$$
where the sets of distributions under the null and alternative hypotheses equal
$$\mathcal{P}_0 = \{P_{0\xi};\; \xi \in \Xi\}$$
and (33), respectively.

4.2.1. One Sample Available

The actinfo
$$I_Q^+ = I_Q^+(\xi) = \log \frac{Q(A)}{P_{0\xi}(A)}$$
cannot be consistently estimated if only one sample (31) is available. The best course of action is thus to estimate a lower bound
$$\hat{I}^+ = \hat{I}_n^+ = \log \frac{\hat{Q}(A)}{P_0^{\max}(A)}$$
of I + , with P 0 max ( A ) defined in (4) and Q ^ ( A ) an estimate of Q ( A ) . This estimator will have an asymptotic bias
$$B = I_Q^+(\xi^*) - I_Q^+ = \log \frac{P_{0\xi}(A)}{P_0^{\max}(A)} \le 0,$$
where ξ * is the nuisance parameter that maximizes P 0 ξ ( A ) [43]. For the numerator of (53), either the nonparametric estimate of Q ( A ) in (37) can be used, or a parametric class
$$\mathcal{P} = \{P_{\theta\xi};\; \theta \in \Theta,\; \xi \in \Xi\}$$
of distributions can be used that involves a tuning parameter vector θ and a vector of nuisance parameters ξ. If Q is thought to be close to 𝒫, the parametric estimate
$$\hat{Q}(A) = P_{\hat{\theta}\hat{\xi}}(A)$$
of Q ( A ) is used, which generalizes (43), with
$$(\hat{\theta}, \hat{\xi}) = \arg\max_{\theta, \xi} \sum_{i=1}^{n} \log P_{\theta\xi}(X_i).$$
When the sample size n tends towards infinity, the estimator (56) will converge to
$$(\theta^*, \xi^*) = \arg\min_{\theta, \xi} D_{KL}(Q \,\|\, P_{\theta\xi}).$$
The following result is an extension of Propositions 4 and 5, when nuisance parameters ξ are added and a general type of tuning parameter θ (not necessarily a scalar tilting parameter) is used. A short proof of the proposition is offered in Section 7.
Proposition 6.
Suppose that the null distribution P 0 = P 0 ξ involves an unknown parameter ξ and the actinfo I Q + in (52) is estimated by I ^ n + in (53), using an estimator Q ^ ( A ) of the target probability Q ( A ) that is either nonparametric (37) or parametric (55). Given these assumptions, I ^ n + is an asymptotically normal estimator, in the sense that
$$\sqrt{n}\,(\hat{I}_n^+ - I_Q^+ - B) \xrightarrow{\;\mathcal{L}\;} N(0, V) \quad \text{as } n \to \infty.$$
The asymptotic bias B in (58) is defined in (54). The asymptotic variance V is given by (39) for the nonparametric estimator of I_Q+, whereas
$$V = E[\psi_{\theta^*\xi^*}(X) \mid X \in A]\; E[\psi'_{\theta^*\xi^*}(X)]^{-1}\, E[\psi^T_{\theta^*\xi^*}(X)\, \psi_{\theta^*\xi^*}(X)]\; E[(\psi'_{\theta^*\xi^*})^T(X)]^{-1}\; E[\psi_{\theta^*\xi^*}(X) \mid X \in A]^T$$
for the parametric estimator of I_Q+, with ψθξ(x) = d log Pθξ(x)/d(θ, ξ) the score function, ψ′θξ(x) its derivative with respect to (θ, ξ), (θ*, ξ*) defined as in (57), and T referring to matrix transposition. Moreover, the significance level of the test (36) of FT, with threshold I_min, satisfies
$$\lim_{n \to \infty} \frac{\log P_{0\xi}\left(\hat{I}_n^+ \ge I_{\min}\right)}{n} = -C,$$
with
$$C = p_{\min} e^{B} \log \frac{p_{\min} e^{B}}{P_{0\xi}(A)} + (1 - p_{\min} e^{B}) \log \frac{1 - p_{\min} e^{B}}{1 - P_{0\xi}(A)}$$
for the nonparametric version of the test, with p_min = P0ξ(A) exp(I_min). For the parametric version of the FT test, in the special case when θ is a scalar exponential tilting parameter, C is given by (49), with P_min = P_θmin,ξ and θmin the solution of P_θmin,ξ(A) = p_min e^B.
Remark 2.
The negative bias term B makes the test of FT in Proposition 6 more conservative than the tests in Propositions 4 and 5. This can be seen, for instance, by comparing the two large deviation rates C in (41) and (61). The rate in (61) is larger, since p_min is multiplied by a factor e^B. This corresponds to the fact that it is more difficult to falsely reject H0 in Proposition 6.

4.2.2. Two Samples Available

In addition to the first sample (31), suppose a second sample
$$X_{01}, \ldots, X_{0 n_0} \sim P_{0\xi}$$
of n 0 i.i.d. observations under the null distribution is available. A consistent estimator
$$\hat{I}^+ = \hat{I}_{n n_0}^+ = \log \frac{\hat{Q}(A)}{P_{0\hat{\xi}}(A)}$$
of I Q + in (52) is then available, with
$$\hat{\xi} = \arg\max_{\xi} \sum_{i=1}^{n_0} \log P_{0\xi}(X_{0i}).$$
The following result provides asymptotic properties of the estimator (63) of actinfo, and the corresponding test (36) of FT with threshold I min (cf. Section 7 for a proof):
Proposition 7.
Suppose that the null distribution P0 = P0ξ involves an unknown nuisance parameter ξ, and that the active information I_Q+ in (52) is estimated by Î_nn0+ in (63), making use of two samples (31) and (62), of sizes n and n0, from Q and P0ξ, respectively. Further assume that the estimator Q̂(A) of Q(A) is either nonparametric (37) or parametric (55). If n, n0 → ∞ in such a way that
$$\frac{n}{n_0} \to \lambda > 0,$$
then
$$\sqrt{n}\,(\hat{I}_{n n_0}^+ - I_Q^+) \xrightarrow{\;\mathcal{L}\;} N(0, V_1 + \lambda V_2),$$
where
$$V_2 = E[\psi_{\xi}(X) \mid X \in A]\; E[\psi_{\xi}^T(X)\, \psi_{\xi}(X)]^{-1}\; E[\psi_{\xi}(X) \mid X \in A]^T,$$
and ψξ(x) = d log P0ξ(x)/dξ. If the nonparametric estimator of Q(A) is used, then V_1 equals V in (39), whereas if the parametric estimator of Q(A) is used, then V_1 equals V in (59). The significance level of the test (36) of FT, with threshold I_min, satisfies the same type of large deviation result (60) as in Proposition 6, for the nonparametric and parametric versions of the test (in the latter case assuming that θ is a scalar tilting parameter), but with the bias term B = 0 in the definitions of the nonparametric and parametric large deviation rates C.

5. Examples

In this section, we provide five examples. The first example, from cosmology, is a continuation of Section 1.1, with specificity corresponding to a universe that permits life. The second example of student learning was introduced in Section 1.2, with specificity being the test score of a student who prepares for a test. The third example concerns reinforcement learning, with specificity being the cumulative reward of a certain trajectory of actions and environments. The last two examples concern evolutionary algorithms for generating molecular machines, with specificity corresponding to the functionality or fitness of these machines. These evolutionary algorithms can be viewed as extensions or variants of the Metropolis–Hastings algorithms of Section 3.2, where proposed moves correspond to mutations, whereas accepted moves correspond to mutations that survive and then possibly spread to a whole population.
Example 1
(Cosmology [26,27]). Suppose that there is a positive constant of nature X ∈ Ω = ℝ+, a life-permitting interval A ⊂ Ω, and a specificity function (6) that equals 1 inside A = (a, b) and zero elsewhere. The maximum entropy distribution under a first moment constraint ξ = E(X) is exponential with expected value ξ. Consequently,
$$P_{0\xi}(A) = \frac{1}{\xi} \int_a^b e^{-x/\xi}\, dx.$$
The null and alternative hypotheses for the fine-tuning test are given in (50), where under H1, the agent brings about a life-permitting value of X with probability 1 (P(A) = 1). Only one universe is observed, with a value X = X_1 of the constant. Therefore, there is a sample (31) of size n = 1, whereas no null sample (62) is available. Since X_1 ∈ A is life-permitting, Q̂(A) = 1. The estimate (53) of actinfo then simplifies to
$$\hat{I}^+ = \log \frac{1}{P_0^{\max}(A)} = -\log P_0^{\max}(A).$$
Let x = (a + b)/2 be the midpoint of the LPI and suppose that half of its relative size, ε = (b − a)/(2x), is small. The probability in (68) is then approximated by
$$P_0^{\max}(A) \approx (b - a) \max_{\xi > 0} \frac{e^{-x/\xi}}{\xi} = 2\epsilon e^{-1}.$$
From (68), the estimated actinfo
$$\hat{I}^+ \approx 1 - \log(\epsilon) - \log(2)$$
is a monotone decreasing function of ϵ .
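The following sketch checks this approximation numerically for a hypothetical LPI (the interval and all numbers are invented for illustration):

```python
import numpy as np

# Example 1 numerically: P_{0 xi}(A) = exp(-a/xi) - exp(-b/xi) is maximized
# over the nuisance parameter xi on a grid, and the resulting actinfo bound is
# compared with the small-epsilon approximation I+ ~ 1 - log(eps) - log(2).
a, b = 9.9, 10.1                                   # hypothetical LPI
eps = (b - a) / (a + b)                            # (b - a)/(2 x), x the midpoint

xis = np.linspace(0.01, 100.0, 1_000_000)
P0_xi_A = np.exp(-a / xis) - np.exp(-b / xis)      # exact P_{0 xi}(A)
print(f"exact:  I+ = {-np.log(P0_xi_A.max()):.4f}"
      f"  (maximizing xi ~ {xis[P0_xi_A.argmax()]:.2f})")
print(f"approx: I+ = {1 - np.log(eps) - np.log(2):.4f}")
```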
Example 2
(Evaluation of student test scores [44]). As a generalization of the example given in Section 1.2, suppose that a number of students take a test. Let x = (z, y) = (z_1, …, z_{d−1}, y) ∈ ℝ^d summarize the characteristics of a student, with covariates z that are used to predict the outcome y of the test. The specificity function f(x) = x_d = y equals the student’s test score, and (5) corresponds to the set of students that pass the test, with a minimally allowed score of f0. The population of students follows a (d − 1)-dimensional multivariate normal distribution Z ∼ N(m, Σ), where m = (m_1, …, m_{d−1}) and Σ = (σ_jk)_{j,k=1}^{d−1} are known. The conditional distribution of the response follows a multiple linear regression model
$$Y \mid Z = z \sim N\left(\xi_0 + \sum_{j=1}^{d-1} \xi_j z_j + t\Big(\theta_0 + \sum_{j=1}^{d-1} \theta_j z_j\Big),\; \sigma^2\right),$$
for a student with covariate vector z who prepared for the test for a period of length t. The nuisance parameter vector ξ = (ξ_0, …, ξ_{d−1}, σ²) involves the error variance and the regression parameters for students who did not train for the test, whereas the tuning parameter vector θ = (θ_0, …, θ_{d−1}) involves the regression parameters that correspond to the effect of preparing for the test. The unconditional distribution of the response is normal, Y ∼ N(μ, V), with
$$\mu = \mu(\theta, \xi, t) = (\xi_0 + t\theta_0) + \sum_{j=1}^{d-1} (\xi_j + t\theta_j)\, m_j, \qquad V = V(\theta, \xi, t) = \sigma^2 + \sum_{j,k=1}^{d-1} (\xi_j + t\theta_j)(\xi_k + t\theta_k)\, \sigma_{jk}.$$
Therefore, the probability that a randomly chosen student who studied for the test for a period of length t passes is
$$P(A) = P_{\theta\xi t}(A) = P(Y \ge f_0) = 1 - \Phi\left(\frac{f_0 - \mu}{\sqrt{V}}\right),$$
where Φ is the cumulative distribution function of a standard normal distribution. The null distribution P 0 = P 0 ξ corresponds to putting t = 0 in (69). Thus, the actinfo
$$I^+ = I^+(\theta, \xi, t) = \log \frac{1 - \Phi\left((f_0 - \mu(\theta, \xi, t))/\sqrt{V(\theta, \xi, t)}\right)}{1 - \Phi\left((f_0 - \mu(0, \xi, 0))/\sqrt{V(0, \xi, 0)}\right)}$$
quantifies how much learning, during a period of length t, increases the probability of passing the test. To compute an estimate I ^ + of I + in (70), estimates ξ ^ and θ ^ of ξ and θ are needed. This can be achieved by collecting two training samples, as in (63). Another option is to compute the least squares estimates ( ξ ^ , θ ^ ) of the nuisance and the tuning parameters jointly, without bias, from one single dataset { ( t i , z i , y i ) } i = 1 n , provided that the time periods t i vary, so that all parameters are identifiable.
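A numerical sketch of (70) for a single covariate (d = 2) and illustrative parameter values is given below; it shows how the actinfo grows with the study time t:

```python
import numpy as np
from math import erf, sqrt

# Actinfo (70) for d = 2: one covariate with mean m1 and variance s11; all
# parameter values are illustrative. t = 0 recovers the null passing
# probability, so I+(theta, xi, 0) = 0.
m1, s11 = 0.0, 1.0                  # covariate mean and variance
xi0, xi1, sigma2 = 0.5, 0.3, 1.0    # nuisance parameters (no studying)
th0, th1 = 0.4, 0.1                 # tuning parameters (effect of studying)
f0 = 2.0                            # minimum passing score

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pass_prob(t):
    mu = (xi0 + t * th0) + (xi1 + t * th1) * m1          # mean in (70)
    V = sigma2 + (xi1 + t * th1) ** 2 * s11              # variance in (70)
    return 1.0 - Phi((f0 - mu) / sqrt(V))

for t in [0, 1, 2, 4]:
    print(f"t = {t}: P(pass) = {pass_prob(t):.3f}, "
          f"I+ = {np.log(pass_prob(t) / pass_prob(0)):.3f}")
```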
Example 3
(Reinforcement learning (RL) [45]). Consider an agent whose purpose is to maximize the reward f(x) of a trajectory x that they, to some extent, will be able to control, for a time period of length t. At each time point u, there are m possible environments S = {s_1, …, s_m} and q possible actions A = {a_1, …, a_q} to take. The state space X = A^t × S^{t+1} consists of all possible trajectories
$$x = (a_0, \ldots, a_{t-1}, s_0, \ldots, s_t)$$
of environments and actions, where s u is the environment and a u the action taken at time u. A corresponding random trajectory is denoted with capital letters
$$X = (A_0, \ldots, A_{t-1}, S_0, \ldots, S_t).$$
If the environment of the system is S_u = s at time u, and action A_u = a is taken, the probability of moving to environment s′ is P_a(s, s′) = P(S_{u+1} = s′ | S_u = s, A_u = a), with an instantaneous reward of R_a(s, s′). If future rewards are discounted by a factor γ, the total reward, over a time horizon of length t, is
$$f(x) = \sum_{u=0}^{t-1} R_{a_u}(s_u, s_{u+1})\, \gamma^u.$$
Let f0 be a lower bound for a trajectory’s total discounted reward to be acceptable, so that A in (5) is the set of all acceptable trajectories. The agent takes actions according to some policy to make the expected total reward of a trajectory as large as possible. To this end, consider stationary policies, where the action A_u taken by the agent at each time point u is only determined by the current environment s_u, according to some matrix Π = (π(s, a); s ∈ S, a ∈ A) of transition probabilities π(s, a) = P(A_u = a | S_u = s). For a completely random policy
$$\pi(s, a) = \xi_a, \qquad a = 1, \ldots, q,$$
the action is not influenced by the current environment, and the policy is completely specified by the vector ξ = (ξ_1, …, ξ_q) of nuisance parameters. Thus, P0(A) = P0ξt(f(X) ≥ f0) is the probability that an ignorant agent, with a policy determined by ξ, will have an acceptable trajectory. An agent who knows the reward function R_a and the dynamics P_a of the environment will try to take this knowledge into account to formulate a policy that makes the reward as large as possible. A deterministic policy θ : S → A is a function that takes a unique action for each environment, so that
$$\pi(s, a) = 1\{a = \theta(s)\}.$$
Thus, P(A) = Pθt(f(X) ≥ f0) is the probability that an agent with deterministic policy θ obtains an acceptable trajectory. The active information
$$I^+ = I^+(\theta, \xi, t) = \log \frac{P_\theta\left(\sum_{u=0}^{t-1} R_{A_u}(S_u, S_{u+1})\, \gamma^u \ge f_0\right)}{P_{0\xi}\left(\sum_{u=0}^{t-1} R_{A_u}(S_u, S_{u+1})\, \gamma^u \ge f_0\right)}$$
quantifies, on a logarithmic scale, how much more likely it is for an agent with policy θ to obtain an acceptable trajectory, compared to an ignorant agent with policy ξ. The values of ξ and θ are varied during the exploration phase of RL, but they are assumed to be known during the exploitation phase. Suppose that we want to compute the actinfo (71) during the exploitation phase. Since P0(A) and P(A) are typically unknown, they have to be estimated by Monte Carlo. To this end, assume we have two samples (31) and (62), of n and n0 trajectories, available from Pθt and P0ξt, respectively. Then, Î+ in (63) can be used to estimate the actinfo (71).
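The following Monte Carlo sketch carries this out for a tiny two-environment, two-action MDP whose dynamics and rewards are invented for illustration; it compares a completely random policy with a deterministic one:

```python
import numpy as np

# Monte Carlo sketch of the actinfo (71). The dynamics P_a(s, s') and rewards
# R_a(s, s') are illustrative; action 0 has the larger expected reward, so the
# deterministic policy theta(s) = a_0 should beat the random policy.
rng = np.random.default_rng(5)
P_a = np.array([[[0.9, 0.1], [0.2, 0.8]],      # transition probs for action 0
                [[0.5, 0.5], [0.5, 0.5]]])     # ... and for action 1
R_a = np.array([[[1.0, 0.0], [0.0, 2.0]],      # rewards for action 0
                [[0.5, 0.5], [0.5, 0.5]]])     # ... and for action 1
gamma, t_len, f0, n = 0.9, 20, 7.0, 10_000

def total_rewards(policy, n_traj):
    """Simulate n_traj trajectories; policy[s] gives the action probabilities."""
    totals = np.zeros(n_traj)
    for i in range(n_traj):
        s = 0
        for u in range(t_len):
            a = rng.choice(2, p=policy[s])
            s_next = rng.choice(2, p=P_a[a, s])
            totals[i] += R_a[a, s, s_next] * gamma**u
            s = s_next
    return totals

random_policy = np.array([[0.5, 0.5], [0.5, 0.5]])   # xi_a = 1/2 (H0)
det_policy = np.array([[1.0, 0.0], [1.0, 0.0]])      # theta(s) = a_0 (H1)
P0_hat = np.mean(total_rewards(random_policy, n) >= f0)
P_hat = np.mean(total_rewards(det_policy, n) >= f0)
print(f"P0(A) ~ {P0_hat:.3f}, P(A) ~ {P_hat:.3f}, I+ ~ {np.log(P_hat / P0_hat):.3f}")
```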
Example 4
(Molecular machines and Moran models [15,30,41]). Suppose that Ω consists of all 2^d binary sequences x = (x_1, …, x_d) of length d, with a null distribution P0(x) that will be chosen below. The specificity function f is defined as
$$f(x) = \begin{cases} a|x|, & x \ne (1, \ldots, 1), \\ 1, & x = (1, \ldots, 1), \end{cases}$$
where |x| = Σ_{i=1}^d x_i and a ≤ 1/d is a fixed parameter. We regard x as a molecular machine with d parts, with x_i = 1 or 0 depending on whether part i functions or not. The specificity f(x) quantifies how well the machine works, for instance, its ability to regulate activity in vitro or in vivo in a living cell. It is assumed that f(x) is determined by the number |x| of functioning parts, with a maximal value f_max = f(1, …, 1) = 1. Using (8), the most stringent definition of high specificity, it follows that A = {(1, …, 1)} only contains one element, a molecular machine for which all parts are in shape. The parameter a is crucial. If 0 < a ≤ 1/d, a molecular machine works better the more parts are in shape. On the other hand, if a < 0, then a molecular machine with some parts in shape, but not all, functions worse the more parts are in shape, since all units must work in order for the whole machine to function, and there is a cost associated with carrying each part that is in shape, as long as the whole system does not function.
Each state x is interpreted as a population of N subjects, all having the same variant x of the molecular machine. With this interpretation, X = X_t is the outcome of a random evolutionary process where all subjects of the population, at any time point t, have the same state. However, this state may vary over time when all subjects of the population simultaneously experience the same change. The question of interest is whether this process can modify the population so that all its members have a functioning molecular machine. A transition of this process from x is caused by a mutation with distribution q(x, ·), where q(x, x) = 0. Suppose a mutation from x to y is possible, i.e., q(x, y) > 0. A mutation from x to y first occurs in one individual, and then it either (momentarily) dies out, with probability 1 − αθ(x, y), or it (momentarily) spreads to the whole population (becomes fixed), with probability
$$\alpha_\theta(x, y) = C \cdot \left(\frac{e^{\theta f(y)} P_0(y)\, q(y, x)}{e^{\theta f(x)} P_0(x)\, q(x, y)}\right)^{1/2},$$
where
$$C = \left(\max_{x, y} \frac{e^{\theta f(y)} P_0(y)\, q(y, x)}{e^{\theta f(x)} P_0(x)\, q(x, y)}\right)^{-1/2}$$
is a constant assuring that (73) never exceeds 1, where the maximum is taken over all x, y such that x ≠ y and both q(x, y) and q(y, x) are positive. The Markov chain with transition probabilities (14) and acceptance probability (73) represents the dynamics of the evolutionary process.
As shown in Section 7, the equilibrium distribution of this Markov chain is given by Pθ in (10). In particular, Propositions 2 and 3 remain valid when the Markov chain (14) with acceptance probabilities (73) is used, rather than (15). We will interpret
$$s(x) = e^{\theta f(x)/N}$$
as the selection coefficient or fitness of individuals with a molecular machine of type x, that is, s ( x ) is proportional to the fertility rate of individuals of type x.
The MH-type Markov chain with acceptance probabilities (73) and (74) represents an evolutionary process that closely resembles a Moran model with selection [46,47,48], which is frequently used for describing evolutionary processes (as can be seen in Section 7). The Moran model is a continuous time Markov chain for a population with overlapping generations, where individuals die at the same rate and are replaced by the offspring of individuals in the population proportionally to their selection coefficients s(x). New types arise when an offspring of a parent of type x mutates, with probability μ(x). If the mutation rate is small (μ(x) ≪ N^−1 for all x ∈ Ω), then to a good approximation the whole population will have the same type at any point in time, which is a so-called fixed state assumption.
Even though the Moran model is specified in continuous time, time can be discretized as t = 0, 1, 2, … by only recording the population when individuals die. If individuals die at rate 1, then the next individual dies at rate N, so that time is counted in units of N^−1 generations. The fixed state assumption is motivated by assuming that a newborn offspring with a new mutation either dies out or spreads to the whole population (becomes fixed in the population) right after birth. In this context, q corresponds to the way in which mutations change the type of the individual, whereas αθ = αθN is the probability of fixation. If q(x, y) is the conditional probability that an offspring of a type x parent mutates to y, given that a mutation occurs, then the proposal kernel of the Moran model is
$$q_{\mathrm{Moran}}(x, y) = \begin{cases} \mu(x)\, q(x, y), & x \ne y, \\ 1 - \mu(x), & x = y. \end{cases}$$
As shown in Section 7, the acceptance (or fixation) probability of the Moran model is
$$\alpha_{\theta N}^{\mathrm{Moran}}(x, y) \approx \frac{1}{N}\left(1 + \frac{\theta[f(y) - f(x)]}{2}\right) \approx \frac{1}{N}\left(\frac{e^{\theta f(y)}}{e^{\theta f(x)}}\right)^{1/2}$$
when θ[f(y) − f(x)] is small. From (76) and (77), the Moran model approximates the Metropolis–Hastings kernel with acceptance probabilities (73) and (74) with good accuracy when (i) μ(x) ≡ μ; (ii) P0 is uniform; and (iii) the proposal kernel q is symmetric (i.e., q(x, y) = q(y, x)), although the time scales of the two processes are different. More specifically, if (i)–(iii) hold, a time-shifted version of the Moran model approximates the MH-type model with acceptance probabilities (73) and (74), so that each time step of the MH-type Markov chain corresponds to C/μ generations of the Moran model. However, even under assumptions (i)–(iii), the stationary distribution of the Moran model differs slightly from Pθ.
The proposal kernel q(x, y) is assumed to be local and to satisfy
$$q(x, y) = \begin{cases} b/[|x| + b(d - |x|)], & y = x + e_j,\; x_j = 0, \\ 1/[|x| + b(d - |x|)], & y = x + e_j,\; x_j = 1, \\ 0, & \text{otherwise}, \end{cases}$$
where e_j = (0, …, 0, 1, 0, …, 0) is a row vector of length d with a 1 in position j ∈ {1, …, d} and zeros elsewhere, whereas x + e_j refers to component-wise addition modulo 2, corresponding to a switch of component j of x. A change of component j from 0 to 1 is caused by a beneficial mutation, whereas a change from 1 to 0 corresponds to a deleterious mutation. Consequently, b > 0 is the ratio between the rates at which beneficial and deleterious mutations occur.
The kernel q in (78) is symmetric only when beneficial and deleterious mutations have the same rate (b = 1). The more general case of asymmetric q is handled differently by the MH-type algorithm and the Moran model. Whereas the MH-type algorithm elevates the acceptance probability (73) of seldom-proposed states y (those y for which q(x, y) is small for many x), this is not the case for the acceptance probability (77) of the Moran model. To avoid these states y being reached too often by the MH-type algorithm, the null distribution P0 of no selection has to be chosen so that P0(y) is small for rarely proposed states (whereas the Moran model needs no such correction). Therefore, P0 in (73) will be chosen as the stationary distribution of a transition kernel (14) for which θ = 0 and all candidates are accepted (α_0(x, y) = 1). That is, if Π̃_0 refers to the transition matrix of such a Markov chain, the initial distribution P0 in (17) is chosen as the solution of
$$P_0 = P_0 \tilde\Pi_0, \qquad \sum_{x\in\Omega} P_0(x) = 1. \tag{79}$$
The null distribution $P_0 = P_{0b}$ in (79) involves a single nuisance parameter $\xi = b$. In the special case when beneficial and deleterious mutations have the same rate ($b = 1$), this procedure generates a uniform distribution $P_0(x) \equiv 2^{-d}$. On the other hand, states $x$ with many functioning parts are harder to reach by the Markov process $\tilde\Pi_0$ when beneficial mutations occur less frequently than deleterious ones ($0 < b < 1$), resulting in smaller values of $P_0(x)$. The distribution under the alternative hypothesis, $P = P_{\tilde\theta b t}$, involves the nuisance parameter $b$, the time point $t$ at which the state of the population is recorded, and $\tilde\theta = (a,\theta)$, the two parameters that determine how much background information the MH-type evolutionary algorithm makes use of. For simplicity, $a$ and $b$ are here regarded as constants, and we only include $\theta$ and $t$ in the notation. This gives rise to an active information
$$I^+(\theta,t) = \log \frac{P_\theta\!\left(X_t = (1,\ldots,1)\right)}{P_0\!\left(X_t = (1,\ldots,1)\right)}. \tag{80}$$
The MH-type algorithm is studied for $d = 5$ and illustrated in Figure 1, Figure 2 and Figure 3. Note that the functional information $I_{f_0}$ is a decreasing function of $b$, since it is more surprising to find a working molecular machine by chance when the rate $b$ of beneficial mutations is small. Moreover, the active information $I^+(\theta) = \lim_{t\to\infty} I^+(\theta,t)$ for the equilibrium distribution of the Markov chain, as well as the active information $I^+(\theta,t)$ and $I_s^+(\theta,t)$ for a system in non-equilibrium, without and with stopping, are increasing functions of $\theta$ and decreasing functions of $a$ and $b$. The smaller $a$ or $b$ is, the more external information can be infused to increase the probability of reaching the fine-tuned state $(1,\ldots,1)$ of a working molecular machine. When $a$ is small, leaving this state once it is reached becomes more difficult, and consequently $I_s^+(\theta,t)$ is only marginally larger than $I^+(\theta,t)$.
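To make the construction of Example 4 concrete, the following Python sketch builds the chain on $\{0,1\}^d$ from the proposal kernel (78), obtains the null distribution (79) as the stationary distribution of the chain that accepts all proposals, and then iterates an MH-type kernel with the square-root acceptance rule used in the equilibrium argument of Section 7 (Equation (109)). Two simplifying assumptions are ours, not the paper's: the specificity is taken to be $f(x) = |x|$, and the parameter $a$ of (74) is omitted; the constant $C$ is chosen so that all acceptance probabilities stay below 1.

```python
import itertools
import numpy as np

d = 5                        # number of machine components
theta, b = 2.5, 0.5          # tilting parameter and beneficial/deleterious rate ratio
states = [np.array(s) for s in itertools.product([0, 1], repeat=d)]
m = len(states)              # |Omega| = 2^d
f = np.array([s.sum() for s in states], dtype=float)  # assumed specificity: working parts

def q(x, y):
    """Proposal kernel (78): flip one component; 0->1 flips carry weight b."""
    if np.sum(x != y) != 1:
        return 0.0
    j = int(np.flatnonzero(x != y)[0])
    return (b if x[j] == 0 else 1.0) / (x.sum() + b * (d - x.sum()))

Q = np.array([[q(x, y) for y in states] for x in states])

# Null distribution (79): stationary distribution of the all-accepting chain.
w, V = np.linalg.eig(Q.T)
P0 = np.abs(np.real(V[:, np.argmax(np.real(w))]))
P0 /= P0.sum()

# Square-root acceptance probabilities as in (109); the parameter a of (74) is omitted.
with np.errstate(divide="ignore", invalid="ignore"):
    R = np.sqrt(np.exp(theta * (f[None, :] - f[:, None]))
                * (P0[None, :] * Q.T) / (P0[:, None] * Q))
R[~np.isfinite(R)] = 0.0
Pi = Q * (R / R.max())                        # C = 1/max(R) keeps probabilities <= 1
np.fill_diagonal(Pi, 1.0 - Pi.sum(axis=1))    # rejected proposals stay put

target = m - 1                                # index of the fine-tuned state (1,...,1)
Pt = P0.copy()
for t in range(1, 201):
    Pt = Pt @ Pi
    if t in (10, 50, 200):
        print(f"t = {t:3d}: I+(theta,t) = {np.log(Pt[target] / P0[target]):.3f}")
```

In line with the discussion above, raising $\theta$ (or $b$) in this sketch makes the fine-tuned state easier to reach, and $I^+(\theta,t)$ grows with $t$ towards its equilibrium value.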
Example 5
(Evolutionary programming algorithms). Suppose that $\Omega = \Omega_{\mathrm{ind}}^N$ is a set of genetic variants from some genomic region, $x = (x_1,\ldots,x_N)$, for the members of a population of size $N$. That is, $x_k \in \Omega_{\mathrm{ind}}$ is the variant of this genomic region for individual $k$. If, for instance, the region codes for the molecular machine of Example 4, we let $x_k = (x_{k1},\ldots,x_{kd}) \in \{0,1\}^d = \Omega_{\mathrm{ind}}$, with $x_{kj} = 1$ or 0 depending on whether component $j$ of this machine works for individual $k$ or not. Let $g(x_k)$ be the biological fitness, or expected number of offspring, of individual $k$. In the context of molecular machines, the logarithm of $g(x_k)$ could be a function of the number of functioning parts of a machine of type $x_k$. The specificity function of a population in state $x$ is the average fitness
$$f(x) = \frac{1}{N}\sum_{k=1}^N g(x_k)$$
of its individuals. The targeted set $A$ in (5) corresponds to all genetic profiles with an average fitness of at least $f_0$. This type of model is frequently used in genetic programming, as well as in other types of evolutionary programming algorithms, to mimic the evolution of $N$ individuals over time [49,50]. Typically, the output $X = X_t$ of the evolutionary algorithm is the last step of a simulation $\{X_s = (X_{s1},\ldots,X_{sN})\}_{s=0}^t$ of the population over $t$ generations. Once the distributions $P_0 = P_{0\xi t}$ and $P = P_{\theta\xi t}$ of $X$ are found under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, the actinfo $I^+$ can be computed according to (1). This actinfo quantifies, on a logarithmic scale, how much more likely it is for the average fitness of the population to exceed $f_0$ at time $t$ for a population with externally infused information ($H_1$), compared to an evolutionary process where no such external information is used ($H_0$). For instance, if a molecular machine needs all its parts in order to function ($g(x_k) = 1(|x_k| = d)$), then the actinfo at time $t$ equals
$$I^+ = I^+(\theta,\xi,t) = \log\frac{P_{\theta\xi t}\!\left(\left|\{k;\; 1\le k\le N,\; X_k = (1,\ldots,1)\}\right| \ge f_0 N\right)}{P_{0\xi t}\!\left(\left|\{k;\; 1\le k\le N,\; X_k = (1,\ldots,1)\}\right| \ge f_0 N\right)}, \tag{81}$$
with $X = (X_1,\ldots,X_N)$. Since the state space $\Omega$ is very large, it is often difficult to find explicit analytical expressions for the actinfo $I^+$ in (81). Suppose, however, that the nuisance parameters $\xi$ of the null distribution $P_0 = P_{0\xi}$ are known. This makes the framework of Section 4.1 applicable, running the evolutionary algorithm $n$ times. That is, $n$ i.i.d. copies $\{X_{is}\}_{s=0}^t$ of the population trajectory are generated up to time $t$ for $i = 1,\ldots,n$. Then, $X_i = X_{it} = (X_{it1},\ldots,X_{itN})$, $i = 1,\ldots,n$, are used for computing an estimate $\hat I_n^+$ of the actinfo and for testing fine-tuning, according to Section 4.1.
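As a brief illustration of this Monte Carlo strategy, the sketch below evaluates the nonparametric estimate $\hat I_n^+ = \log[\hat Q(A)/P_0(A)]$ of (34), based on the number of runs whose final population reached $A$, together with a Wald-type confidence interval using the asymptotic variance $V = [1 - Q(A)]/Q(A)$ from the proof of Proposition 4. The run counts and the null probability $P_0(A)$ are hypothetical numbers chosen for illustration only.

```python
import numpy as np
from scipy.stats import norm

def actinfo_estimate(hits, n, p0_A, alpha=0.05):
    """Nonparametric actinfo estimate (34) with a Wald confidence interval
    based on the asymptotic variance V = (1 - Q(A)) / Q(A) of Proposition 4."""
    q_hat = hits / n                   # empirical probability of reaching A
    i_hat = np.log(q_hat / p0_A)       # estimated active information
    v_hat = (1.0 - q_hat) / q_hat      # plug-in asymptotic variance
    half = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(v_hat / n)
    return i_hat, (i_hat - half, i_hat + half)

# Hypothetical numbers: 130 of n = 1000 runs reached A, with P0(A) = 0.02 assumed known.
est, ci = actinfo_estimate(hits=130, n=1000, p0_A=0.02)
print(f"I+ estimate {est:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```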
Recall the fixed state assumption of Example 4, whereby all individuals of the population have the same state at any point in time. Such an assumption is only realistic when $N\mu \ll 1$, that is, when the mutation rate $\mu$ and/or the population size $N$ is small. This corresponds to a scenario where $P_0$ and $P$ put all their probability mass along the diagonal
$$\Omega_{\mathrm{diag}} = \{x \in \Omega;\; x_1 = \cdots = x_N\} \tag{82}$$
of $\Omega$. Since (82) is equivalent to the reduced state space $\Omega_{\mathrm{ind}}$, the fixed state assumption greatly simplifies the analysis. For instance, it often makes it possible to find analytical expressions for the actinfo $I^+$, rather than having to estimate it.

6. Discussion

In this article, a general statistical framework is provided for using active information to quantify the amount of pre-specified external knowledge an algorithm makes use of, or equivalently, how tuned the algorithm is. The theory is based on quantifying, for each state $x$, how specified it is by means of a real-valued function $f(x)$. An algorithm with external information either makes direct use of knowledge of $f$, or at least incorporates knowledge that tends to move the output of the algorithm towards more specified regions. The Metropolis–Hastings Markov chain directly incorporates knowledge of $f$ in terms of the acceptance probability of proposed moves. The learning ability of this algorithm was analyzed by studying its active information, with or without stopping when the targeted set of highly specified states is reached. When independent outcomes of an algorithm are available, nonparametric and parametric estimators of its actinfo were also developed, together with nonparametric and parametric tests of FT.
This work can be extended in different ways. A first extension is to find conditions under which the actinfo $I^+(\theta,t)$ of a stochastic algorithm, based on a random start (according to the null distribution of a non-guided algorithm) followed by $t$ iterations of the Metropolis–Hastings Markov chain (without stopping), is a non-decreasing function of $t$. We conjecture that this is typically the case, but have not obtained any general conditions on the distribution $q$ of proposed candidates for this result to hold.
A second extension is to widen the notion of specificity, so that not only the functionality $f(x)$ but also the rarity $P_0(x)$ of the outcome $x$ under the null distribution is taken into account. A class of such specificity functions is
$$g_\theta(x) = \theta f(x) - \log P_0(x), \tag{83}$$
where $\theta > 0$ is a parameter that controls the tradeoff between scenarios where either functionality or rarity under the null is the most important determinant of specificity. The case $\theta = 0$ in (83) corresponds to the functionality having no impact, so that $g_0(x)$ reduces to Shannon's self-information of $x$. The case $g_1(x)$ was proposed in [15], whereas $g_\theta(x)$ is determined solely by $f(x)$ in the limit of large $\theta$.
A third extension is to generalize the notion of actinfo so that it includes not only the probability of reaching a targeted set $A$ of highly specified states under $H_0$ and $H_1$, but also accounts for the conditional distribution of the states within $A$, given that $A$ has been reached. This is related to the way in which functional sequence complexity generalizes functional information [51,52,53,54]. Let $H(Q) = -\sum_x Q(x)\log[Q(x)]$ refer to the Shannon entropy of a distribution $Q$, whereas $H(Q^A)$ is the Shannon entropy of the corresponding conditional distribution $Q^A(x) = Q(x \mid A)$, given that $A$ has been reached. The functional sequence complexity
$$\mathrm{FSC}_0 = H(P_0) - H(P_0^A) = E_{P_0}\!\left\{\log[P_0^A(X)] \,\middle|\, X\in A\right\} - E_{P_0}\!\left\{\log[P_0(X)]\right\}$$
is the reduction in entropy under the null hypothesis $H_0$, when passing from all states in $\Omega$ to the highly specified states in $A$. $\mathrm{FSC}_0$ reduces to the functional information $I_{f_0}$ when $P_0$ is uniform over $\Omega$. In a similar vein, the active uncertainty reduction is introduced:
$$\mathrm{UR}^+ = \sum_{x\in A} P^A(x)\log P(x) - \sum_{x\in A} P_0^A(x)\log P_0(x) = E_P[\log P(X) \mid X\in A] - E_{P_0}[\log P_0(X) \mid X\in A].$$
Then, $\mathrm{UR}^+ = I^+$ when $P_0^A$ and $P^A$ are uniformly distributed on $A$. This happens, for instance, when $P_0$ has a uniform distribution on $\Omega$, $P = P_\theta$ for some $\theta > 0$, and (8) holds. The properties of $\mathrm{UR}^+$ deserve to be analyzed in more detail, for instance, by investigating how it differs from the actinfo $I^+$.
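For intuition, the following small sketch evaluates $\mathrm{FSC}_0$, $\mathrm{UR}^+$ and $I^+$ on an invented four-state space. Since the conditional distributions $P_0^A$ and $P^A$ are not uniform here, $\mathrm{UR}^+$ and $I^+$ differ, illustrating the remark above.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

P0 = np.array([0.4, 0.3, 0.2, 0.1])          # assumed null distribution on 4 states
P  = np.array([0.1, 0.2, 0.3, 0.4])          # assumed tuned distribution
A  = np.array([False, False, True, True])    # targeted set of specified states

P0A = P0[A] / P0[A].sum()                    # conditional null distribution on A
PA  = P[A]  / P[A].sum()                     # conditional tuned distribution on A

FSC0   = entropy(P0) - entropy(P0A)          # functional sequence complexity
URplus = np.sum(PA * np.log(P[A])) - np.sum(P0A * np.log(P0[A]))
Iplus  = np.log(P[A].sum() / P0[A].sum())    # active information
print(f"FSC0 = {FSC0:.3f}, UR+ = {URplus:.3f}, I+ = {Iplus:.3f}")
```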
A fourth extension would be to apply the concept of actinfo to other genetic models. For instance, Example 4 is, to our knowledge, the first time that actinfo has been applied to the Moran model. In the past, however, actinfo was used to study fixation times for the Wright–Fisher model of population genetics, a model for which time is discrete and generations do not overlap [55].

7. Proofs

Proof of Proposition 1.
Introduce
$$J(\theta) = \sum_{x\in A^c} \exp\{\theta[f(x)-f(x_0)]\}\, P_0(x), \qquad K(\theta) = \sum_{x\in A} \exp\{\theta[f(x)-f(x_0)]\}\, P_0(x) \tag{84}$$
when $\Omega$ is countable, and replace the sums in (84) by integrals when $\Omega$ is continuous. Then
$$P_\theta(A) = \frac{\exp[\theta f(x_0)]\, K(\theta)}{\exp[\theta f(x_0)]\,[J(\theta)+K(\theta)]} = \frac{K(\theta)}{J(\theta)+K(\theta)} = \frac{1}{J(\theta)/K(\theta)+1}. \tag{85}$$
Since $P_0(A) < 1$, it follows that $J(\theta)$ is a strictly decreasing function of $\theta \ge 0$, whereas $K(\theta)$ is a non-decreasing function of $\theta$. From this, it follows that $P_\theta(A)$ is a strictly increasing function of $\theta$, and consequently $I^+(\theta) = \log[P_\theta(A)/P_0(A)]$ is a strictly increasing function of $\theta$ as well.
Moreover, $K(\theta) \ge P_0(A) > 0$ for all $\theta \ge 0$, whereas $J(\theta) \to 0$ as $\theta \to \infty$ follows by dominated convergence. In conjunction with (85), this implies that $P_\theta(A) \to 1$ and $I^+(\theta) \to I_{f_0}$ as $\theta \to \infty$. □
Proof of Proposition 2.
Equation (20) follows from (17), (19) and the fact that
$$P_0(A) = \sum_{x\in A} P_0(x) = P_0 v, \qquad P_{\theta t}(A) = \sum_{x\in A} P_{\theta t}(x) = P_{\theta t}\, v = P_0 \Pi_\theta^t\, v,$$
since $v$ is a column vector of length $|\Omega|$ with ones in positions $x \in A$ and zeros in positions $x \in A^c$.
Equation (21) is equivalent to proving that
$$P_{\theta t}(A) \to P_\theta(A) \quad \text{as } t\to\infty.$$
However, this follows from the fact that $P_\theta$ is the equilibrium distribution of the Markov chain with transition kernel (18). That is, letting $t\to\infty$ in (19), we find that
$$P_{\theta t} = P_0 \Pi_\theta^t \to P_\theta,$$
and therefore
$$P_{\theta t}(A) = P_{\theta t}\, v \to P_\theta\, v = P_\theta(A) \quad \text{as } t\to\infty. \;\square$$
Proof of Proposition 3.
Equation (28) follows from the definitions of $I^+(\theta,t)$ and $I_s^+(\theta,t)$ in (20) and (27), and the fact that
$$P_{\theta t}(A) = P(X_t \in A) \le P(X_{t\wedge T} \in A) = P_{\theta t}^{s}(A),$$
where the inequality is a consequence of the definition of $T$ in (22). Since
$$P_{\theta t}^{s}(A) = P(T \le t) \le P(T \le t+1) = P_{\theta,t+1}^{s}(A),$$
it follows that $I_s^+(\theta,t)$ is non-decreasing in $t$. Equation (29) follows from the definition of $I_s^+(\theta,t)$ and the fact that
$$\lim_{t\to\infty} P_{\theta t}^{s}(A) = P(T < \infty) = 1. \tag{86}$$
The last equality of (86) is a consequence of the fact that the Markov chain with transition kernel $\Pi_\theta$ is irreducible, so that any state $x\in\Omega$ is reached with probability 1. In particular, the targeted set $A$ is reached with probability 1. In order to verify (30), we first deduce
$$P(T > t) = 1 - P_0(A)\, e^{I_s^+(\theta,t)}$$
from (24), and then we make use of the equality
$$E(T) = \sum_{t=0}^\infty P(T > t). \;\square$$
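The tail-sum identity above is easy to verify numerically. The sketch below uses an invented three-state chain in which the targeted set $A$ consists of a single absorbing state, so that $P(X_t \in A) = P(T \le t)$, and compares the tail-sum evaluation of $E(T)$ with the classical fundamental-matrix formula for absorbing Markov chains.

```python
import numpy as np

# Invented three-state chain; the target A = {2} is absorbing, so that
# T is the hitting time of A and P(X_t = 2) = P(T <= t).
Pi = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.5, 0.3],
               [0.0, 0.0, 1.0]])
P0 = np.array([0.5, 0.4, 0.1])            # assumed initial distribution

# Tail-sum formula E(T) = sum_{t >= 0} P(T > t), truncated when the tail is tiny.
Pt, E_tail = P0.copy(), 0.0
for _ in range(100_000):
    tail = 1.0 - Pt[2]                    # P(T > t)
    if tail < 1e-15:
        break
    E_tail += tail
    Pt = Pt @ Pi

# Cross-check with the fundamental matrix (I - S)^(-1) of the transient states.
S = Pi[:2, :2]
E_fund = P0[:2] @ np.linalg.inv(np.eye(2) - S) @ np.ones(2)
print(E_tail, E_fund)                     # both approximately 4.571
```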
Proof of Proposition 4.
Since $n\hat Q(A) \sim \mathrm{Bin}(n, Q(A))$ has a binomial distribution, it follows from the central limit theorem that
$$\sqrt{n}\left(\hat Q(A) - Q(A)\right) \xrightarrow{L} N\!\left(0,\; Q(A)[1 - Q(A)]\right)$$
as $n\to\infty$. Notice that $\hat I^+ = g(\hat Q(A))$, where $g(Q) = \log[Q/P_0(A)]$ and $g'(Q) = 1/Q$. Equation (38) follows from the Delta method (see, e.g., Theorem 8.12 of [33]) and the fact that
$$V = g'(Q(A))^2 \cdot Q(A)[1 - Q(A)].$$
In order to establish (40), we first note that it follows from (34) and the definition of $p_{\min}$ that
$$P_{H_0}(\hat I^+ \ge I_{\min}) = P_{H_0}(\hat Q(A) \ge p_{\min}) = P_{H_0}\!\left(\frac{1}{n}\sum_{i=1}^n Y_i \ge p_{\min}\right),$$
where $Y_i = I(X_i \in A) \sim \mathrm{Be}(p_0)$ are independent Bernoulli variables under $H_0$, with success probability $p_0 = P_0(A)$. It follows from large deviations theory that (40) holds, with
$$C = \sup_{\phi > 0}\,[\phi\, p_{\min} - \lambda(\phi)] \tag{88}$$
the Legendre–Fenchel transformation, and
$$\lambda(\phi) = \log E[\exp(\phi Y)] = \log[1 + p_0(e^\phi - 1)] \tag{89}$$
the cumulant generating function of $Y$ [56], pp. 529–533. Inserting (89) into (88), it can be seen that the maximum in (88) is given by (41). □
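As a numerical illustration, the sketch below maximizes $\phi\, p_{\min} - \lambda(\phi)$ for invented values of $p_0 = P_0(A)$ and $p_{\min}$, and compares the result with the Kullback–Leibler divergence between $\mathrm{Be}(p_{\min})$ and $\mathrm{Be}(p_0)$, which is the well-known closed form of the Legendre–Fenchel transform of a Bernoulli cumulant generating function.

```python
import numpy as np
from scipy.optimize import minimize_scalar

p0, p_min = 0.02, 0.08     # assumed null probability P0(A) and threshold p_min > p0

def neg_rate(phi):
    lam = np.log(1.0 + p0 * (np.exp(phi) - 1.0))   # cumulant generating function (89)
    return -(phi * p_min - lam)

res = minimize_scalar(neg_rate, bounds=(1e-8, 50.0), method="bounded")
C_numeric = -res.fun

# Legendre-Fenchel transform of a Bernoulli cgf = KL divergence Be(p_min) || Be(p0).
C_kl = (p_min * np.log(p_min / p0)
        + (1.0 - p_min) * np.log((1.0 - p_min) / (1.0 - p0)))
print(C_numeric, C_kl)     # the two values agree up to numerical error
```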
Proof of Proposition 5.
In order to verify (46), we will first show that the estimator (42) of the tilting parameter $\theta$ is asymptotically normal,
$$\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{L} N(0, U) \quad \text{as } n\to\infty, \tag{90}$$
with asymptotic variance
$$U = \frac{\mathrm{Var}_Q[f(X)]}{\mathrm{Var}^2_{P_{\theta^*}}[f(X)]}. \tag{91}$$
To this end, let $'$ denote differentiation with respect to the tilting parameter $\theta$. Define the score function
$$\psi_\theta(x) = \frac{d \log P_\theta(x)}{d\theta} = \frac{P_\theta'(x)}{P_\theta(x)}$$
and its derivative
$$\psi_\theta'(x) = \frac{d\psi_\theta(x)}{d\theta}.$$
It is a standard result from the asymptotic theory of maximum likelihood estimation and M-estimation (see, e.g., Chapter 6 of [33]) that (90) holds, with asymptotic variance
$$U = \frac{\mathrm{Var}_Q[\psi_{\theta^*}(X)]}{E_Q^2[\psi_{\theta^*}'(X)]}. \tag{92}$$
To simplify (92), notice that the score function can be written as
$$\psi_\theta(x) = f(x) - \frac{M'(\theta)}{M(\theta)} = f(x) - E_{P_\theta}[f(X)] \tag{93}$$
for the exponential family of tilted distributions (10) and (11). From this, it follows that
$$\psi_\theta'(x) = -\left[\frac{M''(\theta)}{M(\theta)} - \left(\frac{M'(\theta)}{M(\theta)}\right)^2\right] = -\mathrm{Var}_{P_\theta}[f(X)]$$
is constant, not depending on $x$. Inserting the last two displayed equations into (92), the formula (91) for the asymptotic variance of $\hat\theta$ is obtained. As a next step, we notice that
$$\hat I^+ = g(\hat\theta),$$
where
$$g(\theta) = \log\frac{P_\theta(A)}{P_0(A)} = \log h(\theta) - \log P_0(A), \tag{95}$$
and
$$h(\theta) = P_\theta(A) = \int_A e^{\theta f(x)} P_0(x)\, dx \Big/ M(\theta) \tag{96}$$
follows from the definition of $P_\theta(x)$ in (10).
Differentiating (96) with respect to $\theta$, we find that
$$h'(\theta) = \int_A f(x)\, e^{\theta f(x)} P_0(x)\, dx \Big/ M(\theta) \;-\; M'(\theta) \int_A e^{\theta f(x)} P_0(x)\, dx \Big/ M^2(\theta). \tag{97}$$
Furthermore, it follows from the right-hand side of (97) that
$$h'(\theta) = E_{P_\theta}[f(X)\, I(f(X)\ge f_0)] - P_\theta(A)\, E_{P_\theta}[f(X)] = \mathrm{Cov}_{P_\theta}\!\left[f(X),\, I(f(X)\ge f_0)\right]. \tag{98}$$
Then, we combine (95) and (97), and obtain
$$g'(\theta) = \frac{h'(\theta)}{h(\theta)} = \frac{\mathrm{Cov}_{P_\theta}[f(X),\, I(f(X)\ge f_0)]}{P_\theta(A)}. \tag{99}$$
Finally, we use the Delta method to conclude that $\hat I^+$ is an asymptotically normal estimator (38) of $I^+(\theta^*)$, with asymptotic variance $V = g'(\theta^*)^2 U$, which, in view of (91) and (99), agrees with (47).
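A quick simulation can be used to check the asymptotic variance formula (91) in the correctly specified case $Q = P_{\theta^*}$, where $U = 1/\mathrm{Var}_{P_{\theta^*}}[f(X)]$. The four-state space, the specificity values and $\theta^*$ below are invented for illustration; $\hat\theta$ solves the moment equation implied by the score function (93).

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
f_vals = np.array([0.0, 1.0, 2.0, 3.0])   # assumed specificity values on 4 states
P0 = np.full(4, 0.25)                     # uniform null distribution
theta_star, n, reps = 1.2, 400, 2000

def p_theta(th):
    w = P0 * np.exp(th * f_vals)          # exponential tilting (10)
    return w / w.sum()

def mean_f(th):
    return np.sum(p_theta(th) * f_vals)

var_f = np.sum(p_theta(theta_star) * f_vals**2) - mean_f(theta_star)**2
U = 1.0 / var_f                           # (91) when the model is correctly specified

est = np.empty(reps)
for r in range(reps):
    x = rng.choice(4, size=n, p=p_theta(theta_star))
    fbar = f_vals[x].mean()
    # theta_hat solves the score equation (93): sample mean of f = E_{P_theta}[f(X)].
    est[r] = brentq(lambda th: mean_f(th) - fbar, -20.0, 20.0)

print(n * est.var(), U)                   # empirical vs. theoretical asymptotic variance
```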
In order to prove the large deviation result (48) for the parametric test of FT, let $\theta_{\min}$ be the value of the tilting parameter that satisfies $P_{\theta_{\min}}(A) = p_{\min} = P_0(A)\exp(I_{\min})$. Then, notice that
$$P_{H_0}(\hat I^+ \ge I_{\min}) = P_{H_0}(\hat Q(A) \ge p_{\min}) = P_{H_0}(\hat\theta \ge \theta_{\min}) = P_{H_0}\!\left(\sum_{i=1}^n \psi_{\theta_{\min}}(X_i)/n \ge 0\right) = P_{H_0}\!\left(\sum_{i=1}^n f(X_i)/n \ge E_{p_{\min}}[f(X)]\right),$$
where, in the third step, we utilized that $\hat\theta \ge \theta_{\min}$ is equivalent to the derivative of the log-likelihood of the data being non-negative at $\theta_{\min}$, and, in the fourth step, we made use of (93) and introduced $p_{\min} = P_{\theta_{\min}}$. The last line is a large deviations probability, and it follows from large deviations theory that (48) holds, with $C$ the Legendre–Fenchel transformation in (49). □
Proof of Proposition 6.
Since the bias-corrected empirical actinfo
$$\hat I_n^+ - B = \log\frac{\hat Q(A)}{P_{0\xi}(A)} \tag{100}$$
behaves like (34), with $P_0 = P_{0\xi}$, the asymptotic normality result for the nonparametric version of the estimator of $I_Q^+$ follows from Proposition 4.
For the parametric version of the estimator of $I_Q^+$, we will (briefly) generalize the asymptotic normality proof of Proposition 5. It follows from (53) and (55) that
$$\hat I_n^+ = g(\hat\theta, \hat\xi),$$
where
$$g(\theta,\xi) = \log\frac{P_{\theta\xi}(A)}{P_{0\max}(A)}. \tag{101}$$
Making use of the Delta method, it follows that the asymptotic variance of the parametric version of $\hat I_n^+$ equals
$$V = \nabla g(\theta^*, \xi^*)\, \mathrm{AsVar}(\hat\theta,\hat\xi)\, \nabla g(\theta^*,\xi^*)^T, \tag{102}$$
with the asymptotic variance of $(\hat\theta,\hat\xi)$ defined through
$$\sqrt{n}\left[(\hat\theta,\hat\xi) - (\theta^*,\xi^*)\right] \xrightarrow{L} N\!\left(0,\; \mathrm{AsVar}(\hat\theta,\hat\xi)\right) \tag{103}$$
as $n\to\infty$. Since $(\hat\theta,\hat\xi)$ in (56) is an M-estimator, it follows that its asymptotic variance equals
$$\mathrm{AsVar}(\hat\theta,\hat\xi) = E[\psi'_{\theta^*\xi^*}(X)]^{-1}\, E[\psi_{\theta^*\xi^*}^T(X)\, \psi_{\theta^*\xi^*}(X)]\, E[(\psi'_{\theta^*\xi^*})^T(X)]^{-1}. \tag{104}$$
The gradient of (101) is
$$\nabla g(\theta,\xi) = \frac{\nabla P_{\theta\xi}(A)}{P_{\theta\xi}(A)} = E[\psi_{\theta\xi}(X) \mid X\in A],$$
where $\psi_{\theta\xi}(x) = \nabla P_{\theta\xi}(x)/P_{\theta\xi}(x)$ is the likelihood score function for the combined parameter vector $(\theta,\xi)$. Putting things together, the asymptotic variance formula (59) for the parametric version of $\hat I_n^+$ follows from (102)–(104).
The significance level of the FT test can be written as
$$P_{0\xi}(\hat I_n^+ \ge I_{\min}) = P_{0\xi}(\hat I_n^+ - B \ge I_{\min} - B). \tag{105}$$
Since $p_{\min} = P_{0\xi}(A)\exp(I_{\min})$, we have that
$$I_{\min} - B = \log\frac{p_{\min}\, e^{-B}}{P_{0\xi}(A)}.$$
From this and (100), it follows that the nonparametric test of FT behaves as the corresponding nonparametric test of Proposition 4, with the null probability $P_0(A)$ replaced by $P_{0\xi}(A)$, and $p_{\min}$ replaced by $p_{\min} e^{-B}$. Therefore, the large deviation result (61) follows from (41). In a similar way, the large deviation result for the parametric version of the FT test (in the special case when $\theta$ is a scalar exponential tilting parameter) follows from (100), (105) and Proposition 5. □
Proof of Proposition 7.
Because of (52) and (63), we have that
$$\sqrt{n}\,(\hat I_{nn_0}^+ - I_Q^+) = \sqrt{n}\,\log\frac{\hat Q(A)}{Q(A)} - \sqrt{\frac{n}{n_0}}\cdot\sqrt{n_0}\,\log\frac{P_{0\hat\xi}(A)}{P_{0\xi}(A)}, \tag{106}$$
where
$$\sqrt{n}\,\log\frac{\hat Q(A)}{Q(A)} \xrightarrow{L} N(0, V_1) \quad \text{as } n\to\infty \tag{107}$$
and
$$\sqrt{n_0}\,\log\frac{P_{0\hat\xi}(A)}{P_{0\xi}(A)} \xrightarrow{L} N(0, V_2) \quad \text{as } n_0\to\infty, \tag{108}$$
respectively. It follows from the proofs of Propositions 4 and 5 that the asymptotic variance $V_1$ in (107) is the same as $V$ in (39) and (59), for the nonparametric and parametric versions of $\hat Q(A)$, respectively. The asymptotic variance $V_2$ in (108) is given by (67). This is proven using the Delta method (similarly as for Proposition 6), making use of the fact that $\hat\xi$ is the maximum likelihood estimator of $\xi$, with an asymptotic variance equal to the inverse $E[\psi_\xi^T(X)\psi_\xi(X)]^{-1}$ of the Fisher information matrix. The asymptotic normality result (66) then follows from (106)–(108), the fact that $n/n_0 \to \lambda$, and the independence of the two samples.
The large deviations results are proven in a similar way as in Proposition 6, replacing $P_{0\max}(A)$ by $P_{0\hat\xi}(A)$. Using the consistency $\hat\xi \xrightarrow{p} \xi$ as $n_0 \to \infty$, it follows that the large deviation rates $C$ of Proposition 7, for the nonparametric and parametric versions of the FT tests, are the same as in Proposition 6, with bias term $B = 0$. □
Details from Example 4. In order to prove that the Metropolis–Hastings-type Markov chain (14) with acceptance probabilities (73) has equilibrium distribution $P_\theta$, we first notice that, for any pair of states $x \neq y$, the flow of probability mass
$$P_\theta(x)\,\pi_\theta(x,y) = P_\theta(x)\, q(x,y)\, \alpha_\theta(x,y) = \frac{P_0(x)\, e^{\theta f(x)}}{M(\theta)}\, q(x,y) \cdot C\left[\frac{e^{\theta f(y)} P_0(y)\, q(y,x)}{e^{\theta f(x)} P_0(x)\, q(x,y)}\right]^{1/2} = \frac{C\left[e^{\theta f(x)} P_0(x)\, q(x,y)\; e^{\theta f(y)} P_0(y)\, q(y,x)\right]^{1/2}}{M(\theta)} \tag{109}$$
from $x$ to $y$ is symmetric in $x$ and $y$. Therefore, the flow $P_\theta(y)\pi_\theta(y,x)$ of probability mass in the opposite direction, from $y$ to $x$, is the same as in (109). A Markov chain with this property is called reversible [57], pp. 11–12, and it is well known that $P_\theta$ is a stationary distribution if the Markov chain is reversible with reversible measure $P_\theta$ [58], p. 238. If, additionally, the proposal distribution $q$ makes it possible to move between any pair of states in a finite number of steps, the Markov chain is irreducible, and hence $P_\theta$ is its unique stationary distribution, which is also the equilibrium distribution of the Markov chain [58], p. 232.
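This reversibility argument is easy to check numerically: for any $f$, $P_0$ and proposal kernel $q$, the flow matrix $P_\theta(x)\pi_\theta(x,y)$ built from the square-root acceptance rule in (109) should be symmetric. The sketch below does this on an invented six-state space, with $C$ chosen so that all acceptance probabilities stay below 1.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                                         # invented state space of 6 states
f = rng.uniform(size=m)                       # arbitrary specificity values
P0 = rng.uniform(size=m); P0 /= P0.sum()      # arbitrary null distribution
Q = rng.uniform(size=(m, m)); np.fill_diagonal(Q, 0.0)
Q /= Q.sum(axis=1, keepdims=True)             # arbitrary proposal kernel
theta = 1.7

# Square-root acceptance rule from (109), with C = 1/max(R) keeping alpha <= 1.
with np.errstate(divide="ignore", invalid="ignore"):
    R = np.sqrt(np.exp(theta * (f[None, :] - f[:, None]))
                * (P0[None, :] * Q.T) / (P0[:, None] * Q))
R[~np.isfinite(R)] = 0.0
Pi = Q * (R / R.max())
np.fill_diagonal(Pi, 1.0 - Pi.sum(axis=1))    # rejected proposals stay put

P_theta = P0 * np.exp(theta * f); P_theta /= P_theta.sum()
flow = P_theta[:, None] * Pi                  # flow(x, y) = P_theta(x) * pi_theta(x, y)
print(np.max(np.abs(flow - flow.T)))          # ~ 0 up to floating point rounding
```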
Next, we motivate formula (77) for the acceptance probability of the Moran model. Assume that the population evolves over time as a Moran model, and that all individuals have type $x$. If one individual mutates from $x$ to $y$, then, because of (75), the relative fitness between the $N-1$ individuals of type $x$ and the newly mutated individual of type $y$ is
$$s = \frac{e^{\theta f(y)/N}}{e^{\theta f(x)/N}} = e^{\theta[f(y)-f(x)]/N}. \tag{110}$$
From the theory of Moran models (e.g., [41,59]), it is well known that the fixation probability of the newly mutated individual is
$$\beta_N(s) = \begin{cases} (1 - s^{-1})/(1 - s^{-N}), & s \neq 1, \\ 1/N, & s = 1. \end{cases} \tag{111}$$
Inserting (110) into (111), we find (when $s \neq 1$, or equivalently when $\Delta = \theta[f(y)-f(x)] \neq 0$) that
$$\beta_N(s) = \frac{1 - e^{-\Delta/N}}{1 - e^{-\Delta}} \approx \frac{1}{N}\cdot\frac{\Delta}{1 - e^{-\Delta}} \approx \frac{1}{N}\left(1 + \frac{\Delta}{2}\right),$$
which is equivalent to (77). □
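As a sanity check of this approximation, the short sketch below compares the exact fixation probability (111), with $s$ taken from (110), to the approximation $(1 + \Delta/2)/N$ of (77) for a few invented values of $\Delta$; the agreement is close for small $\Delta$ and degrades as $\Delta$ grows.

```python
import numpy as np

N = 1000                                         # population size
for delta in (0.05, 0.2, 0.5):                   # invented values of theta*[f(y)-f(x)]
    s = np.exp(delta / N)                        # relative fitness (110)
    beta = (1.0 - 1.0 / s) / (1.0 - s ** (-N))   # exact fixation probability (111)
    approx = (1.0 + delta / 2.0) / N             # approximation (77)
    print(f"Delta = {delta}: exact {beta:.6e}, approximate {approx:.6e}")
```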

Author Contributions

D.A.D.-P. and O.H. contributed equally to all parts of the manuscript, including conceptualization, methodology, writing, review, and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors want to thank two anonymous reviewers for valuable comments that considerably improved the quality of the paper. SDG.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gödel, K. Über Formal Unentscheidbare Sätze der Principia Mathematica und Verwandter Systeme, I. Monatshefte Math. Phys. 1931, 38, 173–198.
2. Hofstadter, D.R. Gödel, Escher, Bach: An Eternal Golden Braid; Basic Books: New York, NY, USA, 1999.
3. Whitehead, A.N.; Russell, B. Principia Mathematica; Cambridge University Press: Cambridge, UK, 1927.
4. Wolpert, D.H.; MacReady, W.G. No Free Lunch Theorems for Search; Technical Report SFI-TR-95-02-010; Santa Fe Institute: Santa Fe, NM, USA, 1995.
5. Wolpert, D.H.; MacReady, W.G. No Free Lunch Theorems for Optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82.
6. Wolpert, D.H. What is important about the No Free Lunch theorems? In Black Box Optimization, Machine Learning and No-Free Lunch Theorems; Pardalos, P.M., Rasskazova, V., Vrahatis, M.N., Eds.; Springer: Berlin/Heidelberg, Germany, 2021.
7. Dembski, W.A.; Marks, R.J., II. Bernoulli's Principle of Insufficient Reason and Conservation of Information in Computer Search. In Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, TX, USA, 11–14 October 2009; pp. 2647–2652.
8. Dembski, W.A.; Marks, R.J., II. Conservation of Information in Search: Measuring the Cost of Success. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2009, 5, 1051–1061.
9. Hazen, R.M.; Griffin, P.L.; Carothers, J.M.; Szostak, J.W. Functional information and the emergence of biocomplexity. Proc. Natl. Acad. Sci. USA 2007, 104, 8574–8581.
10. Szostak, J.W. Functional information: Molecular messages. Nature 2003, 423, 689.
11. Díaz-Pachón, D.A.; Marks, R.J., II. Generalized active information: Extensions to unbounded domains. BIO-Complexity 2020, 2020, 1–6.
12. Díaz-Pachón, D.A.; Sáenz, J.P.; Rao, J.S.; Dazard, J.E. Mode hunting through active information. Appl. Stoch. Model. Bus. Ind. 2019, 35, 376–393.
13. Liu, T.; Díaz-Pachón, D.A.; Rao, J.S.; Dazard, J.E. High Dimensional Mode Hunting Using Pettiest Component Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2022, accepted.
14. Montañez, G.D. The famine of forte: Few search problems greatly favor your algorithm. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; pp. 477–482.
15. Montañez, G.D. A Unified Model of Complex Specified Information. BIO-Complexity 2018, 2018, 1–26.
16. Díaz-Pachón, D.A.; Sáenz, J.P.; Rao, J.S. Hypothesis testing with active information. Stat. Probab. Lett. 2020, 161, 108742.
17. Carter, B. Large Number Coincidences and the Anthropic Principle in Cosmology. In Confrontation of Cosmological Theories with Observational Data; Longhair, M.S., Ed.; D. Reidel: Dordrecht, The Netherlands, 1974; pp. 291–298.
18. Barrow, J.D.; Tipler, F.J. The Anthropic Cosmological Principle; Oxford University Press: Oxford, UK, 1988.
19. Davies, P. The Accidental Universe; Cambridge University Press: Cambridge, UK, 1982.
20. Lewis, G.F.; Barnes, L.A. A Fortunate Universe: Life In a Finely Tuned Cosmos; Cambridge University Press: Cambridge, UK, 2016.
21. Rees, M.J. Just Six Numbers: The Deep Forces That Shape The Universe; Basic Books: New York, NY, USA, 2000.
22. Adams, F.C. The degree of fine-tuning in our universe—and others. Phys. Rep. 2019, 807, 1–111.
23. Barnes, L.A. The Fine Tuning of the Universe for Intelligent Life. Publ. Astron. Soc. Aust. 2012, 29, 529–564.
24. Tegmark, M.; Rees, M.J. Why is the cosmic microwave background fluctuation level $10^{-5}$? Astrophys. J. 1998, 499, 526–532.
25. Tegmark, M.; Aguirre, A.; Rees, M.; Wilczek, F. Dimensionless constants, cosmology, and other dark matters. Phys. Rev. D 2006, 73, 023505.
26. Díaz-Pachón, D.A.; Hössjer, O.; Marks, R.J., II. Is Cosmological Tuning Fine or Coarse? J. Cosmol. Astropart. Phys. 2021, 2021, 020.
27. Díaz-Pachón, D.A.; Hössjer, O.; Marks, R.J., II. Sometimes size does not matter. Found. Phys. 2022, under revision.
28. Dingjan, T.; Futerman, A.H. The fine-tuning of cell membrane lipid bilayers accentuates their compositional complexity. BioEssays 2021, 43, e2100021.
29. Dingjan, T.; Futerman, A.H. The role of the 'sphingoid motif' in shaping the molecular interactions of sphingolipids in biomembranes. Biochim. Biophys. Acta BBA Biomembr. 2021, 1863, 183701.
30. Thorvaldsen, S.; Hössjer, O. Using statistical methods to model the fine-tuning of molecular machines and systems. J. Theor. Biol. 2020, 501, 110352.
31. Asmussen, S.; Glynn, P.W. Stochastic Simulation: Algorithms and Analysis; Springer: Berlin/Heidelberg, Germany, 2007.
32. Siegmund, D. Importance Sampling in the Monte Carlo Study of Sequential Tests. Ann. Stat. 1976, 4, 673–684.
33. Lehmann, E.L.; Casella, G. Theory of Point Estimation, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1998.
34. Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer: Berlin/Heidelberg, Germany, 2010.
35. Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57, 97–109.
36. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092.
37. Kirkpatrick, S.; Gelatt, C.D., Jr.; Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220, 671–680.
38. Ross, S. Introduction to Probability Models, 8th ed.; Academic Press: Cambridge, MA, USA, 2003.
39. Asmussen, S.; Nerman, O.; Olsson, M. Fitting Phase-type Distributions via the EM Algorithm. Scand. J. Stat. 1996, 23, 419–441.
40. Neuts, M.F. Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach; Johns Hopkins University Press: Hoboken, NJ, USA, 1981.
41. Hössjer, O.; Bechly, G.; Gauger, A. On the waiting time until coordinated mutations get fixed in regulatory sequences. J. Theor. Biol. 2021, 524, 110657.
42. Varadhan, S.R.S. Large Deviations and Applications; SIAM: Philadelphia, PA, USA, 1984.
43. Hössjer, O.; Díaz-Pachón, D.A.; Chen, Z.; Rao, J.S. Active information, missing data, and prevalence estimation. arXiv 2022, arXiv:2206.05120.
44. Hössjer, O.; Díaz-Pachón, D.A.; Rao, J.S. Active Information, Learning, and Knowledge Acquisition. PsyArXiv 2022.
45. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285.
46. Durrett, R. Probability Models for DNA Sequence Evolution; Springer: Berlin/Heidelberg, Germany, 2008.
47. Moran, P.A.P. Random processes in genetics. Math. Proc. Camb. Philos. Soc. 1958, 54, 60–71.
48. Moran, P.A.P. A general theory of the distribution of gene frequencies—I. Overlapping generations. Proc. Roy. Soc. Lond. B 1958, 149, 102–112.
49. Mitchell, M. An Introduction to Genetic Algorithms; MIT Press: Cambridge, MA, USA, 1996.
50. Vikhar, P.A. Evolutionary algorithms: A critical review and its future prospects. In Proceedings of the 2016 International Conference on Global Trends in Signal Processing, Information Computing and Communication (ICGTSPICC), Jalgaon, India, 22–24 December 2016; pp. 261–265.
51. Abel, D.L.; Trevors, J.T. Three subsets of sequence complexity and their relevance to biopolymeric information. Theor. Biol. Med. Model. 2005, 2, 29.
52. Durston, K.K.; Chiu, D.K.Y. A functional entropy model for biological sequences. Dynamics of Continuous, Discrete & Impulsive Systems, Series B: Applications & Algorithms, Supplement. In Proceedings of the International Conference on Engineering Applications and Computational Algorithms, Guelph, ON, Canada, 27–29 July 2005; Liu, X., Ed.; pp. 722–725.
53. Durston, K.K.; Chiu, D.K.Y. Functional Sequence Complexity in Biopolymers. In The First Gene: The Birth of Programming, Messaging and Formal Control; Abel, D.L., Ed.; LongView Press: New York, NY, USA, 2011; pp. 147–169.
54. Durston, K.K.; Chiu, D.K.Y.; Abel, D.L.; Trevors, J.T. Measuring the functional sequence complexity of proteins. Theor. Biol. Med. Model. 2007, 4, 47.
55. Díaz-Pachón, D.A.; Marks, R.J., II. Active Information Requirements for Fixation on the Wright–Fisher Model of Population Genetics. BIO-Complexity 2020, 2020, 1–6.
56. Kallenberg, O. Foundations of Modern Probability, 3rd ed.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 2.
57. Popov, S. Two-Dimensional Random Walk: From Path Counting to Random Interlacements; Cambridge University Press: Cambridge, UK, 2021.
58. Grimmett, G.; Stirzaker, D. Probability and Random Processes, 3rd ed.; Oxford University Press: Oxford, UK, 2001.
59. Komarova, N.L.; Sengupta, A.; Nowak, M.A. Mutation-selection networks of cancer initiation: Tumor suppressor genes and chromosomal instability. J. Theor. Biol. 2003, 223, 433–450.
Figure 1. Plot of $I^+(\theta) = \lim_{t\to\infty} I^+(\theta,t)$ in (80) as a function of $\theta$ for a system of molecular machines with transition kernel (73), proposal distribution (78), and null distribution (79). The system has $d = 5$ components, $b = 1.0$, and $a = -0.2$ (dash-dotted), $a = 0$ (solid) and $a = 0.2$ (dashed). The horizontal dotted line corresponds to the functional information $I_{f_0} = 3.47$.
Figure 2. Plot of $I^+(\theta) = \lim_{t\to\infty} I^+(\theta,t)$ in (80) as a function of $\theta$ for a system of molecular machines with transition kernel (73), proposal distribution (78), and null distribution (79). The system has $d = 5$ components, $b = 0.5$, and $a = -0.2$ (dash-dotted), $a = 0$ (solid), and $a = 0.2$ (dashed). The horizontal dotted line corresponds to the functional information $I_{f_0} = 5.09$.
Figure 3. Plot of $I^+(\theta,t)$ in (80) (dashed) and $I_s^+(\theta,t)$ (solid) as a function of $t$ for a system of molecular machines with transition kernel (73), proposal distribution (78), and null distribution (79). The system has $d = 5$ components and $\theta = 2.5$. The upper (lower) row corresponds to $b = 1$ ($b = 0.5$), whereas the left (right) column corresponds to $a = -0.2$ ($a = 0.2$). The horizontal lines in each figure illustrate $I^+(\theta)$ (dash-dotted) and the functional information $I_{f_0}$ (dotted).