Article

Assessing, Testing and Estimating the Amount of Fine-Tuning by Means of Active Information

by Daniel Andrés Díaz-Pachón 1 and Ola Hössjer 2,*

1 Division of Biostatistics, University of Miami, Miami, FL 33136, USA
2 Department of Mathematics, Stockholm University, 114 19 Stockholm, Sweden
* Author to whom correspondence should be addressed.
Entropy 2022, 24(10), 1323; https://doi.org/10.3390/e24101323
Submission received: 23 August 2022 / Revised: 15 September 2022 / Accepted: 19 September 2022 / Published: 21 September 2022
(This article belongs to the Special Issue Recent Advances in Statistical Theory and Applications)

Abstract:
A general framework is introduced to estimate how much external information has been infused into a search algorithm, the so-called active information. This is rephrased as a test of fine-tuning, where tuning corresponds to the amount of pre-specified knowledge that the algorithm makes use of in order to reach a certain target. A function f quantifies specificity for each possible outcome x of a search, so that the target of the algorithm is a set of highly specified states, whereas fine-tuning occurs if it is much more likely for the algorithm to reach the target as intended than by chance. The distribution of a random outcome X of the algorithm involves a parameter θ that quantifies how much background information has been infused. A simple choice of this parameter is to use θ f in order to exponentially tilt the distribution of the outcome of the search algorithm under the null distribution of no tuning, so that an exponential family of distributions is obtained. Such algorithms are obtained by iterating a Metropolis–Hastings type of Markov chain, which makes it possible to compute their active information under the equilibrium and non-equilibrium of the Markov chain, with or without stopping when the targeted set of fine-tuned states has been reached. Other choices of tuning parameters θ are discussed as well. Nonparametric and parametric estimators of active information and tests of fine-tuning are developed when repeated and independent outcomes of the algorithm are available. The theory is illustrated with examples from cosmology, student learning, reinforcement learning, a Moran type model of population genetics, and evolutionary programming.

1. Introduction

When Gödel published his incompleteness theorems [1], the mathematical world was shaken to such an extent that, to date, it has neither recovered nor fully assimilated the consequences [2]. Hilbert’s program to base mathematics on a finite set of axioms had previously been pursued by Alfred North Whitehead and Bertrand Russell [3]. However, this approach turned out to be impossible when Gödel proved that no finite set of axioms in a formal system can prove all its true statements, including its own consistency. At a similar but lesser scale, when David Wolpert and William MacReady published their No Free Lunch Theorems (NFLTs, [4,5]), there was disquiet in the community because these results imply that there is no one-size-fits-all algorithm that can do well in all searches [6], and thus that a “theory of everything” is not possible in machine learning. Wolpert and MacReady concluded that it was necessary to incorporate “problem-specific knowledge into the behavior of the algorithm” [5]. Active information (actinfo) was thus introduced in order to measure the amount of information carried by such problem-specific knowledge [7,8]. More specifically, the NFLTs say that no search works better on average than a blind search, i.e., a search according to a uniform distribution. Accordingly, actinfo is defined as
$$I^+ = \log \frac{P(A)}{P_0(A)},$$
where A ⊆ Ω is the non-empty target of the search algorithm, a subset of the finite sample space Ω, and P0 is the uniform probability measure (P0(A) = |A|/|Ω|). Here, P must be seen as the probability measure induced by the problem-specific knowledge of the researcher, whereas P0 is the underlying distribution assumed in the NFLTs. The latter corresponds to the absence of problem-specific knowledge, in accordance with Bernoulli’s Principle of Insufficient Reason (PoIR). An equivalent characterization of actinfo is the reduction in functional information
$$I^+ = I_{f_0} - I_f = -\log P_0(A) - (-\log P(A))$$
between algorithms that do not and do make use of background knowledge. The name functional information was introduced by Szostak and collaborators [9,10]. It refers to applications wherein A corresponds to all outcomes of an algorithm that are functional according to some criterion. Then, I_f0 and I_f are the self-information (measured in nats) of the event that an algorithm X produces a functional outcome, given that it was generated under P0 and P, respectively.
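To make the definitions concrete, the following minimal sketch computes the actinfo (1) and the functional informations for a hypothetical finite search problem, with a uniform null distribution and an assumed target probability under P; all numbers are illustrative, not taken from the paper.

```python
import numpy as np

# Toy illustration of active information (1): a sample space of 100 states,
# a target A of 4 states, a uniform null P0, and a hypothetical alternative P
# that an informed search is assumed to induce.
omega_size = 100
target = {3, 17, 42, 99}                 # target set A
P0_A = len(target) / omega_size          # blind search: P0(A) = |A| / |Omega|
P_A = 0.40                               # assumed informed-search probability P(A)

I_f0 = -np.log(P0_A)                     # functional information of the blind search
I_f = -np.log(P_A)                       # self-information under P
I_plus = np.log(P_A / P0_A)              # active information (1)

print(f"I_f0 = {I_f0:.3f} nats, I_f = {I_f:.3f} nats")
print(f"I+   = {I_plus:.3f} nats (= I_f0 - I_f = {I_f0 - I_f:.3f})")
```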
Suppose we do not know whether problem-specific knowledge has been used or not when the random search X ∈ Ω was generated. This corresponds to the hypothesis testing problem
$$H_0: X \sim P_0, \qquad H_1: X \sim P,$$
where data are generated from distributions P0 and P under the null and alternative hypotheses H0 and H1, respectively. It follows from (1) that I+ is the log likelihood ratio when testing H0 against H1, if data are censored so that only the event X ∈ A is observed.
When the sample space Ω is finite, or a continuous and bounded subset of a Euclidean space, the PoIR can be motivated by the fact that the uniform distribution maximizes the Shannon entropy, and thereby maximizes ignorance about the outcome of X. However, the uniform distribution is not a feasible choice of P0 for countably infinite sample spaces, nor for continuous, unbounded sample spaces. For this reason, actinfo was generalized to deal with unbounded spaces [11], by choosing P0 to maximize the Shannon entropy under side constraints ξ, such as the existence of various moments. This gives rise to a family of null distributions P0 = P0ξ, with ξ a nuisance parameter that has to be estimated or controlled for in order to estimate or bound the active information.
Actinfo has also been used for mode detection in unsupervised learning, among other applications [12,13]. Based on previous work by Montañez [14,15], actinfo has also been used for testing hypotheses [16]. More specifically, P is regarded as a random measure in [16], so that I+ is random as well and expressions for the tail probability of I+ can be found.

1.1. Fine-Tuning

Fine-tuning (FT) was introduced by Carter in physics and cosmology [17]. According to FT, the constants in the laws of nature and/or the boundary conditions in the standard models of physics must belong to intervals of low probability in order for life to exist. Since its inception, FT has generated a great deal of fascination, as seen in multiple popular science books (e.g., [18,19,20,21]) and scientific articles (e.g., [22,23,24,25]). For a given constant of nature X, the connection between FT and active information can be described in three steps:
(i)
Establishing the life-permitting interval (LPI) A that allows the existence of life for the constant, with Ω = (0, ∞) = ℝ+ or Ω = ℝ denoting the range of values that this constant could possibly take, including those that do not permit life.
(ii)
Determining the probability P0(A) of such an LPI. If P0 = P0ξ contains unknown parameters ξ, find an upper bound
$$P_0^{\max}(A) = \max_{\xi} P_{0\xi}(A)$$
of P0(A).
(iii)
Suppose that H1 corresponds to an agent who uses background knowledge of what is required for life to exist in order to bring about a constant of nature X that definitely permits life (P(A) = 1). The active information I+ = I_f0 = −log P0(A) is then a measure of how much background knowledge this agent has infused. Following [26,27], we conclude that X is finely tuned when the lower bound −log P0^max(A) of I+ = I_f0 is large enough. That is, FT corresponds to infusing a high degree of background knowledge into a problem.
Fine-tuning has also been used in biology. Dingjan and Futerman explored it for cell membranes [28,29], whereas Thorvaldsen and Hössjer [30] formalized it for a large class of biological models. According to [30], a system is fine-tuned if it satisfies the two following requirements:
(a)
It has an independent specification;
(b)
It is very unlikely to occur by chance.

1.2. The Present Article

In this article, actinfo will not only be used in the algorithmic sense. It will also be employed for testing the presence of, and estimating the degree of, fine-tuning (FT) of a search algorithm or agent who brings about X. Our definition of FT relies on (a) and (b), and in order to formalize these two concepts, we introduce a specificity function f, which quantifies, in terms of f(x), how specified an outcome x ∈ Ω is. The target A, on the other hand, is a set of highly specified states, that is, all states with a degree of specificity that exceeds a given threshold f0. Then, I+ in (1) is a test statistic for testing whether an algorithm has a much larger probability of reaching the set of highly specified states compared to a random search. This is a test of FT, since reaching the target corresponds to specificity (a), whereas reaching it with a much higher probability than expected by chance corresponds to (b).
To calculate I+, the distributions P0 and P of the random search algorithm under H0 and H1, respectively, need to be defined. As mentioned above, the null distribution P0 is typically chosen according to some criterion, such as maximizing entropy, possibly with some extra constraints on moments for unbounded Ω, which was the strategy implemented in [26,27]. Another possibility is to choose P0 as the equilibrium distribution of a Markov chain that models the dynamics of the system under the null hypothesis of no external input. In general, P0 = P0ξt involves a number of nuisance parameters ξ and, sometimes, also the time point t at which an algorithm that does not make use of external information stops. The choice of P = Pθξt is problem specific, and it possibly involves the nuisance parameters ξ of the null distribution, the time point t when the algorithm stops, as well as the tuning parameters θ that correspond to infusing background knowledge into the search problem. Therefore, in its most general form, the actinfo (1) is a function I+ = I+(θ, ξ, t) of the tuning parameters θ, the nuisance parameters ξ, and the time point t.
This general framework has many applications based on different choices of f, A, P 0 , and P. For some models, f is a binary function that quantifies functionality, so that A is the set of objects of a certain type (e.g., universes, proteins, protein complexes, or cellular networks) that are functional or permit life, among the set Ω of all such objects.
Another possibility is to choose A as the set of populations x whose (expected) fitness f ( x ) exceeds a given threshold. In this setting, P 0 ξ t ( A ) corresponds to the probability that a randomly chosen population would evolve and reach target A of high fitness at time t, given that no background knowledge of the specificity function f is used to generate X, so that natural selection does not occur. The functional information I f 0 = log P 0 ξ t ( A ) corresponds to the amount of external information that an evolutionary algorithm infuses under H 1 , given that it brings about X so that A happens with certainty ( P ( A ) = 1 ) within time t. In this case, the population is finely tuned when I f 0 is large enough. More generally, we say that an evolutionary algorithm that generates X P = P θ ξ t after t time steps is finely tuned when I + ( θ , ξ , t ) is large enough. Typically, θ involves the selection parameters that determine to which extent a population evolves towards higher fitness.
A third possibility is to choose f(X) = X as the test score of a randomly chosen student, whereas A = [f0, ∞) is the set of results of those students who pass the test with a score of at least f0. Assume that f(X) ∼ N(ξ, 1) for a randomly chosen student who did not prepare for the test (H0), whereas f(X) ∼ N(ξ + θt, 1) for a randomly chosen student who prepared for the test for a period of length t (H1). Then, P0(A) = P0ξ(A) = 1 − Φ(f0 − ξ), whereas P(A) = Pθξt(A) = 1 − Φ(f0 − ξ − θt), where Φ is the cumulative distribution function of a standard normal distribution. In particular, the tuning parameter θ > 0 corresponds to the amount of knowledge that a student is expected to generate per unit time of study.
The unified treatment of search problems and FT of this paper is organized as follows: Section 2 introduces the specification function f and the set A of highly specified states. Section 3 introduces a class of probability distributions P = Pθ for which the specificity function f is used to exponentially tilt the null distribution P0, so that outcomes with high specificity are more likely to occur, and with a scalar tuning parameter θ of Pθ that corresponds to the amount of exponential tilting. A proof is presented that it is possible to obtain a Metropolis–Hastings type Markov chain in discrete time t = 0, 1, 2, …, whose outcome X = X_t at time t has the aforementioned exponentially tilted distribution under equilibrium, that is, when t is large. The corresponding actinfo I+(θ, t) is shown to increase monotonically with t towards an equilibrium limit. The actinfo of a search algorithm X = X_{t∧T} that stops at time T, when the targeted set A of highly specified states has been reached, is also shown to increase more rapidly. Section 4 introduces various nonparametric and parametric estimators of actinfo, and corresponding tests of FT, when n repeated and independent outputs of the search algorithm are available. In particular, large deviations theory is used to prove that the significance level of these tests, i.e., the probability of detecting FT under H0, goes to zero at an exponential rate as the sample size n increases. Section 5 presents a number of examples from cosmology, student learning, reinforcement learning, and population genetics that illustrate our approach. A discussion in Section 6 follows, whereas the proofs and further details about the models are presented in Section 7.

2. Specificity and Target

Consider a function f : Ω → ℝ and assume that the objective of the search algorithm, or the agent that brings about X, is to find regions in Ω where f is large. The rationale for this is an independent specification, where a more specified state x ∈ Ω corresponds to a larger f(x). It is further assumed that the target set in (1) is given by
$$A = \{x \in \Omega;\; f(x) \ge f(x_0)\}$$
for some x0 ∈ Ω. This implies that the purpose of the search algorithm or the agent is to bring about an X that is at least as specified as x0. We will refer to f as a specificity function of the agent or an objective function of the search algorithm.
Several examples of specificity functions are provided in Section 5. For instance, Example 2 deals with student learning. For a special case of this model, f ( x ) = x represents the test score of a student, whereas x 0 is a reference value that corresponds to the minimum score needed to pass the test.
For cosmological FT (Example 1), x is the value of a particular constant of nature and the specificity function equals
$$f(x) = 1\{x \in A\},$$
where 1{·} is the indicator function. That is, f has a binary range, with f(x) = 1 and f(x) = 0 corresponding to whether or not x permits a universe with life; in particular, x0 is a universe that permits life. It follows that A is the LPI of this constant. Moreover, X is the value of this constant of nature for a randomly generated universe, with a distribution that either incorporates external information (H1) or not (H0).
In the context of proteins, x is taken to be an amino acid sequence, whereas f(x) in (6) quantifies whether the protein that the sequence corresponds to is functional (1) or not (0). For instance, X could be the outcome of a random evolutionary process, the goal of which is to generate a functioning protein, and this process either makes use of external information (H1) or not (H0). In a more refined example (Example 4), x is a molecular machine that consists of a possibly large number of proteins (or parts), and f(x) is (a monotone function of) the fitness of x.

Interpretation of Target

There are at least two ways of interpreting x0, and hence also the target set A. According to the first interpretation, x0 is the outcome of a random variable X′ ∈ Ω; that is, the outcome of a first search. Suppose that X is another random variable that represents a second (possibly future) search, independent of X′. Then, if we condition on the outcome x0 of the first search, the actinfo I+ in (1) is the log likelihood ratio for the event that the second search variable X is at least as specified as the observed value f(x0) of the first search.
There is, however, no need to associate x0 in (5) with a first search variable X′. Instead, some a priori information may be used to define which values of f represent a high amount of specificity. This gives rise to the second interpretation of x0, according to which x0 is used for defining outcomes with a high and low degree of specificity, using f0 = f(x0) as a cutoff. According to this interpretation, the two sets A in (5) and its complement
$$A^c = \Omega \setminus A = \{x;\; f(x) < f(x_0)\}$$
represent a dichotomization of specificity, so that A and A^c consist of all states with high and low specificity, respectively. With this interpretation of x0, I+ is the log likelihood ratio for testing FT based on the search variable X. In particular, suppose that the specificity function f is bounded, i.e.,
$$f_{\max} = \max_{x \in \Omega} f(x) < \infty.$$
Then, the most stringent definition of high specificity,
$$f_0 = f_{\max},$$
only regards outcomes with a maximal value of f as highly specified, so that
$$A = \Omega_{\max} = \{x \in \Omega;\; f(x) = f_{\max}\}.$$
Note that (6) is a special case of (9).

3. Active Information for Exponentially Tilted Systems

Throughout Section 3, ξ is assumed to be known and the null distribution does not involve any time index t. Therefore, P 0 is known, whereas P = P θ t involves the tuning parameters θ and the time index t. It will be further assumed in Section 3.1 and Section 3.2 that the system is in equilibrium, or that the time index t is fixed, so that t can also be dropped under H 1 ( P = P θ ).

3.1. Exponential Tilting

Let Pθ be an exponentially tilted version of P0 for some scalar tuning parameter θ > 0, which will also be called a tilting parameter. Exponential tilting is often used for rare-event simulation [31,32]. Here, f is used to define the tilted version of P0 as
$$P_\theta(x) = \frac{e^{\theta f(x)}}{M(\theta)} P_0(x),$$
with
$$M(\theta) = \sum_{x \in \Omega} e^{\theta f(x)} P_0(x)$$
a normalizing constant assuring that Pθ is a probability measure. For countable sample spaces Ω, we interpret P0(x) and Pθ(x) as probability masses, whereas for continuous sample spaces, they are probability densities and the sum in (11) is replaced by an integral. The larger the tilting parameter θ > 0 is, the more the probability mass of Pθ concentrates on regions of large f. In particular, P∞, the weak limit of Pθ as θ → ∞, is supported on (9) whenever (7) holds.
The parametric family
$$\mathcal{P} = \{P_\theta;\; \theta \ge 0\}$$
of distributions is an exponential family [33] (Section 1.5), and each Pθ ∈ 𝒫 gives rise to a separate version of actinfo. This is summarized in the following proposition (cf. Section 7 for a proof):
Proposition 1.
Suppose the target set A is defined as in (5) for some x0 ∈ Ω such that P0(A) > 0. Then, Pθ(A) is a strictly increasing function of θ ≥ 0 with P∞(A) = 1. Consequently, the actinfo
$$I^+(\theta) = \log \frac{P_\theta(A)}{P_0(A)}$$
is a strictly increasing function of θ ≥ 0, with I+(0) = 0 and I+(∞) = I_f0 = −log P0(A).
The intuitive interpretation of Proposition 1 is that the larger θ is, the more problem-specific knowledge is infused into P θ in terms of shifting probability mass towards regions in Ω where f, the specificity function, is large.
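The following minimal sketch illustrates Proposition 1 numerically, using a randomly generated specificity function on a small finite sample space (an illustrative choice, not a model from the paper):

```python
import numpy as np

# Exponential tilting (10)-(11) on a finite sample space: check numerically
# that P_theta(A) and I+(theta) in (13) increase with theta (Proposition 1).
rng = np.random.default_rng(1)
f = rng.uniform(size=50)            # specificity of each of 50 states (illustrative)
P0 = np.full(50, 1 / 50)            # uniform null distribution
A = f >= np.quantile(f, 0.9)        # target A: the most specified states

def tilted(theta):
    w = np.exp(theta * f) * P0      # e^{theta f(x)} P0(x)
    return w / w.sum()              # divide by M(theta)

for theta in [0.0, 1.0, 5.0, 20.0]:
    P_theta_A = tilted(theta)[A].sum()
    I_plus = np.log(P_theta_A / P0[A].sum())
    print(f"theta = {theta:5.1f}: P_theta(A) = {P_theta_A:.3f}, I+ = {I_plus:.3f}")
```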
A simple instance of exponential tilting is the student learning example of Section 1.2. Recall that f(x) = x is the test score of a student, with X ∼ N(ξ, 1) for a randomly chosen student who did not prepare for the test (H0), whereas X ∼ N(ξ + θ, 1) is the test score of a randomly chosen student who prepared for the test during t = 1 units of time (H1). It is clear that
$$P_0(x) = e^{-(x-\xi)^2/2}/\sqrt{2\pi}, \qquad P_\theta(x) = e^{-(x-\xi-\theta)^2/2}/\sqrt{2\pi} = P_0(x)\, e^{\theta x}/M(\theta).$$

3.2. Metropolis–Hastings Systems with Exponential Tilting Equilibrium

Inspired by Markov chain Monte Carlo methods [34], consider a Markov chain X0, X1, … ∈ Ω for which Pθ is the equilibrium distribution. Consequently, if P = Pθ (that is, under the alternative hypothesis H1 in (3) when θ > 0), X = X_t may be interpreted as the outcome of an algorithm after t iterations, provided that t is so large that equilibrium has been reached. The assumption is made that this algorithm knows f and tries to explore the whole state space Ω. If the Markov chain has the equilibrium distribution (10), this corresponds to an algorithm that favors jumps towards regions of large f when θ > 0, an effect that is accentuated the larger θ is. In further detail, the transition kernel of the chain is an instance of the well-known Metropolis–Hastings (MH) algorithm [35,36], which is closely related to simulated annealing [37]. This kernel has a probability or density
$$\pi_\theta(x, y) = r_\theta(x)\,\delta(x, y) + \alpha_\theta(x, y)\, q(x, y)$$
for jumps from x to y, where δ(x, ·) is a point mass at x ∈ Ω, q(x, ·) is a proposal distribution of jumps from a current position x of the Markov chain,
$$\alpha_\theta(x, y) = \min\left\{1,\; \frac{e^{\theta f(y)} P_0(y)\, q(y, x)}{e^{\theta f(x)} P_0(x)\, q(x, y)}\right\}$$
is the probability of accepting a proposed move from x to y, whereas
$$r_\theta(x) = 1 - \sum_{y \in \Omega} \alpha_\theta(x, y)\, q(x, y)$$
is the probability that the Markov chain rejects a proposed move away from x (for continuous sample spaces, q(x, ·) is a probability density and the sum in (16) is replaced by an integral). The transition of the Markov chain from X_t = x to the next state X_{t+1} is described in two steps as follows. First, a candidate Y ∼ q(x, ·) is proposed. Then, in the second step, this candidate is either accepted with probability αθ(x, Y), so that X_{t+1} = Y, or it is rejected with probability 1 − αθ(x, Y), so that X_{t+1} = X_t. It is well known that Pθ is the equilibrium distribution of this Markov chain whenever it is irreducible; that is, provided the proposal distribution q is defined in such a way that moving between any pair of states in Ω in a finite number of steps is possible [38] (pp. 243–245).
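A minimal simulation sketch of this kernel is given below, under the additional assumptions of a uniform null distribution and a symmetric nearest-neighbour proposal on a ring of states (both choices made here only for illustration; the paper leaves q problem specific). The empirical distribution of the chain should then approach the tilted distribution (10).

```python
import numpy as np

# Minimal Metropolis-Hastings sketch of the kernel (14)-(16) on a finite
# state space with uniform P0 and a symmetric proposal, so that P0 and q
# cancel in the acceptance probability (15).
rng = np.random.default_rng(2)
K, theta = 30, 4.0
f = np.linspace(0.0, 1.0, K)             # specificity, increasing in the state

def mh_step(x):
    y = (x + rng.choice([-1, 1])) % K    # proposal Y ~ q(x, .)
    alpha = min(1.0, np.exp(theta * (f[y] - f[x])))   # acceptance probability (15)
    return y if rng.uniform() < alpha else x

x, counts = 0, np.zeros(K)
for _ in range(200_000):
    x = mh_step(x)
    counts[x] += 1

P_theta = np.exp(theta * f) / np.exp(theta * f).sum()  # tilted target (10)
diff = np.abs(counts / counts.sum() - P_theta).max()
print(f"max |empirical - P_theta| = {diff:.4f}")
```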
In particular, if q is symmetric and P0 is uniform, then a proposed upward move with f(Y) > f(x) and Pθ(Y) > Pθ(x) is always accepted, whereas a proposed downward move with f(Y) < f(x) is accepted with probability Pθ(Y)/Pθ(x). The Markov chain only makes local jumps if q(x, ·) puts all its probability mass in a small neighborhood of x, for any x ∈ Ω. At the other extreme is a chain with the global proposal distribution q(x, ·) = Pθ for any x ∈ Ω; all proposed jumps of this chain are then accepted (αθ(x, y) = 1), and {X_t}_{t=1}^∞ is a sequence of independent and identically distributed (i.i.d.) random variables with X_t ∼ Pθ.
The choice of proposal distribution q is problem specific. In this section, we defined q for Metropolis–Hastings type algorithms that require knowledge of the specificity function f, since the acceptance probability (15) is a function of f. Proposed moves also occur for evolutionary algorithms (Examples 4 and 5 of Section 5). The outcomes of these algorithms are typically the result of many small changes, with specificity corresponding to functionality or fitness. The proposed moves are local mutations that either survive (are accepted) or do not. Other algorithms (such as reinforcement learning in Example 3 of Section 5) only make use of estimates of the specificity function. However, it is still meaningful for these algorithms to talk about proposed moves that are initially large (exploration phase), followed by a subsequent period of small or no moves (exploitation phase). In the context of Metropolis–Hastings algorithms, this is the strategy of simulated annealing, where large moves are initially proposed (corresponding to high temperatures), followed by subsequent small proposed moves (corresponding to low temperatures).

3.3. Active Information for Metropolis–Hastings Systems in Non-Equilibrium

Suppose, for simplicity, that the sample space Ω is finite, and that the states in Ω are listed in some order. Let
$$P_0 = (P_0(x);\; x \in \Omega)$$
be a row vector of length | Ω | with all the null distribution probabilities, and let
$$\Pi_\theta = \left(\pi_\theta(x, y);\; x, y \in \Omega\right)$$
be a square matrix of order |Ω| that defines the transition kernel of the Markov chain {X_t}_{t=0}^∞ of Section 3.2. If X_0 ∼ P0, then by the Chapman–Kolmogorov equation, X_t ∼ Pθt, where
$$(P_{\theta t}(x);\; x \in \Omega) = P_{\theta t} = P_0 \Pi_\theta^t.$$
Hence, if P = P θ t , then X = X t corresponds to observing the Markov chain at time t, under the alternative hypothesis H 1 in (3). Some basic properties of the corresponding actinfo are summarized in the following proposition, which is proved in Section 7:
Proposition 2.
Suppose that X = X t is obtained by iterating t times a Markov chain with initial distribution (17) and transition kernel (18). The actinfo then equals
$$I^+(\theta, t) = \log \frac{P_{\theta t}(A)}{P_0(A)} = \log \frac{P_0 \Pi_\theta^t\, v}{P_0\, v},$$
where v is a column vector of length |Ω| with ones in positions x ∈ A and zeros in positions x ∈ A^c. In particular, I+(θ, 0) = 0 and
$$\lim_{t \to \infty} I^+(\theta, t) = I^+(\theta).$$
Therefore, I + ( θ , t ) > 0 corresponds to knowledge of f being used to generate t jumps of the Markov chain, under the alternative hypothesis H 1 in (3).
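Since Ω is finite, (20) can be evaluated directly by matrix multiplication. The sketch below does this for a small illustrative chain (states on a ring, uniform P0, symmetric nearest-neighbour proposal — the same assumptions as in the sketch of Section 3.2), showing how I+(θ, t) grows from 0 towards its equilibrium value:

```python
import numpy as np

# Non-equilibrium actinfo (20): build the full MH transition matrix Pi_theta
# and track I+(theta, t) over time; by Proposition 2 it starts at 0 and
# increases towards I+(theta).
K, theta = 10, 3.0
f = np.arange(K) / (K - 1)
P0 = np.full(K, 1.0 / K)
v = (f >= 0.8).astype(float)               # indicator column vector of A

Pi = np.zeros((K, K))
for x in range(K):
    for y in ((x - 1) % K, (x + 1) % K):   # symmetric nearest-neighbour proposal
        Pi[x, y] += 0.5 * min(1.0, np.exp(theta * (f[y] - f[x])))  # alpha * q
    Pi[x, x] += 1.0 - Pi[x].sum()          # rejection mass r_theta(x)

p = P0.copy()
for t in range(51):
    if t % 10 == 0:
        print(f"t = {t:2d}: I+(theta, t) = {np.log(p @ v / (P0 @ v)):.3f}")
    p = p @ Pi                             # Chapman-Kolmogorov step
```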

3.4. Active Information for Metropolis–Hastings Systems with Stopping

In Section 3.3, P = Pθt was obtained by starting a random search with null distribution P0, and then iterating the Markov chain of Section 3.2 t times. However, knowledge of f can be utilized even more, by stopping the Markov chain if the target A in (5) is reached before time t. This can be formalized by introducing the stopping time
$$T = \min\{t \ge 0;\; X_t \in A\}$$
and letting
$$P^s_{\theta t}(x) = P(X_{t \wedge T} = x)$$
be the probability distribution of the stopped Markov chain X_{t∧T}, with the superscript s in (23) referring to stopping. In particular,
$$P^s_{\theta t}(A) = \sum_{x \in A} P^s_{\theta t}(x) = P(T \le t)$$
is the probability of reaching the target A for the first time after t iterations or earlier. The theory of phase-type distributions can then be used to compute the target probability P^s_θt(A) in (24) [39,40]. To this end, clump all states x ∈ A into one absorbing state, and decompose the transition kernel in (18) according to
$$\Pi_\theta = \begin{pmatrix} \Pi_\theta^{\mathrm{na}} & \Pi_\theta^{\mathrm{na,a}} \\ 0 & 1 \end{pmatrix},$$
where Π_θ^na is a square matrix of order |A^c| containing the transition probabilities between all non-absorbing states in A^c, whereas Π_θ^na,a is a column vector of length |A^c| with transition probabilities π(x, A) from all the non-absorbing states x ∈ A^c into the absorbing state A. Moreover, P_0^na = (P0(x); x ∈ A^c) is a row vector of length |A^c| that is the restriction of the starting distribution P0 in (17) to all non-absorbing states. Then,
$$P^s_{\theta t}(A) = 1 - P_0^{\mathrm{na}}\, (\Pi_\theta^{\mathrm{na}})^t\, 1,$$
where 1 is a column vector of | A c | ones.
The actinfo I s + of a search procedure with stopping is thus defined:
Proposition 3.
Suppose that X = X_t is obtained by iterating a Markov chain with initial distribution (17) and transition kernel (18) (for some θ ≥ 0) at most t times, stopping whenever the set A is reached. Then, the actinfo is given by
$$I_s^+(\theta, t) = \log \frac{P^s_{\theta t}(A)}{P_0(A)} = \log \frac{1 - P_0^{\mathrm{na}}\, (\Pi_\theta^{\mathrm{na}})^t\, 1}{P_0\, v},$$
with P 0 and v as in Proposition 2, whereas P 0 na , Π θ na , and 1 are defined below (25) and (26). This actinfo satisfies
$$I_s^+(\theta, t) \ge I^+(\theta, t)$$
and I s + ( θ , t ) is a non-decreasing function of t such that
$$\lim_{t \to \infty} I_s^+(\theta, t) = I_{f_0}$$
and
$$\sum_{t=0}^{\infty} \left(1 - P_0(A)\, e^{I_s^+(\theta, t)}\right) = E(T).$$
Proposition 3 is proven in Section 7. Inequality (28) states that, for a search procedure with t iterations, knowledge about f that is used for stopping the Markov chain in (18) will increase the actinfo, regardless of whether knowledge about f was used (θ > 0) or not (θ = 0) when iterating the Markov chain. Equation (29) is a consequence of the fact that the target A is eventually reached with probability 1, so that the actinfo of a search procedure with stopping equals the functional information I_f0 = −log P0(A) after many iterations of the Markov chain. Moreover, Equation (30) tells us that the rate at which P0(A) e^{I_s^+(θ,t)} approaches 1 is determined by the expected waiting time E(T) of reaching the target.
From Proposition 3, actinfo for a system with stopping is closely related to the phase-type distribution of the waiting time T until the target is reached. This has been studied in [41], in the context of the expression of a number of genes, with x being the collection of the regulatory regions of all these genes.
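For a small finite chain, (26), (27) and (30) can be computed directly from the phase-type decomposition, as in the following sketch (the chain construction repeats the illustrative example used after Proposition 2):

```python
import numpy as np

# Proposition 3 in practice: phase-type decomposition (25) of a small MH
# chain, the stopped target probability (26), the actinfo with stopping (27),
# and E(T) consistent with (30).
K, theta = 10, 3.0
f = np.arange(K) / (K - 1)
P0 = np.full(K, 1.0 / K)
A = f >= 0.8

Pi = np.zeros((K, K))
for x in range(K):
    for y in ((x - 1) % K, (x + 1) % K):
        Pi[x, y] += 0.5 * min(1.0, np.exp(theta * (f[y] - f[x])))
    Pi[x, x] += 1.0 - Pi[x].sum()

na = ~A                                    # non-absorbing states A^c
Pi_na = Pi[np.ix_(na, na)]                 # Pi_theta^na in (25)
P0_na, ones = P0[na], np.ones(na.sum())
P0_A = P0[A].sum()

for t in [1, 5, 10, 20]:
    P_st = 1.0 - P0_na @ np.linalg.matrix_power(Pi_na, t) @ ones   # Eq. (26)
    print(f"t = {t:2d}: I_s+(theta, t) = {np.log(P_st / P0_A):.3f}")

# E(T) = sum_t P(T > t) = P0^na (I - Pi^na)^{-1} 1, consistent with (30)
ET = P0_na @ np.linalg.solve(np.eye(na.sum()) - Pi_na, ones)
print(f"E(T) = {ET:.2f}")
```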

4. Estimating Active Information and Testing Fine-Tuning

In Section 3, we gave explicit expressions for the actinfo of Metropolis–Hastings algorithms with a scalar tuning parameter θ. In general, however, it might be infeasible to calculate I+, either because the sample space is very large, or because the nuisance parameters ξ and/or the tuning parameters θ are unknown. It is then of interest to consider ways of estimating I+ from data, for instance through Monte Carlo-based methods. To this end, we will assume that the random search algorithm is repeated independently, under the same conditions, n times. For instance, suppose that {X_{it}}_{t=0}^∞ corresponds to independent realizations i = 1, …, n of a search algorithm. If these independent realizations are recorded or stopped at one single time point, the outcome is either X_i = X_{it} for i = 1, …, n, or X_i = X_{i,t∧T_i} for i = 1, …, n, depending on whether the search algorithm is stopped at a fixed time point t or at random time points {T_i}_{i=1}^n. In either case, a sample of i.i.d. random variables
$$X_1, \ldots, X_n \sim Q$$
is obtained. These repeated outcomes of the search algorithm will be used to test for and estimate the degree of fine-tuning. The methodology depends on whether the null distribution P0 is known or involves unknown nuisance parameters.

4.1. Null Distribution Known

Suppose the null distribution P 0 is known. The sample in (31) is then used for testing between the two hypotheses
$$H_0: Q = P_0, \qquad H_1: Q \in \mathcal{P}_1,$$
with
$$\mathcal{P}_1 = \{P;\; P(A) \ge p_{\min}\}$$
the set of distributions that correspond to fine-tuning. Suppose an estimate Q̂(A) of the probability that X ∈ A is computed from data (31), with an associated empirical actinfo
$$\hat{I}^+ = \hat{I}_n^+ = \log \frac{\hat{Q}(A)}{P_0(A)}.$$
If Q ^ ( A ) is a consistent estimator of Q ( A ) , then for large sample sizes, I ^ + will be close to
$$I_Q^+ = \log \frac{Q(A)}{P_0(A)},$$
which equals 0 under H0 and I+ = I_P+ under H1, for some particular P ∈ 𝒫_1. To test H0 against H1,
$$\text{reject } H_0 \text{ when } \hat{I}^+ \ge I_{\min},$$
where I_min is a pre-specified lower bound on the range of values of the actinfo that corresponds to FT.

4.1.1. Nonparametric Estimator and Test

In Section 3, P = Pθ, P = Pθt, or P = P^s_θt were used for distributions that make use of pre-specified knowledge. These distributions involve the tilting parameter θ, and possibly also the number of iterations t of the algorithm and a stopping time T. In this section, however, no assumption other than P ∈ 𝒫_1 is made on P, and a nonparametric version of the empirical actinfo is used. The fraction
$$\hat{Q}(A) = \frac{1}{n} \sum_{i=1}^{n} 1\{X_i \in A\}$$
of random searches that fall into A is used as an estimate of Q(A). Therefore, (37) only requires knowledge of the set A, not of the function f.
The following result establishes the asymptotic normality of the nonparametric version of the estimator I ^ + in (34). Moreover, large deviations [42] are used to show that the significance level of the nonparametric version of the FT test (36) goes to zero exponentially fast with n (see Section 7 for more details of the proof).
Proposition 4.
Suppose the empirical actinfo I ^ n + in (34) is computed nonparametrically, using (37) as an estimate of the target probability Q ( A ) . Then, I ^ n + is an asymptotically normal estimator of I Q + in (35), in the sense that
$$\sqrt{n}\,(\hat{I}_n^+ - I_Q^+) \xrightarrow{\;\mathcal{L}\;} N(0, V) \quad \text{as } n \to \infty,$$
where →ℒ refers to convergence in distribution, and
$$V = \frac{1 - Q(A)}{Q(A)}$$
is the variance of the limiting normal distribution. The significance level of the test (36) for fine-tuning, with threshold I min , satisfies
$$\lim_{n \to \infty} \frac{\log P_{H_0}(\hat{I}_n^+ \ge I_{\min})}{n} = -C,$$
where
$$C = p_{\min} \log \frac{p_{\min}}{P_0(A)} + (1 - p_{\min}) \log \frac{1 - p_{\min}}{1 - P_0(A)}$$
is the Kullback–Leibler divergence between Bernoulli distributions with success probabilities p min = P 0 ( A ) exp ( I min ) and P 0 ( A ) , respectively.
Remark 1.
The conclusion of Proposition 4 is that the probability of observing an actinfo that corresponds to fine-tuning by chance decays at rate e^{−Cn} when the sample size n becomes large.
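The sketch below illustrates Proposition 4 by simulating the nonparametric test under H0 for a hypothetical choice of P0(A) and I_min; note that the agreement with e^{−Cn} is only in the exponential rate (41), so the match is rough for moderate n.

```python
import numpy as np

# Nonparametric test (36)-(37) under H0: the number of hits on A is
# Binomial(n, P0(A)), so the significance level can be simulated directly.
# P0(A) and I_min are illustrative choices.
rng = np.random.default_rng(3)
P0_A, I_min = 0.05, np.log(2.0)
p_min = P0_A * np.exp(I_min)                 # = 0.10
C = p_min * np.log(p_min / P0_A) + (1 - p_min) * np.log((1 - p_min) / (1 - P0_A))

for n in [25, 50, 100, 200]:
    Q_hat = rng.binomial(n, P0_A, size=100_000) / n       # Q_hat(A) under H0
    with np.errstate(divide="ignore"):
        I_hat = np.log(Q_hat / P0_A)                      # empirical actinfo (34)
    level = np.mean(I_hat >= I_min)                       # P_H0(I_hat >= I_min)
    print(f"n = {n:3d}: level = {level:.4f}, exp(-C n) = {np.exp(-C * n):.4f}")
```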

4.1.2. Parametric Estimator and Test

Suppose that there is a priori knowledge that P is close to the parametric exponential family 𝒫 of distributions in (10)–(12) for some value θ > 0 of the tilting parameter. A parametric test of actinfo is then naturally defined. For this, first compute the maximum likelihood estimate
$$\hat{\theta} = \hat{\theta}_n = \arg\max_{\theta \ge 0} \sum_{i=1}^{n} \log P_\theta(X_i)$$
of θ , and use it to define a parametric estimate
$$\hat{Q}(A) = P_{\hat{\theta}}(A)$$
of the target probability Q(A), which is inserted into (34) to define a parametric version of the empirical actinfo Î+. As opposed to (37), the estimate (43) requires full knowledge of f.
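As an illustration, in the Gaussian student example of Section 3.1 the MLE (42) has the closed form θ̂ = max(0, X̄ − ξ), so the parametric and nonparametric estimates of Q(A) can be compared directly (all sample numbers below are illustrative):

```python
import numpy as np
from math import erf, sqrt

# Parametric estimate (42)-(43) in the Gaussian student example, where
# P_theta is N(xi + theta, 1); Phi is the standard normal distribution function.
def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(4)
xi, theta_true, f0, n = 0.0, 0.8, 2.0, 200
X = rng.normal(xi + theta_true, 1.0, size=n)   # n searches, assumed from P_theta

theta_hat = max(0.0, X.mean() - xi)            # maximum likelihood estimate (42)
Q_par = 1.0 - Phi(f0 - xi - theta_hat)         # parametric estimate (43) of Q(A)
Q_nonpar = np.mean(X >= f0)                    # nonparametric estimate (37)
P0_A = 1.0 - Phi(f0 - xi)

print(f"theta_hat = {theta_hat:.3f}")
print(f"parametric    I+ = {np.log(Q_par / P0_A):.3f}")
print(f"nonparametric I+ = {np.log(Q_nonpar / P0_A):.3f}")
```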
To analyze the properties of the estimator (34) and test (36), introduce
$$\theta^* = \arg\min_{\theta \ge 0} D_{KL}(Q \,\|\, P_\theta),$$
where
$$D_{KL}(Q \,\|\, P_\theta) = \sum_{x \in \Omega} Q(x) \log \frac{Q(x)}{P_\theta(x)}$$
is the Kullback–Leibler divergence between Q and Pθ. From (44), P_θ* is the distribution in 𝒫 that best approximates Q. In particular, θ* = θ if Q ∈ 𝒫 and Q = Pθ for some θ ≥ 0.
The following proposition shows that Î+ is an asymptotically normal estimator of I+(θ*) in (13), which differs from I_Q+ in (35) whenever Q ∉ 𝒫. Moreover, the proposition also provides large sample properties of the significance level of the test for actinfo (cf. Section 7 for details of the proof):
Proposition 5.
Suppose the empirical actinfo I ^ n + in (34) is computed parametrically, using an estimate (43) of the target probability Q ( A ) . Then, I ^ n + is an asymptotically normal estimator of I + ( θ * ) , in the sense that
$$\sqrt{n}\left(\hat{I}_n^+ - I^+(\theta^*)\right) \xrightarrow{\;\mathcal{L}\;} N(0, V) \quad \text{as } n \to \infty,$$
where the variance of the limiting normal distribution is given by
$$V = \frac{\mathrm{Cov}^2_{P_{\theta^*}}\!\big(f(X),\, I(f(X) \ge f_0)\big)\, \mathrm{Var}_Q[f(X)]}{P^2_{\theta^*}(A)\, \mathrm{Var}^2_{P_{\theta^*}}[f(X)]}.$$
Moreover, the significance level of the parametric test for fine-tuning, based on (36) and (43), satisfies
$$\lim_{n \to \infty} \frac{\log P_{H_0}\left(\hat{I}_n^+ \ge I_{\min}\right)}{n} = -C,$$
for
$$C = \sup_{\phi > 0}\left\{\phi\, E_{P_{\min}}[f(X)] - \log M(\phi)\right\},$$
where P_min = P_θmin, with θmin < θ* the solution of P_θmin(A) = p_min = P0(A) exp(I_min), M(φ) is given by (11), and p_min is defined in (33).

4.1.3. Comparison between Nonparametric and Parametric Estimates of Actinfo

The two versions of empirical actinfo are complementary. The nonparametric version is preferable in the sense that it makes fewer assumptions about the distribution P of the random algorithm under H1, and in particular, it is a consistent estimator of I_Q+ in (35). The parametric version of Î+, on the other hand, is preferable when nQ(A) is small, since it makes use of all data in order to estimate Q(A), although it is not a consistent estimator of I_Q+ when Q ∉ 𝒫. The asymptotic variances in (39) and (47), as well as the rates of exponential significance level decrease in (41) and (49), agree when Q = P_θ* and f(x) = f0 · 1{x ∈ A}, which is a special case of (8).

4.2. Null Distribution Unknown

Suppose that the null distribution P0 = P0ξ involves an unknown nuisance parameter ξ ∈ Ξ. The objective is then to test the two hypotheses
$$H_0: Q \in \mathcal{P}_0, \qquad H_1: Q \in \mathcal{P}_1,$$
where the sets of distributions under the null and alternative hypotheses equal
$$\mathcal{P}_0 = \{P_{0\xi};\; \xi \in \Xi\}$$
and (33), respectively.

4.2.1. One Sample Available

The actinfo
$$I_Q^+ = I_Q^+(\xi) = \log \frac{Q(A)}{P_{0\xi}(A)}$$
cannot be consistently estimated if only one sample (31) is available. The best course of action is thus to estimate a lower bound
$$\hat{I}^+ = \hat{I}_n^+ = \log \frac{\hat{Q}(A)}{P_0^{\max}(A)}$$
of I + , with P 0 max ( A ) defined in (4) and Q ^ ( A ) an estimate of Q ( A ) . This estimator will have an asymptotic bias
$$B = I_Q^+(\xi^*) - I_Q^+ = \log \frac{P_{0\xi}(A)}{P_0^{\max}(A)} \le 0,$$
where ξ * is the nuisance parameter that maximizes P 0 ξ ( A ) [43]. For the numerator of (53), either the nonparametric estimate of Q ( A ) in (37) can be used, or a parametric class
$$\mathcal{P} = \{P_{\theta\xi};\; \theta \in \Theta,\; \xi \in \Xi\}$$
of distributions can be used that involves a tuning parameter vector θ and a vector of nuisance parameters ξ. If Q is thought to be close to 𝒫, the parametric estimate
$$\hat{Q}(A) = P_{\hat{\theta}\hat{\xi}}(A)$$
of Q ( A ) is used, which generalizes (43), with
$$(\hat{\theta}, \hat{\xi}) = \arg\max_{\theta, \xi} \sum_{i=1}^{n} \log P_{\theta\xi}(X_i).$$
When the sample size n tends towards infinity, the estimator (56) will converge to
$$(\theta^*, \xi^*) = \arg\min_{\theta, \xi} D_{KL}(Q \,\|\, P_{\theta\xi}).$$
The following result is an extension of Propositions 4 and 5, when nuisance parameters ξ are added and a general type of tuning parameter θ (not necessarily a scalar tilting parameter) is used. A short proof of the proposition is offered in Section 7.
Proposition 6.
Suppose that the null distribution P 0 = P 0 ξ involves an unknown parameter ξ and the actinfo I Q + in (52) is estimated by I ^ n + in (53), using an estimator Q ^ ( A ) of the target probability Q ( A ) that is either nonparametric (37) or parametric (55). Given these assumptions, I ^ n + is an asymptotically normal estimator, in the sense that
$$\sqrt{n}\,(\hat{I}_n^+ - I_Q^+ - B) \xrightarrow{\;\mathcal{L}\;} N(0, V) \quad \text{as } n \to \infty.$$
The asymptotic bias B in (58) is defined in (54). The asymptotic variance V is given by (39) for the nonparametric estimator of I_Q+, whereas
$$V = E[\psi_{\theta^*\xi^*}(X) \mid X \in A]\; E[\psi'_{\theta^*\xi^*}(X)]^{-1}\, E[\psi^T_{\theta^*\xi^*}(X)\, \psi_{\theta^*\xi^*}(X)]\; E[(\psi'_{\theta^*\xi^*})^T(X)]^{-1}\; E[\psi_{\theta^*\xi^*}(X) \mid X \in A]^T$$
for the parametric estimator of I_Q+, with ψθξ(x) = d log Pθξ(x)/d(θ, ξ) the score function, ψ′θξ(x) its derivative with respect to (θ, ξ), (θ*, ξ*) defined as in (57), and T referring to matrix transposition. Moreover, the significance level of the test (36) of FT, with threshold I_min, satisfies
$$\lim_{n \to \infty} \frac{\log P_{0\xi}\left(\hat{I}_n^+ \ge I_{\min}\right)}{n} = -C,$$
with
$$C = p_{\min} e^{B} \log \frac{p_{\min} e^{B}}{P_{0\xi}(A)} + (1 - p_{\min} e^{B}) \log \frac{1 - p_{\min} e^{B}}{1 - P_{0\xi}(A)}$$
for the nonparametric version of the test, with p_min = P0ξ(A) exp(I_min). For the parametric version of the FT test, in the special case when θ is a scalar exponential tilting parameter, C is given by (49), with P_min = P_θmin,ξ and θmin the solution of P_θmin,ξ(A) = p_min e^B.
Remark 2.
The negative bias term B makes the test of FT in Proposition 6 more conservative than the tests in Propositions 4 and 5. This can be seen, for instance, by comparing the two large deviation rates C in (41) and (61). The rate in (61) is larger, since p_min is multiplied by a factor e^B. This corresponds to the fact that it is more difficult to falsely reject H0 in Proposition 6.

4.2.2. Two Samples Available

In addition to the first sample (31), suppose a second sample
$$X_{01}, \ldots, X_{0 n_0} \sim P_{0\xi}$$
of n 0 i.i.d. observations under the null distribution is available. A consistent estimator
$$\hat{I}^+ = \hat{I}_{n n_0}^+ = \log \frac{\hat{Q}(A)}{P_{0\hat{\xi}}(A)}$$
of I Q + in (52) is then available, with
$$\hat{\xi} = \arg\max_{\xi} \sum_{i=1}^{n_0} \log P_{0\xi}(X_{0i}).$$
The following result provides asymptotic properties of the estimator (63) of actinfo, and the corresponding test (36) of FT with threshold I min (cf. Section 7 for a proof):
Proposition 7.
Suppose that the null distribution P0 = P0ξ involves an unknown nuisance parameter ξ, and that the active information I_Q+ in (52) is estimated by Î_nn0+ in (63), making use of two samples (31) and (62), of sizes n and n0, from Q and P0ξ, respectively. Further assume that the estimator Q̂(A) of Q(A) is either nonparametric (37) or parametric (55). If n, n0 → ∞ in such a way that
$$\frac{n}{n_0} \to \lambda > 0,$$
then
$$\sqrt{n}\,(\hat{I}_{n n_0}^+ - I_Q^+) \xrightarrow{\;\mathcal{L}\;} N(0, V_1 + \lambda V_2),$$
where
$$V_2 = E[\psi_{\xi}(X) \mid X \in A]\; E[\psi_{\xi}^T(X)\, \psi_{\xi}(X)]^{-1}\; E[\psi_{\xi}(X) \mid X \in A]^T,$$
and ψξ(x) = d log P0ξ(x)/dξ. If the nonparametric estimator of Q(A) is used, then V_1 equals V in (39), whereas if the parametric estimator of Q(A) is used, then V_1 equals V in (59). The significance level of the test (36) of FT, with threshold I_min, satisfies the same type of large deviation result (60) as in Proposition 6, for the nonparametric and parametric versions of the test (in the latter case assuming that θ is a scalar tilting parameter), but with the bias term B = 0 in the definitions of the nonparametric and parametric large deviation rates C.

5. Examples

In this section, we provide five examples. The first example, from cosmology, is a continuation of Section 1.1, with specificity corresponding to a universe that permits life. The second example of student learning was introduced in Section 1.2, with specificity being the test score of a student who prepares for a test. The third example concerns reinforcement learning, with specificity being the cumulative reward of a certain trajectory of actions and environments. The last two examples concern evolutionary algorithms for generating molecular machines, with specificity corresponding to the functionality or fitness of these machines. These evolutionary algorithms can be viewed as extensions or variants of the Metropolis–Hastings algorithms of Section 3.2, where proposed moves correspond to mutations, whereas accepted moves correspond to mutations that survive and then possibly spread to a whole population.
Example 1
(Cosmology [26,27]). Suppose that there is a positive constant of nature X ∈ Ω = ℝ+, a life-permitting interval A ⊂ Ω, and a specificity function (6) that equals 1 inside A = (a, b) and zero elsewhere. The maximum entropy distribution under a first moment constraint ξ = E(X) is exponential with expected value ξ. Consequently,
$$P_{0\xi}(A) = \frac{1}{\xi} \int_a^b e^{-x/\xi}\, dx.$$
The null and alternative hypotheses for the fine-tuning test are given in (50), where under H1, the agent brings about a life-permitting value of X with probability 1 (P(A) = 1). Only one universe is observed, with a value X = X_1 of the constant. Therefore, there is a sample (31) of size n = 1, whereas no null sample (62) is available. Since X_1 ∈ A is life-permitting, Q̂(A) = 1. The estimate (53) of actinfo then simplifies to
$$\hat{I}^+ = \log \frac{1}{P_0^{\max}(A)} = -\log P_0^{\max}(A).$$
Let x = (a + b)/2 be the midpoint of the LPI and suppose that half of its relative size, ε = (b − a)/(2x), is small. The probability in (68) is then approximated by
$$P_0^{\max}(A) \approx (b - a) \max_{\xi > 0} \frac{e^{-x/\xi}}{\xi} = 2\epsilon e^{-1}.$$
From (68), the estimated actinfo
$$\hat{I}^+ \approx 1 - \log(\epsilon) - \log(2)$$
is a monotone decreasing function of ϵ .
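The following sketch checks this approximation numerically for a hypothetical LPI (the interval and all numbers are invented for illustration):

```python
import numpy as np

# Example 1 numerically: P_{0 xi}(A) = exp(-a/xi) - exp(-b/xi) is maximized
# over the nuisance parameter xi on a grid, and the resulting actinfo bound is
# compared with the small-epsilon approximation I+ ~ 1 - log(eps) - log(2).
a, b = 9.9, 10.1                                   # hypothetical LPI
eps = (b - a) / (a + b)                            # (b - a)/(2 x), x the midpoint

xis = np.linspace(0.01, 100.0, 1_000_000)
P0_xi_A = np.exp(-a / xis) - np.exp(-b / xis)      # exact P_{0 xi}(A)
print(f"exact:  I+ = {-np.log(P0_xi_A.max()):.4f}"
      f"  (maximizing xi ~ {xis[P0_xi_A.argmax()]:.2f})")
print(f"approx: I+ = {1 - np.log(eps) - np.log(2):.4f}")
```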
Example 2
(Evaluation of student test scores [44]). As a generalization of the example given in Section 1.2, suppose that a number of students take a test. Let x = (z, y) = (z_1, …, z_{d−1}, y) ∈ ℝ^d summarize the characteristics of a student, with covariates z that are used to predict the outcome y of the test. The specificity function f(x) = x_d = y equals the student’s test score, and (5) corresponds to the set of students that pass the test, with a minimally allowed score of f0. The population of students follows a (d − 1)-dimensional multivariate normal distribution Z ∼ N(m, Σ), where m = (m_1, …, m_{d−1}) and Σ = (σ_jk)_{j,k=1}^{d−1} are known. The conditional distribution of the response follows a multiple linear regression model
$$Y \mid Z = z \sim N\left(\xi_0 + \sum_{j=1}^{d-1} \xi_j z_j + t\Big(\theta_0 + \sum_{j=1}^{d-1} \theta_j z_j\Big),\; \sigma^2\right),$$
for a student with covariate vector z who prepared for the test for a period of length t. The nuisance parameter vector ξ = (ξ_0, …, ξ_{d−1}, σ²) involves the error variance and the regression parameters for students who did not train for the test, whereas the tuning parameter vector θ = (θ_0, …, θ_{d−1}) involves the regression parameters that correspond to the effect of preparing for the test. The unconditional distribution of the response is normal, Y ∼ N(μ, V), with
$$\mu = \mu(\theta, \xi, t) = (\xi_0 + t\theta_0) + \sum_{j=1}^{d-1} (\xi_j + t\theta_j)\, m_j, \qquad V = V(\theta, \xi, t) = \sigma^2 + \sum_{j,k=1}^{d-1} (\xi_j + t\theta_j)(\xi_k + t\theta_k)\, \sigma_{jk}.$$
Therefore, the probability that a randomly chosen student who studied for the test for a period of length t passes is
$$P(A) = P_{\theta\xi t}(A) = P(Y \ge f_0) = 1 - \Phi\left(\frac{f_0 - \mu}{\sqrt{V}}\right),$$
where Φ is the cumulative distribution function of a standard normal distribution. The null distribution P 0 = P 0 ξ corresponds to putting t = 0 in (69). Thus, the actinfo
$$I^+ = I^+(\theta, \xi, t) = \log \frac{1 - \Phi\left((f_0 - \mu(\theta, \xi, t))/\sqrt{V(\theta, \xi, t)}\right)}{1 - \Phi\left((f_0 - \mu(0, \xi, 0))/\sqrt{V(0, \xi, 0)}\right)}$$
quantifies how much learning, during a period of length t, increases the probability of passing the test. To compute an estimate I ^ + of I + in (70), estimates ξ ^ and θ ^ of ξ and θ are needed. This can be achieved by collecting two training samples, as in (63). Another option is to compute the least squares estimates ( ξ ^ , θ ^ ) of the nuisance and the tuning parameters jointly, without bias, from one single dataset { ( t i , z i , y i ) } i = 1 n , provided that the time periods t i vary, so that all parameters are identifiable.
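A numerical sketch of (70) for a single covariate (d = 2) and illustrative parameter values is given below; it shows how the actinfo grows with the study time t:

```python
import numpy as np
from math import erf, sqrt

# Actinfo (70) for d = 2: one covariate with mean m1 and variance s11; all
# parameter values are illustrative. t = 0 recovers the null passing
# probability, so I+(theta, xi, 0) = 0.
m1, s11 = 0.0, 1.0                  # covariate mean and variance
xi0, xi1, sigma2 = 0.5, 0.3, 1.0    # nuisance parameters (no studying)
th0, th1 = 0.4, 0.1                 # tuning parameters (effect of studying)
f0 = 2.0                            # minimum passing score

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pass_prob(t):
    mu = (xi0 + t * th0) + (xi1 + t * th1) * m1          # mean in (70)
    V = sigma2 + (xi1 + t * th1) ** 2 * s11              # variance in (70)
    return 1.0 - Phi((f0 - mu) / sqrt(V))

for t in [0, 1, 2, 4]:
    print(f"t = {t}: P(pass) = {pass_prob(t):.3f}, "
          f"I+ = {np.log(pass_prob(t) / pass_prob(0)):.3f}")
```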
Example 3
(Reinforcement learning (RL) [45]). Consider an agent whose purpose is to maximize the reward f(x) of a trajectory x that they, to some extent, will be able to control, for a time period of length t. At each time point u, there are m possible environments S = {s_1, …, s_m} and q possible actions A = {a_1, …, a_q} to take. The state space X = A^t × S^{t+1} consists of all possible trajectories
$$x = (a_0, \ldots, a_{t-1}, s_0, \ldots, s_t)$$
of environments and actions, where s u is the environment and a u the action taken at time u. A corresponding random trajectory is denoted with capital letters
$$X = (A_0, \ldots, A_{t-1}, S_0, \ldots, S_t).$$
If the environment of the system is S_u = s at time u, and action A_u = a is taken, the probability of moving to environment s′ is P_a(s, s′) = P(S_{u+1} = s′ | S_u = s, A_u = a), with an instantaneous reward of R_a(s, s′). If future rewards are discounted by a factor γ, the total reward, over a time horizon of length t, is
$$f(x) = \sum_{u=0}^{t-1} R_{a_u}(s_u, s_{u+1})\, \gamma^u.$$
Let f0 be a lower bound for a trajectory’s total discounted reward to be acceptable, so that A in (5) is the set of all acceptable trajectories. The agent takes actions according to some policy to make the expected total reward of a trajectory as large as possible. To this end, consider stationary policies, where the action A_u taken by the agent at each time point u is only determined by the current environment s_u, according to some matrix Π = (π(s, a); s ∈ S, a ∈ A) of transition probabilities π(s, a) = P(A_u = a | S_u = s). For a completely random policy
$$\pi(s, a) = \xi_a, \qquad a = 1, \ldots, q,$$
the action is not influenced by the current environment, and the policy is completely specified by the vector ξ = (ξ_1, …, ξ_q) of nuisance parameters. Thus, P0(A) = P0ξt(f(X) ≥ f0) is the probability that an ignorant agent, with a policy determined by ξ, will have an acceptable trajectory. An agent who knows the reward function R_a and the dynamics P_a of the environment will try to take this knowledge into account to formulate a policy that makes the reward as large as possible. A deterministic policy θ : S → A is a function that takes a unique action for each environment, so that
$$\pi(s, a) = 1\{a = \theta(s)\}.$$
Thus, P(A) = Pθt(f(X) ≥ f0) is the probability that an agent with deterministic policy θ obtains an acceptable trajectory. The active information
$$I^+ = I^+(\theta, \xi, t) = \log \frac{P_\theta\left(\sum_{u=0}^{t-1} R_{A_u}(S_u, S_{u+1})\, \gamma^u \ge f_0\right)}{P_{0\xi}\left(\sum_{u=0}^{t-1} R_{A_u}(S_u, S_{u+1})\, \gamma^u \ge f_0\right)}$$
quantifies, on a logarithmic scale, how much more likely it is for an agent with policy θ to obtain an acceptable trajectory, compared to an ignorant agent with policy ξ. The values of ξ and θ are varied during the exploration phase of RL, but they are assumed to be known during the exploitation phase. Suppose that we want to compute the actinfo (71) during the exploitation phase. Since P0(A) and P(A) are typically unknown, they have to be estimated by Monte Carlo. To this end, assume we have two samples (31) and (62), of n and n0 trajectories, available from Pθt and P0ξt, respectively. Then, Î+ in (63) can be used to estimate the actinfo (71).
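The following Monte Carlo sketch carries this out for a tiny two-environment, two-action MDP whose dynamics and rewards are invented for illustration; it compares a completely random policy with a deterministic one:

```python
import numpy as np

# Monte Carlo sketch of the actinfo (71). The dynamics P_a(s, s') and rewards
# R_a(s, s') are illustrative; action 0 has the larger expected reward, so the
# deterministic policy theta(s) = a_0 should beat the random policy.
rng = np.random.default_rng(5)
P_a = np.array([[[0.9, 0.1], [0.2, 0.8]],      # transition probs for action 0
                [[0.5, 0.5], [0.5, 0.5]]])     # ... and for action 1
R_a = np.array([[[1.0, 0.0], [0.0, 2.0]],      # rewards for action 0
                [[0.5, 0.5], [0.5, 0.5]]])     # ... and for action 1
gamma, t_len, f0, n = 0.9, 20, 7.0, 10_000

def total_rewards(policy, n_traj):
    """Simulate n_traj trajectories; policy[s] gives the action probabilities."""
    totals = np.zeros(n_traj)
    for i in range(n_traj):
        s = 0
        for u in range(t_len):
            a = rng.choice(2, p=policy[s])
            s_next = rng.choice(2, p=P_a[a, s])
            totals[i] += R_a[a, s, s_next] * gamma**u
            s = s_next
    return totals

random_policy = np.array([[0.5, 0.5], [0.5, 0.5]])   # xi_a = 1/2 (H0)
det_policy = np.array([[1.0, 0.0], [1.0, 0.0]])      # theta(s) = a_0 (H1)
P0_hat = np.mean(total_rewards(random_policy, n) >= f0)
P_hat = np.mean(total_rewards(det_policy, n) >= f0)
print(f"P0(A) ~ {P0_hat:.3f}, P(A) ~ {P_hat:.3f}, I+ ~ {np.log(P_hat / P0_hat):.3f}")
```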
Example 4
(Molecular machines and Moran models [15,30,41]). Suppose that Ω consists of all 2^d binary sequences x = (x_1, …, x_d) of length d, with a null distribution P0(x) that will be chosen below. The specificity function f is defined as
$$f(x) = \begin{cases} a|x|, & x \ne (1, \ldots, 1), \\ 1, & x = (1, \ldots, 1), \end{cases}$$
where |x| = Σ_{i=1}^d x_i and a ≤ 1/d is a fixed parameter. We regard x as a molecular machine with d parts, with x_i = 1 or 0 depending on whether part i functions or not. The specificity f(x) quantifies how well the machine works, for instance, its ability to regulate activity in vitro or in vivo in a living cell. It is assumed that f(x) is determined by the number |x| of functioning parts, with a maximal value f_max = f(1, …, 1) = 1. Using (8), the most stringent definition of high specificity, it follows that A = {(1, …, 1)} only contains one element, a molecular machine for which all parts are in shape. The parameter a is crucial. If 0 < a ≤ 1/d, a molecular machine works better the more parts are in shape. On the other hand, if a < 0, then a molecular machine with some parts in shape, but not all, functions worse the more parts are in shape, since all units must work in order for the whole machine to function, and there is a cost associated with carrying each part that is in shape, as long as the whole system does not function.
Each state x is interpreted as a population of N subjects, all having the same variant x of the molecular machine. With this interpretation, X = X_t is the outcome of a random evolutionary process where all subjects of the population, at any time point t, have the same state. However, this state may vary over time when all subjects of the population simultaneously experience the same change. The question of interest is whether this process can modify the population so that all its members have a functioning molecular machine. A transition of this process from x is caused by a mutation with distribution q(x, ·), where q(x, x) = 0. Suppose a mutation from x to y is possible, i.e., q(x, y) > 0. A mutation from x to y first occurs in one individual, and then it either (momentarily) dies out, with probability 1 − αθ(x, y), or it (momentarily) spreads to the whole population (becomes fixed), with probability
$$\alpha_\theta(x, y) = C \cdot \left(\frac{e^{\theta f(y)} P_0(y)\, q(y, x)}{e^{\theta f(x)} P_0(x)\, q(x, y)}\right)^{1/2},$$
where
$$C = \left(\max_{x, y} \frac{e^{\theta f(y)} P_0(y)\, q(y, x)}{e^{\theta f(x)} P_0(x)\, q(x, y)}\right)^{-1/2}$$
is a constant assuring that (73) never exceeds 1, where the maximum is taken over all x, y such that x ≠ y and both q(x, y) and q(y, x) are positive. The Markov chain with transition probabilities (14) and acceptance probability (73) represents the dynamics of the evolutionary process.
As shown in Section 7, the equilibrium distribution of this Markov chain is given by Pθ in (10). In particular, Propositions 2 and 3 remain valid when the Markov chain (14) with acceptance probabilities (73) is used, rather than (15). We will interpret
$$s(x) = e^{\theta f(x)/N}$$
as the selection coefficient or fitness of individuals with a molecular machine of type x, that is, s ( x ) is proportional to the fertility rate of individuals of type x.
The MH-type Markov chain with acceptance probabilities (73) and (74) represents an evolutionary process that closely resembles a Moran model with selection [46,47,48], which is frequently used for describing evolutionary processes (as can be seen in Section 7). The Moran model is a continuous time Markov chain for a population with overlapping generations, where individuals die at the same rate and are replaced by the offspring of individuals in the population proportionally to their selection coefficients s(x). New types arise when an offspring of a parent of type x mutates, with probability μ(x). If the mutation rate is small (μ(x) ≪ N^−1 for all x ∈ Ω), then to a good approximation the whole population will have the same type at any point in time, which is a so-called fixed state assumption.
Even though the Moran model is specified in continuous time, time can be discretized as t = 0, 1, 2, … by only recording the population when individuals die. If individuals die at rate 1, then the next individual dies at rate N, so that time is counted in units of N^−1 generations. The fixed state assumption is motivated by assuming that a newborn offspring with a new mutation either dies out or spreads to the whole population (becomes fixed in the population) right after birth. In this context, q corresponds to the way in which mutations change the type of the individual, whereas αθ = αθN is the probability of fixation. If q(x, y) is the conditional probability that an offspring of a type x parent mutates to y, given that a mutation occurs, then the proposal kernel of the Moran model is
$$q_{\mathrm{Moran}}(x, y) = \begin{cases} \mu(x)\, q(x, y), & x \ne y, \\ 1 - \mu(x), & x = y. \end{cases}$$
As shown in Section 7, the acceptance (or fixation) probability of the Moran model is
$$\alpha_{\theta N}^{\mathrm{Moran}}(x, y) \approx \frac{1}{N}\left(1 + \frac{\theta[f(y) - f(x)]}{2}\right) \approx \frac{1}{N}\left(\frac{e^{\theta f(y)}}{e^{\theta f(x)}}\right)^{1/2}$$
when θ[f(y) − f(x)] is small. From (76) and (77), the Moran model approximates the Metropolis–Hastings kernel with acceptance probabilities (73) and (74) with good accuracy when (i) μ(x) ≡ μ; (ii) P0 is uniform; and (iii) the proposal kernel q is symmetric (i.e., q(x, y) = q(y, x)), although the time scales of the two processes are different. More specifically, if (i)–(iii) hold, a time-shifted version of the Moran model approximates the MH-type model with acceptance probabilities (73) and (74), so that each time step of the MH-type Markov chain corresponds to C/μ generations of the Moran model. However, even under assumptions (i)–(iii), the stationary distribution of the Moran model differs slightly from Pθ.
The proposal kernel q(x, y) is assumed to be local and to satisfy
$$q(x, y) = \begin{cases} b/[|x| + b(d - |x|)], & y = x + e_j,\; x_j = 0, \\ 1/[|x| + b(d - |x|)], & y = x + e_j,\; x_j = 1, \\ 0, & \text{otherwise}, \end{cases}$$
where e_j = (0, …, 0, 1, 0, …, 0) is a row vector of length d with a 1 in position j ∈ {1, …, d} and zeros elsewhere, whereas x + e_j refers to component-wise addition modulo 2, corresponding to a switch of component j of x. A change of component j from 0 to 1 is caused by a beneficial mutation, whereas a change from 1 to 0 corresponds to a deleterious mutation. Consequently, b > 0 is the ratio between the rates at which beneficial and deleterious mutations occur.
The kernel q in (78) is symmetric only when beneficial and deleterious mutations have the same rate (b = 1). The more general case of asymmetric q is handled differently by the MH-type algorithm and the Moran model. Whereas the MH-type algorithm elevates the acceptance probability (73) of seldom-proposed states y (those y for which q(x, y) is small for many x), this is not the case for the acceptance probability (77) of the Moran model. To avoid these states y being reached too often by the MH-type algorithm, the null distribution P0 of no selection has to be chosen so that P0(y) is small for rarely proposed states (whereas the Moran model needs no such correction). Therefore, P0 in (73) will be chosen as the stationary distribution of a transition kernel (14) for which θ = 0 and all candidates are accepted (α_0(x, y) = 1). That is, if Π̃_0 refers to the transition matrix of such a Markov chain, the initial distribution P0 in (17) is chosen as the solution of
$$P_0 = P_0 \tilde\Pi_0, \qquad \sum_{x\in\Omega} P_0(x) = 1. \tag{79}$$
The null distribution $P_0 = P_{0b}$ in (79) involves a single nuisance parameter $\xi = b$. In the special case when beneficial and deleterious mutations have the same rate ($b = 1$), this procedure generates a uniform distribution $P_0(x) \equiv 2^{-d}$. On the other hand, states $x$ with many functioning parts are harder to reach by the Markov process $\tilde\Pi_0$ when beneficial mutations occur less frequently than deleterious ones ($0 < b < 1$), resulting in smaller values of $P_0(x)$. The distribution under the alternative hypothesis, $P = P_{\tilde\theta b t}$, involves the nuisance parameter $b$, the time point $t$ at which the state of the population is recorded, and $\tilde\theta = (a,\theta)$, the two parameters that determine how much background information the MH-type evolutionary algorithm makes use of. For simplicity, $a$ and $b$ are here regarded as constants, and we only include $\theta$ and $t$ in the notation. This gives rise to an active information
$$I^+(\theta,t) = \log \frac{P_\theta\!\left(X_t = (1,\ldots,1)\right)}{P_0\!\left(X_t = (1,\ldots,1)\right)}. \tag{80}$$
The MH-type algorithm is studied for $d = 5$ and illustrated in Figure 1, Figure 2 and Figure 3. Note that the functional information $I_{f_0}$ is a decreasing function of $b$, since it is more surprising to find a working molecular machine by chance when the rate $b$ of beneficial mutations is small. Moreover, the active information $I^+(\theta) = \lim_{t\to\infty} I^+(\theta,t)$ for the equilibrium distribution of the Markov chain, as well as the active information $I^+(\theta,t)$ and $I_s^+(\theta,t)$ for a system in non-equilibrium, without and with stopping, are increasing functions of $\theta$ and decreasing functions of $a$ and $b$. The smaller $a$ or $b$ is, the more external information can be infused to increase the probability of reaching the fine-tuned state $(1,\ldots,1)$ of a working molecular machine. When $a$ is small, leaving this state once it is reached becomes more difficult, and consequently $I_s^+(\theta,t)$ is only marginally larger than $I^+(\theta,t)$.
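To make the construction of Example 4 concrete, the following Python sketch builds the chain on $\{0,1\}^d$ from the proposal kernel (78), obtains the null distribution (79) as the stationary distribution of the chain that accepts all proposals, and then iterates an MH-type kernel with the square-root acceptance rule used in the equilibrium argument of Section 7 (Equation (109)). Two simplifying assumptions are ours, not the paper's: the specificity is taken to be $f(x) = |x|$, and the parameter $a$ of (74) is omitted; the constant $C$ is chosen so that all acceptance probabilities stay below 1.

```python
import itertools
import numpy as np

d = 5                        # number of machine components
theta, b = 2.5, 0.5          # tilting parameter and beneficial/deleterious rate ratio
states = [np.array(s) for s in itertools.product([0, 1], repeat=d)]
m = len(states)              # |Omega| = 2^d
f = np.array([s.sum() for s in states], dtype=float)  # assumed specificity: working parts

def q(x, y):
    """Proposal kernel (78): flip one component; 0->1 flips carry weight b."""
    if np.sum(x != y) != 1:
        return 0.0
    j = int(np.flatnonzero(x != y)[0])
    return (b if x[j] == 0 else 1.0) / (x.sum() + b * (d - x.sum()))

Q = np.array([[q(x, y) for y in states] for x in states])

# Null distribution (79): stationary distribution of the all-accepting chain.
w, V = np.linalg.eig(Q.T)
P0 = np.abs(np.real(V[:, np.argmax(np.real(w))]))
P0 /= P0.sum()

# Square-root acceptance probabilities as in (109); the parameter a of (74) is omitted.
with np.errstate(divide="ignore", invalid="ignore"):
    R = np.sqrt(np.exp(theta * (f[None, :] - f[:, None]))
                * (P0[None, :] * Q.T) / (P0[:, None] * Q))
R[~np.isfinite(R)] = 0.0
Pi = Q * (R / R.max())                        # C = 1/max(R) keeps probabilities <= 1
np.fill_diagonal(Pi, 1.0 - Pi.sum(axis=1))    # rejected proposals stay put

target = m - 1                                # index of the fine-tuned state (1,...,1)
Pt = P0.copy()
for t in range(1, 201):
    Pt = Pt @ Pi
    if t in (10, 50, 200):
        print(f"t = {t:3d}: I+(theta,t) = {np.log(Pt[target] / P0[target]):.3f}")
```

In line with the discussion above, raising $\theta$ (or $b$) in this sketch makes the fine-tuned state easier to reach, and $I^+(\theta,t)$ grows with $t$ towards its equilibrium value.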
Example 5
(Evolutionary programming algorithms). Suppose that $\Omega = \Omega_{\mathrm{ind}}^N$ is a set of genetic variants from some genomic region, $x = (x_1,\ldots,x_N)$, for the members of a population of size $N$. That is, $x_k \in \Omega_{\mathrm{ind}}$ is the variant of this genomic region for individual $k$. If, for instance, the region codes for the molecular machine of Example 4, we let $x_k = (x_{k1},\ldots,x_{kd}) \in \{0,1\}^d = \Omega_{\mathrm{ind}}$, with $x_{kj} = 1$ or 0 depending on whether component $j$ of this machine works for individual $k$ or not. Let $g(x_k)$ be the biological fitness, or expected number of offspring, of individual $k$. In the context of molecular machines, the logarithm of $g(x_k)$ could be a function of the number of functioning parts of a machine of type $x_k$. The specificity function of a population in state $x$ is the average fitness
$$f(x) = \frac{1}{N}\sum_{k=1}^N g(x_k)$$
of its individuals. The targeted set $A$ in (5) corresponds to all genetic profiles with an average fitness of at least $f_0$. This type of model is frequently used in genetic programming, as well as in other types of evolutionary programming algorithms, to mimic the evolution of $N$ individuals over time [49,50]. Typically, the output $X = X_t$ of the evolutionary algorithm is the last step of a simulation $\{X_s = (X_{s1},\ldots,X_{sN})\}_{s=0}^t$ of the population over $t$ generations. Once the distributions $P_0 = P_{0\xi t}$ and $P = P_{\theta\xi t}$ of $X$ are found under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, the actinfo $I^+$ can be computed according to (1). This actinfo quantifies, on a logarithmic scale, how much more likely it is for the average fitness of the population to exceed $f_0$ at time $t$ for a population with externally infused information ($H_1$), compared to an evolutionary process where no such external information is used ($H_0$). For instance, if a molecular machine needs all its parts in order to function ($g(x_k) = 1(|x_k| = d)$), then the actinfo at time $t$ equals
$$I^+ = I^+(\theta,\xi,t) = \log\frac{P_{\theta\xi t}\!\left(\left|\{k;\; 1\le k\le N,\; X_k = (1,\ldots,1)\}\right| \ge f_0 N\right)}{P_{0\xi t}\!\left(\left|\{k;\; 1\le k\le N,\; X_k = (1,\ldots,1)\}\right| \ge f_0 N\right)}, \tag{81}$$
with $X = (X_1,\ldots,X_N)$. Since the state space $\Omega$ is very large, it is often difficult to find explicit analytical expressions for the actinfo $I^+$ in (81). Suppose, however, that the nuisance parameters $\xi$ of the null distribution $P_0 = P_{0\xi}$ are known. This makes the framework of Section 4.1 applicable, running the evolutionary algorithm $n$ times. That is, $n$ i.i.d. copies $\{X_{is}\}_{s=0}^t$ of the population trajectory are generated up to time $t$ for $i = 1,\ldots,n$. Then, $X_i = X_{it} = (X_{it1},\ldots,X_{itN})$, $i = 1,\ldots,n$, are used for computing an estimate $\hat I_n^+$ of the actinfo and for testing fine-tuning, according to Section 4.1.
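As a brief illustration of this Monte Carlo strategy, the sketch below evaluates the nonparametric estimate $\hat I_n^+ = \log[\hat Q(A)/P_0(A)]$ of (34), based on the number of runs whose final population reached $A$, together with a Wald-type confidence interval using the asymptotic variance $V = [1 - Q(A)]/Q(A)$ from the proof of Proposition 4. The run counts and the null probability $P_0(A)$ are hypothetical numbers chosen for illustration only.

```python
import numpy as np
from scipy.stats import norm

def actinfo_estimate(hits, n, p0_A, alpha=0.05):
    """Nonparametric actinfo estimate (34) with a Wald confidence interval
    based on the asymptotic variance V = (1 - Q(A)) / Q(A) of Proposition 4."""
    q_hat = hits / n                   # empirical probability of reaching A
    i_hat = np.log(q_hat / p0_A)       # estimated active information
    v_hat = (1.0 - q_hat) / q_hat      # plug-in asymptotic variance
    half = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(v_hat / n)
    return i_hat, (i_hat - half, i_hat + half)

# Hypothetical numbers: 130 of n = 1000 runs reached A, with P0(A) = 0.02 assumed known.
est, ci = actinfo_estimate(hits=130, n=1000, p0_A=0.02)
print(f"I+ estimate {est:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```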
Recall the fixed state assumption of Example 4, whereby all individuals of the population have the same state at any point in time. Such an assumption is only realistic when $N\mu \ll 1$, that is, when the mutation rate $\mu$ and/or the population size $N$ is small. This corresponds to a scenario where $P_0$ and $P$ put all their probability mass along the diagonal
$$\Omega_{\mathrm{diag}} = \{x \in \Omega;\; x_1 = \cdots = x_N\} \tag{82}$$
of $\Omega$. Since (82) is equivalent to the reduced state space $\Omega_{\mathrm{ind}}$, the fixed state assumption greatly simplifies the analysis. For instance, it often makes it possible to find analytical expressions for the actinfo $I^+$, rather than having to estimate it.

6. Discussion

In this article, a general statistical framework is provided for using active information to quantify the amount of pre-specified external knowledge an algorithm makes use of, or equivalently, how tuned the algorithm is. The theory is based on quantifying, for each state $x$, how specified it is by means of a real-valued function $f(x)$. An algorithm with external information either makes direct use of knowledge of $f$, or at least incorporates knowledge that tends to move the output of the algorithm towards more specified regions. The Metropolis–Hastings Markov chain directly incorporates knowledge of $f$ in terms of the acceptance probability of proposed moves. The learning ability of this algorithm was analyzed by studying its active information, with or without stopping when the targeted set of highly specified states is reached. When independent outcomes of an algorithm are available, nonparametric and parametric estimators of its actinfo were also developed, together with nonparametric and parametric tests of FT.
This work can be extended in different ways. A first extension is to find conditions under which the actinfo $I^+(\theta,t)$ of a stochastic algorithm, based on a random start (according to the null distribution of a non-guided algorithm) followed by $t$ iterations of the Metropolis–Hastings Markov chain (without stopping), is a non-decreasing function of $t$. We conjecture that this is typically the case, but have not obtained any general conditions on the distribution $q$ of proposed candidates for this result to hold.
A second extension is to widen the notion of specificity, so that not only the functionality $f(x)$ but also the rarity $P_0(x)$ of the outcome $x$ under the null distribution is taken into account. A class of such specificity functions is
$$g_\theta(x) = \theta f(x) - \log P_0(x), \tag{83}$$
where $\theta > 0$ is a parameter that controls the tradeoff between scenarios where either functionality or rarity under the null is the most important determinant of specificity. The case $\theta = 0$ in (83) corresponds to the functionality having no impact, so that $g_0(x)$ reduces to Shannon's self-information of $x$. The case $g_1(x)$ was proposed in [15], whereas $g_\theta(x)$ is determined solely by $f(x)$ in the limit of large $\theta$.
A third extension is to generalize the notion of actinfo so that it includes not only the probability of reaching a targeted set $A$ of highly specified states under $H_0$ and $H_1$, but also accounts for the conditional distribution of the states within $A$, given that $A$ has been reached. This is related to the way in which functional sequence complexity generalizes functional information [51,52,53,54]. Let $H(Q) = -\sum_x Q(x)\log[Q(x)]$ refer to the Shannon entropy of a distribution $Q$, whereas $H(Q^A)$ is the Shannon entropy of the corresponding conditional distribution $Q^A(x) = Q(x \mid A)$, given that $A$ has been reached. The functional sequence complexity
$$\mathrm{FSC}_0 = H(P_0) - H(P_0^A) = E_{P_0}\!\left\{\log[P_0^A(X)] \,\middle|\, X\in A\right\} - E_{P_0}\!\left\{\log[P_0(X)]\right\}$$
is the reduction in entropy under the null hypothesis $H_0$, when passing from all states in $\Omega$ to the highly specified states in $A$. $\mathrm{FSC}_0$ reduces to the functional information $I_{f_0}$ when $P_0$ is uniform over $\Omega$. In a similar vein, the active uncertainty reduction is introduced:
$$\mathrm{UR}^+ = \sum_{x\in A} P^A(x)\log P(x) - \sum_{x\in A} P_0^A(x)\log P_0(x) = E_P[\log P(X) \mid X\in A] - E_{P_0}[\log P_0(X) \mid X\in A].$$
Then, $\mathrm{UR}^+ = I^+$ when $P_0^A$ and $P^A$ are uniformly distributed on $A$. This happens, for instance, when $P_0$ has a uniform distribution on $\Omega$, $P = P_\theta$ for some $\theta > 0$, and (8) holds. The properties of $\mathrm{UR}^+$ deserve to be analyzed in more detail, for instance, by investigating how it differs from the actinfo $I^+$.
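For intuition, the following small sketch evaluates $\mathrm{FSC}_0$, $\mathrm{UR}^+$ and $I^+$ on an invented four-state space. Since the conditional distributions $P_0^A$ and $P^A$ are not uniform here, $\mathrm{UR}^+$ and $I^+$ differ, illustrating the remark above.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

P0 = np.array([0.4, 0.3, 0.2, 0.1])          # assumed null distribution on 4 states
P  = np.array([0.1, 0.2, 0.3, 0.4])          # assumed tuned distribution
A  = np.array([False, False, True, True])    # targeted set of specified states

P0A = P0[A] / P0[A].sum()                    # conditional null distribution on A
PA  = P[A]  / P[A].sum()                     # conditional tuned distribution on A

FSC0   = entropy(P0) - entropy(P0A)          # functional sequence complexity
URplus = np.sum(PA * np.log(P[A])) - np.sum(P0A * np.log(P0[A]))
Iplus  = np.log(P[A].sum() / P0[A].sum())    # active information
print(f"FSC0 = {FSC0:.3f}, UR+ = {URplus:.3f}, I+ = {Iplus:.3f}")
```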
A fourth extension would be to apply the concept of actinfo to other genetic models. For instance, Example 4 is, to our knowledge, the first time that actinfo has been applied to the Moran model. In the past, however, actinfo was used to study fixation times for the Wright–Fisher model of population genetics, a model for which time is discrete and generations do not overlap [55].

7. Proofs

Proof of Proposition 1.
Introduce
$$J(\theta) = \sum_{x\in A^c} \exp\{\theta[f(x)-f(x_0)]\}\, P_0(x), \qquad K(\theta) = \sum_{x\in A} \exp\{\theta[f(x)-f(x_0)]\}\, P_0(x) \tag{84}$$
when $\Omega$ is countable, and replace the sums in (84) by integrals when $\Omega$ is continuous. Then
$$P_\theta(A) = \frac{\exp[\theta f(x_0)]\, K(\theta)}{\exp[\theta f(x_0)]\,[J(\theta)+K(\theta)]} = \frac{K(\theta)}{J(\theta)+K(\theta)} = \frac{1}{J(\theta)/K(\theta)+1}. \tag{85}$$
Since $P_0(A) < 1$, it follows that $J(\theta)$ is a strictly decreasing function of $\theta \ge 0$, whereas $K(\theta)$ is a non-decreasing function of $\theta$. From this, it follows that $P_\theta(A)$ is a strictly increasing function of $\theta$, and consequently $I^+(\theta) = \log[P_\theta(A)/P_0(A)]$ is a strictly increasing function of $\theta$ as well.
Moreover, $K(\theta) \ge P_0(A) > 0$ for all $\theta \ge 0$, whereas $J(\theta) \to 0$ as $\theta \to \infty$ follows by dominated convergence. In conjunction with (85), this implies that $P_\theta(A) \to 1$ and $I^+(\theta) \to I_{f_0}$ as $\theta \to \infty$. □
Proof of Proposition 2.
Equation (20) follows from (17), (19) and the fact that
$$P_0(A) = \sum_{x\in A} P_0(x) = P_0 v, \qquad P_{\theta t}(A) = \sum_{x\in A} P_{\theta t}(x) = P_{\theta t}\, v = P_0 \Pi_\theta^t\, v,$$
since $v$ is a column vector of length $|\Omega|$ with ones in positions $x \in A$ and zeros in positions $x \in A^c$.
Equation (21) is equivalent to proving that
$$P_{\theta t}(A) \to P_\theta(A) \quad \text{as } t\to\infty.$$
However, this follows from the fact that $P_\theta$ is the equilibrium distribution of the Markov chain with transition kernel (18). That is, letting $t\to\infty$ in (19), we find that
$$P_{\theta t} = P_0 \Pi_\theta^t \to P_\theta,$$
and therefore
$$P_{\theta t}(A) = P_{\theta t}\, v \to P_\theta\, v = P_\theta(A) \quad \text{as } t\to\infty. \;\square$$
Proof of Proposition 3.
Equation (28) follows from the definitions of $I^+(\theta,t)$ and $I_s^+(\theta,t)$ in (20) and (27), and the fact that
$$P_{\theta t}(A) = P(X_t \in A) \le P(X_{t\wedge T} \in A) = P_{\theta t}^{s}(A),$$
where the inequality is a consequence of the definition of $T$ in (22). Since
$$P_{\theta t}^{s}(A) = P(T \le t) \le P(T \le t+1) = P_{\theta,t+1}^{s}(A),$$
it follows that $I_s^+(\theta,t)$ is non-decreasing in $t$. Equation (29) follows from the definition of $I_s^+(\theta,t)$ and the fact that
$$\lim_{t\to\infty} P_{\theta t}^{s}(A) = P(T < \infty) = 1. \tag{86}$$
The last equality of (86) is a consequence of the fact that the Markov chain with transition kernel $\Pi_\theta$ is irreducible, so that any state $x\in\Omega$ is reached with probability 1. In particular, the targeted set $A$ is reached with probability 1. In order to verify (30), we first deduce
$$P(T > t) = 1 - P_0(A)\, e^{I_s^+(\theta,t)}$$
from (24), and then we make use of the equality
$$E(T) = \sum_{t=0}^\infty P(T > t). \;\square$$
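The tail-sum identity above is easy to verify numerically. The sketch below uses an invented three-state chain in which the targeted set $A$ consists of a single absorbing state, so that $P(X_t \in A) = P(T \le t)$, and compares the tail-sum evaluation of $E(T)$ with the classical fundamental-matrix formula for absorbing Markov chains.

```python
import numpy as np

# Invented three-state chain; the target A = {2} is absorbing, so that
# T is the hitting time of A and P(X_t = 2) = P(T <= t).
Pi = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.5, 0.3],
               [0.0, 0.0, 1.0]])
P0 = np.array([0.5, 0.4, 0.1])            # assumed initial distribution

# Tail-sum formula E(T) = sum_{t >= 0} P(T > t), truncated when the tail is tiny.
Pt, E_tail = P0.copy(), 0.0
for _ in range(100_000):
    tail = 1.0 - Pt[2]                    # P(T > t)
    if tail < 1e-15:
        break
    E_tail += tail
    Pt = Pt @ Pi

# Cross-check with the fundamental matrix (I - S)^(-1) of the transient states.
S = Pi[:2, :2]
E_fund = P0[:2] @ np.linalg.inv(np.eye(2) - S) @ np.ones(2)
print(E_tail, E_fund)                     # both approximately 4.571
```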
Proof of Proposition 4.
Since $n\hat Q(A) \sim \mathrm{Bin}(n, Q(A))$ has a binomial distribution, it follows from the central limit theorem that
$$\sqrt{n}\left(\hat Q(A) - Q(A)\right) \xrightarrow{L} N\!\left(0,\; Q(A)[1 - Q(A)]\right)$$
as $n\to\infty$. Notice that $\hat I^+ = g(\hat Q(A))$, where $g(Q) = \log[Q/P_0(A)]$ and $g'(Q) = 1/Q$. Equation (38) follows from the Delta method (see, e.g., Theorem 8.12 of [33]) and the fact that
$$V = g'(Q(A))^2 \cdot Q(A)[1 - Q(A)].$$
In order to establish (40), we first note that it follows from (34) and the definition of $p_{\min}$ that
$$P_{H_0}(\hat I^+ \ge I_{\min}) = P_{H_0}(\hat Q(A) \ge p_{\min}) = P_{H_0}\!\left(\frac{1}{n}\sum_{i=1}^n Y_i \ge p_{\min}\right),$$
where $Y_i = I(X_i \in A) \sim \mathrm{Be}(p_0)$ are independent Bernoulli variables under $H_0$, with success probability $p_0 = P_0(A)$. It follows from large deviations theory that (40) holds, with
$$C = \sup_{\phi > 0}\,[\phi\, p_{\min} - \lambda(\phi)] \tag{88}$$
the Legendre–Fenchel transformation, and
$$\lambda(\phi) = \log E[\exp(\phi Y)] = \log[1 + p_0(e^\phi - 1)] \tag{89}$$
the cumulant generating function of $Y$ [56], pp. 529–533. Inserting (89) into (88), it can be seen that the maximum in (88) is given by (41). □
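As a numerical illustration, the sketch below maximizes $\phi\, p_{\min} - \lambda(\phi)$ for invented values of $p_0 = P_0(A)$ and $p_{\min}$, and compares the result with the Kullback–Leibler divergence between $\mathrm{Be}(p_{\min})$ and $\mathrm{Be}(p_0)$, which is the well-known closed form of the Legendre–Fenchel transform of a Bernoulli cumulant generating function.

```python
import numpy as np
from scipy.optimize import minimize_scalar

p0, p_min = 0.02, 0.08     # assumed null probability P0(A) and threshold p_min > p0

def neg_rate(phi):
    lam = np.log(1.0 + p0 * (np.exp(phi) - 1.0))   # cumulant generating function (89)
    return -(phi * p_min - lam)

res = minimize_scalar(neg_rate, bounds=(1e-8, 50.0), method="bounded")
C_numeric = -res.fun

# Legendre-Fenchel transform of a Bernoulli cgf = KL divergence Be(p_min) || Be(p0).
C_kl = (p_min * np.log(p_min / p0)
        + (1.0 - p_min) * np.log((1.0 - p_min) / (1.0 - p0)))
print(C_numeric, C_kl)     # the two values agree up to numerical error
```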
Proof of Proposition 5.
In order to verify (46), we will first show that the estimator (42) of the tilting parameter $\theta$ is asymptotically normal,
$$\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{L} N(0, U) \quad \text{as } n\to\infty, \tag{90}$$
with asymptotic variance
$$U = \frac{\mathrm{Var}_Q[f(X)]}{\mathrm{Var}^2_{P_{\theta^*}}[f(X)]}. \tag{91}$$
To this end, let $'$ denote differentiation with respect to the tilting parameter $\theta$. Define the score function
$$\psi_\theta(x) = \frac{d \log P_\theta(x)}{d\theta} = \frac{P_\theta'(x)}{P_\theta(x)}$$
and its derivative
$$\psi_\theta'(x) = \frac{d\psi_\theta(x)}{d\theta}.$$
It is a standard result from the asymptotic theory of maximum likelihood estimation and M-estimation (see, e.g., Chapter 6 of [33]) that (90) holds, with asymptotic variance
$$U = \frac{\mathrm{Var}_Q[\psi_{\theta^*}(X)]}{E_Q^2[\psi_{\theta^*}'(X)]}. \tag{92}$$
To simplify (92), notice that the score function can be written as
$$\psi_\theta(x) = f(x) - \frac{M'(\theta)}{M(\theta)} = f(x) - E_{P_\theta}[f(X)] \tag{93}$$
for the exponential family of tilted distributions (10) and (11). From this, it follows that
$$\psi_\theta'(x) = -\left[\frac{M''(\theta)}{M(\theta)} - \left(\frac{M'(\theta)}{M(\theta)}\right)^2\right] = -\mathrm{Var}_{P_\theta}[f(X)]$$
is constant, not depending on $x$. Inserting the last two displayed equations into (92), the formula (91) for the asymptotic variance of $\hat\theta$ is obtained. As a next step, we notice that
$$\hat I^+ = g(\hat\theta),$$
where
$$g(\theta) = \log\frac{P_\theta(A)}{P_0(A)} = \log h(\theta) - \log P_0(A), \tag{95}$$
and
$$h(\theta) = P_\theta(A) = \int_A e^{\theta f(x)} P_0(x)\, dx \Big/ M(\theta) \tag{96}$$
follows from the definition of $P_\theta(x)$ in (10).
Differentiating (96) with respect to $\theta$, we find that
$$h'(\theta) = \int_A f(x)\, e^{\theta f(x)} P_0(x)\, dx \Big/ M(\theta) \;-\; M'(\theta) \int_A e^{\theta f(x)} P_0(x)\, dx \Big/ M^2(\theta). \tag{97}$$
Furthermore, it follows from the right-hand side of (97) that
$$h'(\theta) = E_{P_\theta}[f(X)\, I(f(X)\ge f_0)] - P_\theta(A)\, E_{P_\theta}[f(X)] = \mathrm{Cov}_{P_\theta}\!\left[f(X),\, I(f(X)\ge f_0)\right]. \tag{98}$$
Then, we combine (95) and (97), and obtain
$$g'(\theta) = \frac{h'(\theta)}{h(\theta)} = \frac{\mathrm{Cov}_{P_\theta}[f(X),\, I(f(X)\ge f_0)]}{P_\theta(A)}. \tag{99}$$
Finally, we use the Delta method to conclude that $\hat I^+$ is an asymptotically normal estimator (38) of $I^+(\theta^*)$, with asymptotic variance $V = g'(\theta^*)^2 U$, which, in view of (91) and (99), agrees with (47).
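A quick simulation can be used to check the asymptotic variance formula (91) in the correctly specified case $Q = P_{\theta^*}$, where $U = 1/\mathrm{Var}_{P_{\theta^*}}[f(X)]$. The four-state space, the specificity values and $\theta^*$ below are invented for illustration; $\hat\theta$ solves the moment equation implied by the score function (93).

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
f_vals = np.array([0.0, 1.0, 2.0, 3.0])   # assumed specificity values on 4 states
P0 = np.full(4, 0.25)                     # uniform null distribution
theta_star, n, reps = 1.2, 400, 2000

def p_theta(th):
    w = P0 * np.exp(th * f_vals)          # exponential tilting (10)
    return w / w.sum()

def mean_f(th):
    return np.sum(p_theta(th) * f_vals)

var_f = np.sum(p_theta(theta_star) * f_vals**2) - mean_f(theta_star)**2
U = 1.0 / var_f                           # (91) when the model is correctly specified

est = np.empty(reps)
for r in range(reps):
    x = rng.choice(4, size=n, p=p_theta(theta_star))
    fbar = f_vals[x].mean()
    # theta_hat solves the score equation (93): sample mean of f = E_{P_theta}[f(X)].
    est[r] = brentq(lambda th: mean_f(th) - fbar, -20.0, 20.0)

print(n * est.var(), U)                   # empirical vs. theoretical asymptotic variance
```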
In order to prove the large deviation result (48) for the parametric test of FT, let $\theta_{\min}$ be the value of the tilting parameter that satisfies $P_{\theta_{\min}}(A) = p_{\min} = P_0(A)\exp(I_{\min})$. Then, notice that
$$P_{H_0}(\hat I^+ \ge I_{\min}) = P_{H_0}(\hat Q(A) \ge p_{\min}) = P_{H_0}(\hat\theta \ge \theta_{\min}) = P_{H_0}\!\left(\sum_{i=1}^n \psi_{\theta_{\min}}(X_i)/n \ge 0\right) = P_{H_0}\!\left(\sum_{i=1}^n f(X_i)/n \ge E_{p_{\min}}[f(X)]\right),$$
where, in the third step, we utilized that $\hat\theta \ge \theta_{\min}$ is equivalent to the derivative of the log-likelihood of the data being non-negative at $\theta_{\min}$, and, in the fourth step, we made use of (93) and introduced $p_{\min} = P_{\theta_{\min}}$. The last line is a large deviations probability, and it follows from large deviations theory that (48) holds, with $C$ the Legendre–Fenchel transformation in (49). □
Proof of Proposition 6.
Since the bias-corrected empirical actinfo
$$\hat I_n^+ - B = \log\frac{\hat Q(A)}{P_{0\xi}(A)} \tag{100}$$
behaves like (34), with $P_0 = P_{0\xi}$, the asymptotic normality result for the nonparametric version of the estimator of $I_Q^+$ follows from Proposition 4.
For the parametric version of the estimator of $I_Q^+$, we will (briefly) generalize the asymptotic normality proof of Proposition 5. It follows from (53) and (55) that
$$\hat I_n^+ = g(\hat\theta, \hat\xi),$$
where
$$g(\theta,\xi) = \log\frac{P_{\theta\xi}(A)}{P_{0\max}(A)}. \tag{101}$$
Making use of the Delta method, it follows that the asymptotic variance of the parametric version of $\hat I_n^+$ equals
$$V = \nabla g(\theta^*, \xi^*)\, \mathrm{AsVar}(\hat\theta,\hat\xi)\, \nabla g(\theta^*,\xi^*)^T, \tag{102}$$
with the asymptotic variance of $(\hat\theta,\hat\xi)$ defined through
$$\sqrt{n}\left[(\hat\theta,\hat\xi) - (\theta^*,\xi^*)\right] \xrightarrow{L} N\!\left(0,\; \mathrm{AsVar}(\hat\theta,\hat\xi)\right) \tag{103}$$
as $n\to\infty$. Since $(\hat\theta,\hat\xi)$ in (56) is an M-estimator, it follows that its asymptotic variance equals
$$\mathrm{AsVar}(\hat\theta,\hat\xi) = E[\psi'_{\theta^*\xi^*}(X)]^{-1}\, E[\psi_{\theta^*\xi^*}^T(X)\, \psi_{\theta^*\xi^*}(X)]\, E[(\psi'_{\theta^*\xi^*})^T(X)]^{-1}. \tag{104}$$
The gradient of (101) is
$$\nabla g(\theta,\xi) = \frac{\nabla P_{\theta\xi}(A)}{P_{\theta\xi}(A)} = E[\psi_{\theta\xi}(X) \mid X\in A],$$
where $\psi_{\theta\xi}(x) = \nabla P_{\theta\xi}(x)/P_{\theta\xi}(x)$ is the likelihood score function for the combined parameter vector $(\theta,\xi)$. Putting things together, the asymptotic variance formula (59) for the parametric version of $\hat I_n^+$ follows from (102)–(104).
The significance level of the FT test can be written as
$$P_{0\xi}(\hat I_n^+ \ge I_{\min}) = P_{0\xi}(\hat I_n^+ - B \ge I_{\min} - B). \tag{105}$$
Since $p_{\min} = P_{0\xi}(A)\exp(I_{\min})$, we have that
$$I_{\min} - B = \log\frac{p_{\min}\, e^{-B}}{P_{0\xi}(A)}.$$
From this and (100), it follows that the nonparametric test of FT behaves as the corresponding nonparametric test of Proposition 4, with the null probability $P_0(A)$ replaced by $P_{0\xi}(A)$, and $p_{\min}$ replaced by $p_{\min} e^{-B}$. Therefore, the large deviation result (61) follows from (41). In a similar way, the large deviation result for the parametric version of the FT test (in the special case when $\theta$ is a scalar exponential tilting parameter) follows from (100), (105) and Proposition 5. □
Proof of Proposition 7.
Because of (52) and (63), we have that
$$\sqrt{n}\,(\hat I_{nn_0}^+ - I_Q^+) = \sqrt{n}\,\log\frac{\hat Q(A)}{Q(A)} - \sqrt{\frac{n}{n_0}}\cdot\sqrt{n_0}\,\log\frac{P_{0\hat\xi}(A)}{P_{0\xi}(A)}, \tag{106}$$
where
$$\sqrt{n}\,\log\frac{\hat Q(A)}{Q(A)} \xrightarrow{L} N(0, V_1) \quad \text{as } n\to\infty \tag{107}$$
and
$$\sqrt{n_0}\,\log\frac{P_{0\hat\xi}(A)}{P_{0\xi}(A)} \xrightarrow{L} N(0, V_2) \quad \text{as } n_0\to\infty, \tag{108}$$
respectively. It follows from the proofs of Propositions 4 and 5 that the asymptotic variance $V_1$ in (107) is the same as $V$ in (39) and (59), for the nonparametric and parametric versions of $\hat Q(A)$, respectively. The asymptotic variance $V_2$ in (108) is given by (67). This is proven using the Delta method (similarly as for Proposition 6), making use of the fact that $\hat\xi$ is the maximum likelihood estimator of $\xi$, with an asymptotic variance equal to the inverse $E[\psi_\xi^T(X)\psi_\xi(X)]^{-1}$ of the Fisher information matrix. The asymptotic normality result (66) then follows from (106)–(108), the fact that $n/n_0 \to \lambda$, and the independence of the two samples.
The large deviations results are proven in a similar way as in Proposition 6, replacing $P_{0\max}(A)$ by $P_{0\hat\xi}(A)$. Using the consistency $\hat\xi \xrightarrow{p} \xi$ as $n_0 \to \infty$, it follows that the large deviation rates $C$ of Proposition 7, for the nonparametric and parametric versions of the FT tests, are the same as in Proposition 6, with bias term $B = 0$. □
Details from Example 4. In order to prove that the Metropolis–Hastings-type Markov chain (14) with acceptance probabilities (73) has equilibrium distribution $P_\theta$, we first notice that, for any pair of states $x \neq y$, the flow of probability mass
$$P_\theta(x)\,\pi_\theta(x,y) = P_\theta(x)\, q(x,y)\, \alpha_\theta(x,y) = \frac{P_0(x)\, e^{\theta f(x)}}{M(\theta)}\, q(x,y) \cdot C\left[\frac{e^{\theta f(y)} P_0(y)\, q(y,x)}{e^{\theta f(x)} P_0(x)\, q(x,y)}\right]^{1/2} = \frac{C\left[e^{\theta f(x)} P_0(x)\, q(x,y)\; e^{\theta f(y)} P_0(y)\, q(y,x)\right]^{1/2}}{M(\theta)} \tag{109}$$
from $x$ to $y$ is symmetric in $x$ and $y$. Therefore, the flow $P_\theta(y)\pi_\theta(y,x)$ of probability mass in the opposite direction, from $y$ to $x$, is the same as in (109). A Markov chain with this property is called reversible [57], pp. 11–12, and it is well known that $P_\theta$ is a stationary distribution if the Markov chain is reversible with reversible measure $P_\theta$ [58], p. 238. If, additionally, the proposal distribution $q$ makes it possible to move between any pair of states in a finite number of steps, the Markov chain is irreducible, and hence $P_\theta$ is its unique stationary distribution, which is also the equilibrium distribution of the Markov chain [58], p. 232.
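This reversibility argument is easy to check numerically: for any $f$, $P_0$ and proposal kernel $q$, the flow matrix $P_\theta(x)\pi_\theta(x,y)$ built from the square-root acceptance rule in (109) should be symmetric. The sketch below does this on an invented six-state space, with $C$ chosen so that all acceptance probabilities stay below 1.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                                         # invented state space of 6 states
f = rng.uniform(size=m)                       # arbitrary specificity values
P0 = rng.uniform(size=m); P0 /= P0.sum()      # arbitrary null distribution
Q = rng.uniform(size=(m, m)); np.fill_diagonal(Q, 0.0)
Q /= Q.sum(axis=1, keepdims=True)             # arbitrary proposal kernel
theta = 1.7

# Square-root acceptance rule from (109), with C = 1/max(R) keeping alpha <= 1.
with np.errstate(divide="ignore", invalid="ignore"):
    R = np.sqrt(np.exp(theta * (f[None, :] - f[:, None]))
                * (P0[None, :] * Q.T) / (P0[:, None] * Q))
R[~np.isfinite(R)] = 0.0
Pi = Q * (R / R.max())
np.fill_diagonal(Pi, 1.0 - Pi.sum(axis=1))    # rejected proposals stay put

P_theta = P0 * np.exp(theta * f); P_theta /= P_theta.sum()
flow = P_theta[:, None] * Pi                  # flow(x, y) = P_theta(x) * pi_theta(x, y)
print(np.max(np.abs(flow - flow.T)))          # ~ 0 up to floating point rounding
```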
Next, we motivate formula (77) for the acceptance probability of the Moran model. Assume that the population evolves over time as a Moran model, and that all individuals have type $x$. If one individual mutates from $x$ to $y$, then, because of (75), the relative fitness between the $N-1$ individuals of type $x$ and the newly mutated individual of type $y$ is
$$s = \frac{e^{\theta f(y)/N}}{e^{\theta f(x)/N}} = e^{\theta[f(y)-f(x)]/N}. \tag{110}$$
From the theory of Moran models (e.g., [41,59]), it is well known that the fixation probability of the newly mutated individual is
$$\beta_N(s) = \begin{cases} (1 - s^{-1})/(1 - s^{-N}), & s \neq 1, \\ 1/N, & s = 1. \end{cases} \tag{111}$$
Inserting (110) into (111), we find (when $s \neq 1$, or equivalently when $\Delta = \theta[f(y)-f(x)] \neq 0$) that
$$\beta_N(s) = \frac{1 - e^{-\Delta/N}}{1 - e^{-\Delta}} \approx \frac{1}{N}\cdot\frac{\Delta}{1 - e^{-\Delta}} \approx \frac{1}{N}\left(1 + \frac{\Delta}{2}\right),$$
which is equivalent to (77). □
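As a sanity check of this approximation, the short sketch below compares the exact fixation probability (111), with $s$ taken from (110), to the approximation $(1 + \Delta/2)/N$ of (77) for a few invented values of $\Delta$; the agreement is close for small $\Delta$ and degrades as $\Delta$ grows.

```python
import numpy as np

N = 1000                                         # population size
for delta in (0.05, 0.2, 0.5):                   # invented values of theta*[f(y)-f(x)]
    s = np.exp(delta / N)                        # relative fitness (110)
    beta = (1.0 - 1.0 / s) / (1.0 - s ** (-N))   # exact fixation probability (111)
    approx = (1.0 + delta / 2.0) / N             # approximation (77)
    print(f"Delta = {delta}: exact {beta:.6e}, approximate {approx:.6e}")
```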

Author Contributions

D.A.D.-P. and O.H. contributed equally to all parts of the manuscript, including conceptualization, methodology, writing, review, and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors want to thank two anonymous reviewers for valuable comments that considerably improved the quality of the paper. SDG.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gödel, K. Über Formal Unentscheidbare Sätze der Principia Mathematica und Verwandter Systeme, I. Monatshefte Math. Phys. 1931, 38, 173–198.
2. Hofstadter, D.R. Gödel, Escher, Bach: An Eternal Golden Braid; Basic Books: New York, NY, USA, 1999.
3. Whitehead, A.N.; Russell, B. Principia Mathematica; Cambridge University Press: Cambridge, UK, 1927.
4. Wolpert, D.H.; MacReady, W.G. No Free Lunch Theorems for Search; Technical Report SFI-TR-95-02-010; Santa Fe Institute: Santa Fe, NM, USA, 1995.
5. Wolpert, D.H.; MacReady, W.G. No Free Lunch Theorems for Optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82.
6. Wolpert, D.H. What is important about the No Free Lunch theorems? In Black Box Optimization, Machine Learning and No-Free Lunch Theorems; Pardalos, P.M., Rasskazova, V., Vrahatis, M.N., Eds.; Springer: Berlin/Heidelberg, Germany, 2021.
7. Dembski, W.A.; Marks, R.J., II. Bernoulli's Principle of Insufficient Reason and Conservation of Information in Computer Search. In Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, TX, USA, 11–14 October 2009; pp. 2647–2652.
8. Dembski, W.A.; Marks, R.J., II. Conservation of Information in Search: Measuring the Cost of Success. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2009, 5, 1051–1061.
9. Hazen, R.M.; Griffin, P.L.; Carothers, J.M.; Szostak, J.W. Functional information and the emergence of biocomplexity. Proc. Natl. Acad. Sci. USA 2007, 104, 8574–8581.
10. Szostak, J.W. Functional information: Molecular messages. Nature 2003, 423, 689.
11. Díaz-Pachón, D.A.; Marks, R.J., II. Generalized active information: Extensions to unbounded domains. BIO-Complexity 2020, 2020, 1–6.
12. Díaz-Pachón, D.A.; Sáenz, J.P.; Rao, J.S.; Dazard, J.E. Mode hunting through active information. Appl. Stoch. Model. Bus. Ind. 2019, 35, 376–393.
13. Liu, T.; Díaz-Pachón, D.A.; Rao, J.S.; Dazard, J.E. High Dimensional Mode Hunting Using Pettiest Component Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2022, accepted.
14. Montañez, G.D. The famine of forte: Few search problems greatly favor your algorithm. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; pp. 477–482.
15. Montañez, G.D. A Unified Model of Complex Specified Information. BIO-Complexity 2018, 2018, 1–26.
16. Díaz-Pachón, D.A.; Sáenz, J.P.; Rao, J.S. Hypothesis testing with active information. Stat. Probab. Lett. 2020, 161, 108742.
17. Carter, B. Large Number Coincidences and the Anthropic Principle in Cosmology. In Confrontation of Cosmological Theories with Observational Data; Longhair, M.S., Ed.; D. Reidel: Dordrecht, The Netherlands, 1974; pp. 291–298.
18. Barrow, J.D.; Tipler, F.J. The Anthropic Cosmological Principle; Oxford University Press: Oxford, UK, 1988.
19. Davies, P. The Accidental Universe; Cambridge University Press: Cambridge, UK, 1982.
20. Lewis, G.F.; Barnes, L.A. A Fortunate Universe: Life In a Finely Tuned Cosmos; Cambridge University Press: Cambridge, UK, 2016.
21. Rees, M.J. Just Six Numbers: The Deep Forces That Shape The Universe; Basic Books: New York, NY, USA, 2000.
22. Adams, F.C. The degree of fine-tuning in our universe—and others. Phys. Rep. 2019, 807, 1–111.
23. Barnes, L.A. The Fine Tuning of the Universe for Intelligent Life. Publ. Astron. Soc. Aust. 2012, 29, 529–564.
24. Tegmark, M.; Rees, M.J. Why is the cosmic microwave background fluctuation level $10^{-5}$? Astrophys. J. 1998, 499, 526–532.
25. Tegmark, M.; Aguirre, A.; Rees, M.; Wilczek, F. Dimensionless constants, cosmology, and other dark matters. Phys. Rev. D 2006, 73, 023505.
26. Díaz-Pachón, D.A.; Hössjer, O.; Marks, R.J., II. Is Cosmological Tuning Fine or Coarse? J. Cosmol. Astropart. Phys. 2021, 2021, 020.
27. Díaz-Pachón, D.A.; Hössjer, O.; Marks, R.J., II. Sometimes size does not matter. Found. Phys. 2022, under revision.
28. Dingjan, T.; Futerman, A.H. The fine-tuning of cell membrane lipid bilayers accentuates their compositional complexity. BioEssays 2021, 43, e2100021.
29. Dingjan, T.; Futerman, A.H. The role of the 'sphingoid motif' in shaping the molecular interactions of sphingolipids in biomembranes. Biochim. Biophys. Acta BBA Biomembr. 2021, 1863, 183701.
30. Thorvaldsen, S.; Hössjer, O. Using statistical methods to model the fine-tuning of molecular machines and systems. J. Theor. Biol. 2020, 501, 110352.
31. Asmussen, S.; Glynn, P.W. Stochastic Simulation: Algorithms and Analysis; Springer: Berlin/Heidelberg, Germany, 2007.
32. Siegmund, D. Importance Sampling in the Monte Carlo Study of Sequential Tests. Ann. Stat. 1976, 4, 673–684.
33. Lehmann, E.L.; Casella, G. Theory of Point Estimation, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1998.
34. Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer: Berlin/Heidelberg, Germany, 2010.
35. Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57, 97–109.
36. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092.
37. Kirkpatrick, S.; Gelatt, C.D., Jr.; Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220, 671–680.
38. Ross, S. Introduction to Probability Models, 8th ed.; Academic Press: Cambridge, MA, USA, 2003.
39. Asmussen, S.; Nerman, O.; Olsson, M. Fitting Phase-type Distributions via the EM Algorithm. Scand. J. Stat. 1996, 23, 419–441.
40. Neuts, M.F. Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach; Johns Hopkins University Press: Hoboken, NJ, USA, 1981.
41. Hössjer, O.; Bechly, G.; Gauger, A. On the waiting time until coordinated mutations get fixed in regulatory sequences. J. Theor. Biol. 2021, 524, 110657.
42. Varadhan, S.R.S. Large Deviations and Applications; SIAM: Philadelphia, PA, USA, 1984.
43. Hössjer, O.; Díaz-Pachón, D.A.; Chen, Z.; Rao, J.S. Active information, missing data, and prevalence estimation. arXiv 2022, arXiv:2206.05120.
44. Hössjer, O.; Díaz-Pachón, D.A.; Rao, J.S. Active Information, Learning, and Knowledge Acquisition. PsyArXiv 2022.
45. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285.
46. Durrett, R. Probability Models for DNA Sequence Evolution; Springer: Berlin/Heidelberg, Germany, 2008.
47. Moran, P.A.P. Random processes in genetics. Math. Proc. Camb. Philos. Soc. 1958, 54, 60–71.
48. Moran, P.A.P. A general theory of the distribution of gene frequencies—I. Overlapping generations. Proc. Roy. Soc. Lond. B 1958, 149, 102–112.
49. Mitchell, M. An Introduction to Genetic Algorithms; MIT Press: Cambridge, MA, USA, 1996.
50. Vikhar, P.A. Evolutionary algorithms: A critical review and its future prospects. In Proceedings of the 2016 International Conference on Global Trends in Signal Processing, Information Computing and Communication (ICGTSPICC), Jalgaon, India, 22–24 December 2016; pp. 261–265.
51. Abel, D.L.; Trevors, J.T. Three subsets of sequence complexity and their relevance to biopolymeric information. Theor. Biol. Med. Model. 2005, 2, 29.
52. Durston, K.K.; Chiu, D.K.Y. A functional entropy model for biological sequences. Dynamics of Continuous, Discrete & Impulsive Systems, Series B: Applications & Algorithms, Supplement. In Proceedings of the International Conference on Engineering Applications and Computational Algorithms, Guelph, ON, Canada, 27–29 July 2005; Liu, X., Ed.; pp. 722–725.
53. Durston, K.K.; Chiu, D.K.Y. Functional Sequence Complexity in Biopolymers. In The First Gene: The Birth of Programming, Messaging and Formal Control; Abel, D.L., Ed.; LongView Press: New York, NY, USA, 2011; pp. 147–169.
54. Durston, K.K.; Chiu, D.K.Y.; Abel, D.L.; Trevors, J.T. Measuring the functional sequence complexity of proteins. Theor. Biol. Med. Model. 2007, 4, 47.
55. Díaz-Pachón, D.A.; Marks, R.J., II. Active Information Requirements for Fixation on the Wright–Fisher Model of Population Genetics. BIO-Complexity 2020, 2020, 1–6.
56. Kallenberg, O. Foundations of Modern Probability, 3rd ed.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 2.
57. Popov, S. Two-Dimensional Random Walk: From Path Counting to Random Interlacements; Cambridge University Press: Cambridge, UK, 2021.
58. Grimmett, G.; Stirzaker, D. Probability and Random Processes, 3rd ed.; Oxford University Press: Oxford, UK, 2001.
59. Komarova, N.L.; Sengupta, A.; Nowak, M.A. Mutation-selection networks of cancer initiation: Tumor suppressor genes and chromosomal instability. J. Theor. Biol. 2003, 223, 433–450.
Figure 1. Plot of $I^+(\theta) = \lim_{t\to\infty} I^+(\theta,t)$ in (80) as a function of $\theta$ for a system of molecular machines with transition kernel (73), proposal distribution (78), and null distribution (79). The system has $d = 5$ components, $b = 1.0$, and $a = -0.2$ (dash-dotted), $a = 0$ (solid) and $a = 0.2$ (dashed). The horizontal dotted line corresponds to the functional information $I_{f_0} = 3.47$.
Figure 2. Plot of $I^+(\theta) = \lim_{t\to\infty} I^+(\theta,t)$ in (80) as a function of $\theta$ for a system of molecular machines with transition kernel (73), proposal distribution (78), and null distribution (79). The system has $d = 5$ components, $b = 0.5$, and $a = -0.2$ (dash-dotted), $a = 0$ (solid), and $a = 0.2$ (dashed). The horizontal dotted line corresponds to the functional information $I_{f_0} = 5.09$.
Figure 3. Plot of $I^+(\theta,t)$ in (80) (dashed) and $I_s^+(\theta,t)$ (solid) as a function of $t$ for a system of molecular machines with transition kernel (73), proposal distribution (78), and null distribution (79). The system has $d = 5$ components and $\theta = 2.5$. The upper (lower) row corresponds to $b = 1$ ($b = 0.5$), whereas the left (right) column corresponds to $a = -0.2$ ($a = 0.2$). The horizontal lines in each figure illustrate $I^+(\theta)$ (dash-dotted) and the functional information $I_{f_0}$ (dotted).