An Algorithmic Approach to Emergence

We suggest a quantitative and objective notion of emergence. Our proposal uses algorithmic information theory as a basis for an objective framework in which a bit string encodes observational data. A plurality of drops in the Kolmogorov structure function of such a string is seen as the hallmark of emergence. Our definition offers some theoretical results, in addition to extending the notions of coarse-graining and boundary conditions. Finally, we confront our proposal with applications to dynamical systems and thermodynamics.


Introduction
Emergence is a concept often referred to in the study of complex systems. Coined in 1875 by the philosopher George H. Lewes in his book Problems of Life and Mind [1], the term has ever since mainly been used in qualitative discussions [2,3]. In most contexts, emergence refers to the phenomenon by which novel properties arise in a complex system which is composed of a large quantity of simpler subsystems that do not exhibit those novel properties by themselves, but rather through their collective interactions. The following citation from Wikipedia [4] reflects this popular idea: "For instance, the phenomenon of life as studied in biology is an emergent property of chemistry, and psychological phenomena emerge from the neurobiological phenomena of living things".
For claims such as the above to have a precise meaning, an objective definition of emergence must be provided. Current definitions are framed around a qualitative evaluation of the "novelty" of properties exhibited by a system with respect to those of its constituent subsystems. This state of matters renders generic use of the term ambiguous and subjective, and hence potentially problematic within a scientific discussion. In this paper, we attempt to free the notion of emergence from subjectivity by proposing a mathematical and objective notion of emergence.
The constructionist hypothesis breaks down when confronted with the twin difficulty of scale and complexity. [. . . ] at each level of complexity, entirely new properties appear, and the understanding of the new behaviours requires research which I think is as fundamental in its nature as any other. [. . . ] At each stage, entirely new laws, concepts, and generalizations are necessary, requiring inspiration and creativity to just as great a degree as the previous one. Psychology is not applied biology, nor biology is applied chemistry.
More recently, David Wallace [7] (Chapter 2) qualifies emergent entities to be "not directly definable in the language of microphysics (try defining a haircut within the Standard Model) but that does not mean that they are somehow independent of that underlying microphysics". Emergent objects are instead seen as patterns, and the existence of those patterns as real things is subjected to a criterion that Wallace attributes to Dennett [8]. The patterns are real if the theories that admit them in their ontology gain in usefulness, in explanatory power or in predictive reliability. For instance, Dennett's criterion mandates that temperature be thought of an emergent but real concept because it is a useful pattern. In fact, temperature is not a basic entity of the microphysics, yet a full scientific description of a large system with no reference to the notion temperature completely misses a fundamental aspect. In this spirit, temperature is as real as it is useful, namely, as real as it is used in key phenomenological explanations. Useful patterns, or structures, is a central concept that is formalized within algorithmic information theory, more precisely in non-probabilistic statistics.

From Systems to Bit Strings
To better appreciate our reliance on algorithmic information theory, we present our epistemological standpoint. We take the realist view that there is a world outside and independent of our perception. This world is made of physical systems, and the goal of science is to understand their properties, their dynamics and their possibilities. This is done through an interplay between the formulation of theories-bold conjectures about how the world is [9]-and their experimental challenges. Pragmatically, theories can serve the purpose of providing simple models to explain the data, an idea which will be explored throughout the paper. Empirical observation, on the other hand, collects data from physical systems. But what is the nature of such data collection? How can one extract from a system, assumed to exist in reality, a string of symbols that can be taken to be binary?
The answer lies in the physics of the measurement process. Observation consists of an interaction between the physical system one cares to learn about and some prepared measurement apparatus. The measurement apparatus then interacts with a computing device (this can be the experimenter) that arranges its memory in a physical representation of a bit string x. A scientist collecting data about a system shall then be left with a string x, which, clearly, is not only determined by the investigated system. The information in x could reflect properties of other systems with which it has previously interacted, such as surrounding systems (the environment), the measurement apparatus and the scientist itself. As observed by Gell-Mann and Lloyd [10], this introduces several sources of arbitrariness into x, in addition to the level of details of the measurement and the coding convention that maps the apparatus's configuration into bits. Furthermore, the knowledge and cognitive biases of the scientist impact what is being measured. For Gell-Mann and Lloyd, all this arbitrariness is to be discarded in order to define the (algorithmic) information content of the system through that of x. We do not share this view, as we think that this arbitrariness inhibits a well-posed definition. In fact, a subtlety of scientific investigation concerns how to probe the system in order to push into x the yet-to-be-understood features that it can exhibit; and this shall never be constrained by a fixed method.
Nonetheless, the subjective and error-prone connection between the physical system and the recorded data does not preclude an overall objective modelling of the world. This is because, after all, the string originates from a real physical process. Subjectivity and errors impact which process occurs, yet a sufficiently careful analysis of the data finds the best explanation of its origin, and so also explains the scientist's choices and the unwanted interactions. For instance, if we ask a dishonest scientist to give us data about a system, but he elects instead to give us bits at whim, then investigating the data will lead to models of what was happening in that person's brain, which is itself a part of reality. Thus, the string x is always objective data from a real system, although not necessarily the one that was presumed to be under investigation.
Once the data x is fixed, we face the problem of finding the best explanations for it, which is related to finding its patterns or structures. This is the main investigation of the paper. It can be done in the realm of algorithmic information theory (AIT), a branch of mathematics and logic that offers similar tools as probability theory, but with no need for unexplained randomness. Li and Vitányi, authors of the most cited textbook [11] in the field, claim that "Science may be regarded as the art of data compression" and according to the pioneer Gregory Chaitin [12], "[A] scientific theory is a computer program that enables you to compute or explain your experimental data". Indeed, even theoretical pen and paper work constitutes symbolic manipulations which are inherently algorithmic (see Figure 1).

System
Data A point remains to be addressed. Why work with classical information and computation instead of their quantum counterparts? As quantum computation can be classically emulated [13], the quantum gain is only in speed, and not fundamental in terms of what can or cannot be computed. This work is grounded in computability theory, so by leaving aside questions of time complexity, we also leave aside quantum computation.

Outline
This paper is organized as follows. In Section 2, we give a review of the basic notions of algorithmic information theory, with a particular focus on non-probabilistic statistics and connections in physics. Building on those, we introduce in Section 3 an algorithmic definition of emergence and we derive from it some concepts and results. In Section 4, we illustrate the relevance of the proposed definition in a toy model as well as in applications to dynamical systems and thermodynamics.

A Primer on Algorithmic Methods
Algorithmic information theory (AIT) [14][15][16] is the mergence of Shannon's theory of information [17] and Turing's theory of computation [18]. Introduced in his seminal paper titled "A Mathematical Theory of Communication", Shannon's theory concerns the ability to communicate a message that comes from a random source of symbols. In this context, the randomness is formalized in the probabilistic setting, and represents ignorance or unpredictability of the symbols to come. The entropy is then a functional on the underlying distribution that quantifies an optimal compression of the message. Concretely, this underlying distribution is often estimated through the observed biases in the frequency of the sequences of symbols to transmit. However, noticing such biases is only a single way to compress a message. For instance, if Alice were to communicate the 10 10 first digits of π to Bob, a pragmatic application of Shannon's information theory would be of no help since the frequencies of the symbols to transmit are uniform (if π is normal, which it is conjectured to be). However, Alice could simply transmit: 'The first 10 10 digits of 4 Bob then understands the received message as an instruction that he runs on a universal computing device to obtain the desired message. Equipping information theory with universal computation enables message compression by all possible (computable) meansand not just via statistical biases. As we will see, the length of the best compression of a message is a natural measure of the information contained in the message.

Algorithmic Complexity
We give the basic definitions and properties of algorithmic complexity. See [11] (Chapters 1-3) for details, attributions and background on computability theory.
The algorithmic complexity K(x) of a piece of data x is the length of its shortest computable description. It can be understood as the minimum amount of information required to produce x by any computable process. Per contra to Shannon's notion of information, which supposes an a priori random process from which the data has originated, algorithmic complexity is an intrinsic measure of information. Because all discrete data can be binary coded, we consider only finite binary strings (referred to as "strings" from now on), i.e., where stands for the empty word. For a meaningful definition, we have to select a universal computing device U on which we execute the computation to obtain x from the description. Such a description is called a program p, and since it is itself a string, the length of p is well defined as the number of bits in it and denoted |p|. Therefore, Note that abstractly, U can be thought of as a universal device in any Turing-complete model of computation. In the realm of Turing machines, a universal device expects an input p encoding a pair p = q, i and simulates the machine of program q on input i. Concretely, U can be thought of as a modern computer or a human with pen and paper. This is the essence of the Church-Turing thesis, according to which all sufficiently generic approaches to symbolic manipulations are equivalent and encompass physically realizable computations. The invariance theorem for algorithmic complexity guarantees that no other formal mechanism can yield an essentially shorter description. This is because the reference universal computing device U can simulate any other computing device V with a constant overhead in program length; i.e., there exists a constant C U V such that holds uniformly for all x. If V is also a reference universal computing device, then K U (x) and K V (x) differ by at most a fixed constant. It is customary in this field to use the big-O notation and express an error term as a function of |x| = n. Henceforth, the character n shall be reserved to denote the length of x. In general, O( f (n)) denotes a quantity whose absolute value does not exceed f (n) by more than a fixed multiplicative factor, so in this case, we can write K U (x) ≤ K V (x) + O(1). Throughout the paper, O(log n) error terms shall often occur. For significant data set, |x| = n is very large and so O(log n) is comparatively a very small quantity. For instance, if α ≤ n, then K(α) ≤ O(log n) since α is a (binary) number of at most log n + 1 bits, and one program to compute α is simply to enumerate all those bits.
Since the ambiguity in the choice of computing devices is lifted (up to an additive constant), we omit the subscript U in the notation. Algorithmic complexity is in this sense a universal measure of the complexity of x.
The conditional algorithmic complexity K(x | y) of x relative to y is defined as the length of the shortest program to compute x, if y is provided as an auxiliary input. More formally, Multiple strings x 1 , . . . , x n can be encoded into a single one denoted x 1 , . . . , x n . The algorithmic complexity K(x 1 , . . . , x n ) of multiple strings is then defined as For technical reasons, we restrict the set of programs resulting in a halting computation to be such that no halting program is a prefix of another halting program, namely, the set of halting programs is a prefix code. One way to impose such a constraint is to have all programs self-delimiting, meaning that the computational device U halts its computation after reading the last bit of the program p, but no further. This restriction is not fundamentally needed for our purposes, but it entails an overall richer and cleaner theory of algorithmic information. For instance, the upcoming relation (1) holds within an additive constant only if selfdelimitation is imposed.
A key property of entropy in Shannon's theory is the chain rule that relates the entropy of a pair to those of the constituents. This is also achieved in the realm of AIT. Let x * be the shortest program that computes x. (If there are more than one "shortest program", then x * is the fastest, and if more than one have the same running time, then x * is the first in lexicographic order.) Algorithmic complexity satisfies the important chain rule One obvious procedure to compute the pair of strings x and y is to first compute x out of its shortest program x * , and then use x * to compute y, which proves the "≤" part of (1). The "≥" side, harder to prove, states that the previous procedure to compute x, y is nearly optimal in terms of program length. A looser formulation of the chain rule is where n = max(|x|, |y|). A major drawback of algorithmic complexity for pragmatic use is that it is uncomputable, namely, no algorithm can return K(x) on a generic input x.

Non-Probabilistic Statistics
Standard statistics are founded upon probability theory. Remarkably, the same person who axiomatized probability theory managed to detach statistics and model selection from its probabilistic roots. Kolmogorov suggested [19] that AIT could serve as a basis for statistics and model selection for individual data. See [20] for a modern review.
In this setting, a model of x is defined to be a finite set S ⊆ {0, 1} * such that x ∈ S. It is also referred to as an algorithmic statistic or non-probabilistic statistic. Any model S can be quantified by its cardinality, denoted |S|, and by its algorithmic complexity K(S), yielding a quantitative meaning of "simple" and "complex". To define K(S) properly, let again U be the reference universal computing device. Let p be a program that computes an encoding x 1 , . . . , x N of the lexicographical ordering of the elements of S and halts.
Then, S * is the shortest such program and K(S) is its length. When S and S are two models of x of the same complexity α, we say that S is a better model than S if it contains fewer elements. This is because there is less ambiguity in specifying x within a model containing fewer elements. In this sense, more of the distinguishing properties of x are reflected by such a model. Indeed, among all models of complexity ≤ α, the model of smallest cardinality is optimal for this fixed threshold of complexity.
Any string x of length n exhibits two canonical models shown in Table 1. The first is simply S Babel := {0, 1} n , which has small complexity as it is easy to describe-a program producing it only requires the information about n. However, it is a large set, containing 2 n elements. It is intuitively a bad model since it does not capture any properties of x, except its length. The other canonical example is S x := {x}. This time, S x has large complexity, namely, it is as hard to describe as x is, but it is a very tiny set with a single element. The model S x is also bad as it captures everything about x, even the noise or incidental randomness. This significantly weighs down the description of the model, and is commonly known as over-fitting. A good map of Montreal is not Montreal itself! Table 1. Complexity and cardinality of S Babel and S x .
because one way to compute x out of S is to give the log |S| bit-long index of x in the lexicographical ordering of the elements of S, where · denotes the ceiling function. (Note that the program can be made self-delimiting at no extra cost because the length of the index can be computed from the resource S provided.) This trivial computation of x relative to S is known as the data-to-model code (which should really be called model-to-data.). A string x is a typical element of its model S if the data-to-model code is essentially the shortest program, i.e., if In such a case, there is no simple property that singles out x from the other elements of S. Notice also that the data x can always be described by a two-part description: the model description and the data-to-model code. Hence, In his seminal paper [21] on the foundations of theoretical (probabilistic) statistics, Fischer stated: "The statistic chosen should summarize the whole of the relevant information supplied by the sample. This may be called the Criterion of Sufficiency.". Kolmogorov suggested an algorithmic counterpart. A model S x is sufficient for x if the two-part description with S as a model is an almost-shortest description, namely, Here, the O(log n) refers to K(K(S), log |S|) since the self-delimited two-part code implicitly carries the length of each part as its intrinsic information, while the optimal one-part code, x * , in general does not know about the size of each part.
Such models constrain the set of strings to those sharing "the whole of the relevant" properties that characterize x, which is then a typical element of those models.
Finally, a good model should not give more than the relevant information supplied by the data. The simplest sufficient model displays all the relevant properties of the data, and nothing more, thus preventing over-fitting. It is called the minimal sufficient model and denoted S M .

Kolmogorov's Structure Function
For a given string x, its associated Kolmogorov's structure function explores tradeoffs between the complexity and cardinality of possible models. This function maps any complexity threshold to the log-cardinality of the optimal model S within that threshold. It will be applied to investigate more interesting models, "between" S Babel and S x .
We say that the optimal model of α bits or less witness h x (α). Extremal points of h x (α) are essentially determined by S Babel and S x , as shown in Table 1. It follows that An upper bound for h x is prescribed by noticing that a more complex model, S , can be built from a previously described one, S, by including into the description of S the first bits of index of x ∈ S. In this case, for each bit of index specified, the log-cardinality of the resulting model reduces by one. This implies that the overall slope of the structure function must be ≤−1. (A knowledgeable reader may frown upon this coarse argument because prefix technicalities demand a more careful analysis as is done in Section 3.1. Such an analysis shows that the relations as presented here hold up to logarithmic fluctuations.) Applying this argument to S Babel , we conclude that the graph of h x (α) is upper-bounded by the line n + K(n) − α.
A lower bound is obtained from applying Equation (3) to the model S witnessing h x (α). In such a case, K(S) ≤ α and log |S| = h x (α), so This means that the graph of h x (α) always sits above the line K(x) − α, known as the sufficiency line. The above inequality turns into an equality (up to a logarithmic term) if and only if the witness S is a sufficient model, by definition. Thus, this sufficiency line is reached by the structure function when enough bits of model description are available to formulate a sufficient statistics for x. Once the structure function reaches the sufficiency line, it stays near it, within logarithmic precision, because it is then bounded above and below by the −1 slope linear regime. The sufficiency line is always reached because S x qualifies as a sufficient model.
For concreteness, a plot of h x (α), for some string x of length n, is given in Figure 2. In this example, the string x is such that optimal models of complexity smaller than α M are not very insightful; they give as little information about x as a raw recitation of the first bits of x. Indeed, bits describing those models are used as inefficiently as an enumeration of x. In sharp contrast, S M is exploiting complex structures in x to efficiently constrain the size of the resulting set. It is fundamentally different from the optimal model of α M − 1 bits as it does not recite trivial properties of x, but rather expresses some distinguishing property of the data. Indeed, from α M bits of model, the uncertainty about x is decreased by much more than α M bits, as x is now known to belong to a much smaller set. In this example, S M is the minimal sufficient statistics.
The complexity of the minimal sufficient statistics, α M , is known as the sophistication of the string x, which captures the amount of algorithmic information needed to grasp all structures-or regularities-of the string. Technically, here we refer to set-sophistication, as defined in [22], since sophistication has been originally defined [23] through total functions as model classes instead of finite sets. Importantly, Vitányi has investigated [24] three different classes of model: finite sets, probability distributions (or statistical ensembles, cf. the following section) and total functions. Although they may appear to be of increasing generality, he shows that they are not. Any model of a particular class defines a model in the other two classes of the same complexity (up to a logarithmic term) and log-cardinality (or analogue). Because algorithmic complexity is uncomputable, so is the structure function. However, it can be upper semi-computed, which means that there is an algorithm that keeps outputting better upper bounds of the structure function until it eventually reaches the actual structure function. When this happens, the algorithm does not halt, as it keeps looking for better upper bounds, not knowing that this is in vain. In our generic context of finding explanations for observation data, this upper semi-computation can be seen to be realized by the scientific enterprise, which seeks and finds simpler and better models.

Algorithmic Connections in Physics
Ideas of using AIT and non-probabilistic statistics to enhance the understanding of physical concepts are not new. For example, expanding on the famous Landauer principle, Bennett [25] suggested that thermodynamics is more a theory of computation than a theory of probability, and so better rooted in AIT than in Shannon information theory. Based on his work, Zurek proposed [26] the notion of physical entropy, which generalizes thermodynamic entropy to ensure consistency. In the case of a system with microstate x, the physical entropy is defined based on a statistical ensemble P. The latter is very similar to an algorithmic model for x, except that in general a non-uniform probability distribution governs the elements of P, so the amount of information needed to specify an element x ∈ P, on average, is given by Shannon entropy H(P) = − ∑ x P(x ) log P(x ). Important paradoxes, such as the famous Maxwell's demon [25,27] or Gibbs' paradox [28], appear when it is realized that the ensemble P, and hence the entropy of the system, depends upon the knowledge d held by the agent, i.e., P = P d . Such knowledge is usually given by macroscopic observations such as temperature, volume and pressure, and defines an ensemble P d by the principle of maximal ignorance [29]. However, a more knowledgeableor better equipped-agent may gather more information d about the microstate, which in turn defines a more precise ensemble P x. This leads to incompatible measures of entropy. Zurek's physical entropy S d includes the algorithmic information contained in d as an additional cost to the overall entropy measure of the system, Note that the similarity with Equation (3) is not a mere coincidence. Zurek's physical complexity encompasses a two-part description of the microstate. It first describes a model-or an ensemble-for it, and second, it gives the residual information to obtain from the ensemble to the microstate, on average. In fact, when the ensemble takes a uniform distribution over all its possible elements, Shannon's entropy H(P) reduces to the log-cardinality of the ensemble, which is, up to a k B ln 2 factor, Boltzmann's entropy.
With sufficient data d, the physical entropy S d gets close to the complexity of the microstate K(x). The ensemble P d is then analogous to a sufficient statistics. Indeed, Baumeler and Wolf suggest [30] taking the minimal sufficient statistics as an objective-observerindependent-statistical ensemble (they call it the macrostate). Gell-Mann and Lloyd define [10] the complexity K(P d ) of such a minimal sufficient ensemble to be the Effective Complexity of x. However, because of Vitányi's aforementioned equivalence between model classes, effective complexity is essentially the same idea as sophistication, which is why they also coincide numerically (see also [31] (lemma 21)). Finally, Müller and Szkola [31] have shown that strings of high effective complexity must have a very large logical depth, an idea to which we shall come back in Appendix A.

Defining Emergence
The previous discussion of the Kolmogorov structure function made manifest the fact that a drop like the one displayed in Figure 2 at complexity level α M is associated with a distinguished model that accounts for meaningful properties of the data. In this example, all such properties were reflected in the description of S M . In general, however, structure functions need not be characterized by a single drop. In this spirit, what should be thought of a string whose structure function has many drops, as displayed in Figure 3? With only a few bits of model, not much can be apprehended of x. With slightly more bits, there is a first model, S 1 , capturing some useful properties of x, which leads to a more concise two-part description. Allowing even more bits, a second model S 2 is possible; while being more complex, this second model reflects more properties of x in such a way as to yield an even smaller two-part description. This series of models continues as the allowed complexity increases. Eventually, the structure function reaches the minimal sufficient statistics S M , after which more complex models are of no help in capturing meaningful properties of x. Figure 3. A structure function with many drops.
These observations illustrate that models can prove useful when not displaying all relevant properties of the data. Those "partial" models, while not sufficient, enable a most efficient description of the data with respect to all models of lower or equal complexity. Thus, in the same way that a model witnessing the minimal sufficient statistics is understood to capture the meaningful properties of the data, those intermediate models can be thought of as capturing only some of those meaningful properties. It is from this notion that the proposed definition of emergence is constructed.
Before going any further, a point needs to be addressed: Do strings with a structure function of many drops actually exist? Yes. In [32], it is shown that all shapes are possible, i.e., for any graph exhibiting the necessary properties mentioned in the previous section, there exists a string whose structure function lies within a logarithmic resolution of the graph.
The main idea of our proposal is to relate emergence to the phenomenon by which the experimental data x exhibits a structure function with many drops. They feature regularities that can be grasped at different levels of complexity.

Towards a Definition
In order to sharply define the models corresponding to drops of the structure function, and to make precise in which sense these are "new" and "understand" more properties, we construct a modified structure function upon which we formalize these notions. This shall take us to a definition of emergence.

Index Models
As discussed briefly in the previous section, one can construct models canonically from a given model by appending to it bits of the index of the data x in that model.
is to first execute the (self-delimiting) program that computes S, and then concatenate the following program: The following program has (i + c) bits.
Among the strings of S, keep those whose index start with The first line of the routine is only for the sake of self-delimitation. Note that this concrete description of S[i] implies Furthermore, for every bit of index given, the model S[i] x so defined contains half-fewer elements than S does. Hence, As can be seen from the program displayed in Equation (4), because of self-delimitation, specifying i bits of index requires more than i extra bits of model description-it requires γ(i) := i + K(i) + c , where c accounts for the constant-sized part of program (4). An inverse ofγ(i) can be defined as which represents the number of index bits that can be specified with γ extra bits of model description. Denoting the index model S(γ) := S[ī(γ)] and noticing that the difference between γ andī(γ) is of logarithmic magnitude, one finds that

A Modified Structure Function
We prescribe a procedure that associates to each complexity level α a model S (α) which has log-cardinality very close to h x (α). Intuitively, from lower to higher complexity, α is mapped to the witness of h x (α) whenever α follows a large enough drop of the structure function; furthermore, whenever the structure function is in a slope −1 regime, α is mapped to an index model that is derived from the last witness of h x (α).
Formally, let α 0 be the smallest complexity threshold for which h x (α 0 ) is defined. The sequence of models {S (α) } is defined recursively through To determine whether a drop at complexity level α occurs or not, and so whether a new witness of the structure function is introduced or not, the length α + h x (α) of the two-part description at complexity level α is compared to that of the two-part description permitted by the previous model, S (k α−1 ) . If they differ by more than a threshold quantity Q(k α−1 ), a drop occurs. Alternatively, a drop can also be thought to occur if the structure function falls below the −1 slope regime by more than Q(k α−1 ). This quantity is given by As a final remark about Equation (6), a unique witness of h x (α) is determined as the model that has log |S| ≤ h x (α) and is produced by the first program of length α to halt with output S.
It follows from this definition thath x lies within an additive logarithmic term above h x , and the two functions coincide after a drop. Why define a modified structure functionh x , which is very close to the original structure function h x ? First, a "drop" of the structure function is clearly identified: it corresponds to a point in the construction of the modified structure function where the model used is updated rather than built with bits of index. Second, Equation (6) keeps track of the actual models used at each complexity threshold as witnesses ofh x (α). They are of only two kinds, either updated to a new witness of h x (α) or built with more bits of index. In the original structure function, for two points α and β in a slope −1 regime, nothing guarantees that the models witnessing h x (α) and h x (β) build on the same idea. They could a priori be completely different models, capturing completely different properties about the string x, but it just happens that the difference of their log-cardinality is roughly β − α. However, the defining models ofh x are constructed in a way that the −1 slope forces the models to reflect the same idea. They simply contain more or less of the index of x. It is the departure from the slope −1 regime in the functionh x that indicates that a new model is used, one that intuitively captures other properties of x. These models shall be the ones of interest.

Minimal Partial Models as a Signature of Emergence
We have emphasized in the construction of the modified structure function a difference between the slope −1 regime and the drops of the structure function. Indeed, while the former amounts to index models, the latter corresponds to relevant yet partial models. These will be central to our proposed definition of emergence.
The set of numbers {k α } α∈{α 0 ,...,K(x)+O(1)} corresponds to the set of α's for which there are drops in the structure function.

Definition 5. (Emergence)
Emergence is the phenomenon characterized by observation data that display several minimal partial models.
It can be seen that the above definition maintains the generality expected of the notion of emergence, allowing for it to be applied in many different contexts. Moreover, it will be seen to allow for a mathematical treatment of various related notions.
In view of the comments in Section 1.2, emergence is a function of the observation string x and not necessarily of the real object that x is purported to represent. For instance, in the case of the dishonest scientist who disregards the object under investigation to give bits at whim, any emergence displayed by x would arise from the system that produced those bits-the scientist's brain, which is itself a part of reality.

Quantifying Emergence
Under the proposed definition of emergence, we develop quantitative statements. In this section, three theorems are presented. Ideas related to these results have previously been developed in algorithmic statistics [20,32,33].
To avoid many technicalities and precisions, the theorems presented in this section are formulated up to logarithmic error terms. More precise statements together with their corresponding proofs are given in Appendix A.

The Data Specifies the Minimal Partial Models
The first theorem confirms a basic intuition. The minimal partial models should be thought of as optimized ways to give the structural information about x, and so it should contain almost only information about x. The following theorem confirms that this is the case. Most of the algorithmic information of the minimal partial models is in fact information about x. Using the chain rule of Equation (2) entails rephrasing the statement as so producing S i in order to get x is not a waste; in fact, it is almost completely a part of the algorithmic information of x.
Proof. We give a program q of length O(log n) that computes S i out of x. q : Give explicitly K(S i ) and log|S i | Run all programs p of length K(S i ) in parallel If p halts with U (p) = S : If log|S| ≤ log|S i | and x ∈ S: Print S and halt.
Only the first line of q has a non-constant length, and since both K(S i ) and log |S i | are smaller quantities than n, giving them explicitly requires O(log n) bits.

Partial Understanding
We now justify the use of the term "partial" to qualify the non-sufficient minimal partial models. Intuitively, sharp drops of the structure function should be in correspondence with non-trivial properties of the underlying string. The minimal partial models at those points should encompass an "understanding" of these properties. Naturally, the magnitude of this understanding could be equated to the size of the drop. Theorem 2 confirms this idea, when "understanding" holds the following meaning.
In the context of AIT, understanding amounts to reducing redundancy, as a good explanation is a simple rule that accounts for a substantial specification of the data. For instance, when one understands a grammar rule of some foreign language, that rule can be referred to in order to explain its many different instantiations. Those instantiations are redundant, and once the grammar rule is specified, this redundancy is reduced. The redundancy of a string is thus the number of bits of a string that are not irreducible algorithmic information. In other words, it is the compressible part of x. Redundancy could then be thought of as a quantification of how much there is to be understood about x upon learning x * . Comparing x to x * , however, is an all-or-nothing approach, and the purpose of non-probabilistic statistics is to make sense of partial understanding by studying (two-part) programs for x that interpolate between the "Print x" and the x * explanations. The next definition generalizes redundancy so that it can be relative to an algorithmic model. We can then explore how much each minimal partial model reduces the randomness deficiency-or understands-the data x. Define d i as the height of the drop just before getting to S i , namely,

Theorem 2.
The height of the i-th drop measures how much more S i reduces the randomness deficiency, compared to S i−1 , i.e., Proof. Using the chain rule in Equation (2) twice, which amounts to a bayesian inversion, and Theorem 1, With the help of Figure 4, and recalling that if γ extra bits of model description are available,ī(γ) = γ + O(log n) bits of index can be given, observe that Using Equation (7), We can then interpret the algorithmic information in the minimal partial models as the algorithmic information of x that enables a reduction of the redundancy of x. This reduction of redundancy can be quantified by the sum of all previous drops, and the amount of redundancy left to be reduced is the sum of the drops to come. When the minimal sufficient statistic is described, with only α M bits of the algorithmic information in x, the redundancy of x is completely reduced. The remaining information left to specify is then the index of x ∈ S M , which is itself irreducible algorithmic information in x. However, this information is not the relevant structural information about x-it is the incidental information.

Hierarchy of Minimal Partial Models
The following theorem shows that the algorithmic information in the minimal partial models is organized in a nested structure, namely, the information in the simpler minimal partial models is mostly contained in the more complex ones. Put differently, the complex minimal partial models can be used to compute the simpler ones with some short advice.
The proof is relegated to Appendix A. Note that when the structure function is very steep, the values {α i } which mark the complexity of the minimal partial models in this steep regime are close to one another, so the leading term α i − α i−1 is small. In the limit where the (actual) structure function is so steep that it actually drops by more than Q(α i − 1) between the two neighbouring points α i − 1 and α i , then the leading term vanishes, and one finds K(S i |S j ) = O(log n). This is formally stated and proved in the appendix (see Lemma A2).

Extending Concepts
We revisit the notions of coarse-graining and boundary conditions, broadening their scope.

A Notion of Coarse-Graining
Many approaches to emergence appeal to some notion of coarse-graining. For instance, the relevant quantities of a physical system might correspond to functions over state space. In this case, an important tool consists in averaging those quantities over regions of that space, retaining only the large scale structures, as is done in the method of effective field theories. In the context of algorithmic statistics, coarse-graining will be seen as a special case of what we will call regraining. We begin by defining the notion of coarse-graining precisely.
In set theory, coarse-grainings are defined from some mother set Ω. A partition P of Ω is a collection of disjoint and non-empty subsets such that their union gives back Ω. Let P fine and P coarse be two partitions of Ω. If every element in P fine is a subset of some element of P coarse , then P fine is a refinement of P coarse and P coarse is a coarse-graining of P fine . In physics, those partitions are usually specified through non-injective functions globally defined on the state space via the pre-images.
The key point of non-probabilistic statistics is to investigate an individual object x, without needing to refer to other x in the set of bit strings. Hence, algorithmic models are disconnected from the notion of partition, since a single set is defined specifically for x, with no requirement to define a corresponding set for x . As such, algorithmic models do not partition bit strings. Still, an algorithmic model A x could be qualified as a model coarse-graining of a B x if B ⊆ A. This type of model coarse-graining in fact occurs in the regime of index models. This motivates the extended notion of regraining, which is simply a change from some model A x to B x, where neither model needs to be a subset of the other. It is qualified as a fine regraining if |B| < |A| and a coarse one if |B| > |A|. Model coarse-grainings are particular cases of regrainings.
The optimal regraining corresponds to jumping along the minimal partial models. The optimal coarse regraining occurs in the direction from S M to S 1 , and corresponds to modelling fewer and fewer features of the data x to the benefit of having simpler and simpler models. It is optimal in the sense that this procedure will yield the best possible models over all complexity thresholds. As opposed to the usual approaches, where the coarse-graining occurs by averaging over specific variables (space, momentum, time. . . ), our notion of regraining is parametrized by theory size, and so encompasses all computable ways in which a coarse-graining can be done. If properties of a physical system are such that the optimal coarse regraining happens by averaging over a specific variable, like space configuration, then the algorithmic regraining shall boil down to the usual method.

Boundary Conditions
Recall from Section 1.2 that an intrinsic difficulty of scientific investigation is that the recorded data x never perfectly reflects a single system. Even if we leave aside the effect of the measurement apparatus and the scientist on the data, it remains that systems are never completely isolated from an environment. As any interaction mediates an exchange of information, the effect of a large and complex environment will be modelled as random noise in models of small complexity. (An example of this situation is given by the dissipation-fluctuation theorem [34] that relates dissipative interactions in a system to the statistical fluctuations around its equilibrium point. Indeed, this theorem relates dissipation, an irreversible process that does not preserve information, with noise in the form of statistical fluctuations.) However, if the string x is sufficiently detailed, some structures of the environmental "noise" are grasped by models complex enough. This highlights that some information in x may be explained by models of great complexity but remain unexplained by a simple model. Such information can then be thought to reside outside of a simple model, namely, in its data-to-model code. This suggests that one should interpret the data-to-model code as the boundary of the model. In this definition, the scope of the term is broadened from its physical meaning, so that it can be thought of as the boundary of a model S, namely, the part of the system that generated the observational data x that is not modelled by S. The remaining structure in x is then viewed as coming from non-typical boundary conditions, often arising from interactions with an environment. In the case of the minimal sufficient statistics S M , the typicality of x in S M captures the fact that the boundary conditions are arbitrary with respect to the model.
The traditional space-time boundary conditions (e.g., the initial position and velocity of a particle) of a system are an example of what is usually relegated to the data-to-model code, as models usually do not aim at explaining them. Another example is the precise values of mechanical friction coefficients. Within classical mechanics, these values come from outside the theory and would thus be a part of the boundary conditions when understood as per Definition 8. However, with more precise observations, one could explain the values of the coefficients from a more precise model that encompasses molecular interactions. More examples are provided in the following section.

Applications
The versatility of the proposed approach to emergence is now illustrated through some applications. This section is not meant to be an exhaustive review of the possible uses of these definitions, but should rather be understood as a complement to the main exposition whose purpose is twofold. We first illustrate the ideas developed with a concocted example, then we connect some of them to dynamical systems in physics.

Simulation of a 2D Gas Toy Model
As a first application, we consider a toy model for a non-interacting 2D gas on a lattice. The gas is taken to be spatially confined on an L × L grid with a discrete time evolution. Using a pseudorandom number generator, we choose an initial position and momentum for each of the N particles. Each momentum is only a direction in the set {l, r, u, d}, corresponding to left, right, up and down. The gas then evolves according to simple rules. A single particle, represented by a 1 in the lattice, just keeps its trajectory and momentum, as in Figure 5. When it bounces off a boundary, its momentum gets flipped, as in Figure 6. Intersecting trajectories are represented as displayed in Figure 7.   At any time (including the initial time), if two or more particles are at the same site, we simply write down the number of particles in the site and keep track of the momenta. As an observation x, we extract for each of the first T time steps the state in configuration space (i.e., we ignore momentum). One visual way to encode the state in configuration space is to write in each of the L 2 sites a 0, and write to the left of it, in unary, the number of particles in that site. The table of 0s and 1s, can then be expanded into a string. For instance, a 3 × 3 grid example of this encoding is given in Figure 8. At each time step, the bit string corresponding to the configuration state has one 0 for each site, and one 1 for each particle, so its total length is conserved in time and thus the observational data x could be obtained by a mere concatenation of the T configuration encodings. To have an idea of the course of x's structure function, models can be sought and used to determine upper bounds. In the case of this simulated 2D gas, x comes from the known context of the simulation, which provides important clues to find models other than the obvious S Babel and {x}. Gas with the boundary conditions. Even simpler models can be made by pushing into the boundaries the particular values of the external parameters L, N or T. For illustration, we make the argument only for T. Let T be expressed by τ bits in a binary expansion. A simpler model than S L,N,T Gas can be given by producing, for each possible initial condition, all histories of length smaller than 2 τ . We denote this set as S L,N,<2 τ Gas . Its cardinality is 2 τ − 1 times bigger than S L,N,T , thus adding τ to the log-cardinality axis. If T was a random number, and exactly τ bits would be saved on the complexity axis, since only τ, as opposed to T, would be needed to compute the model. In general, however, T is not algorithmically random-it may also contain structures. Since S L,N,<2 τ Gas only encodes τ, a basic upper bound for T, most of the information about T lies outside of what is modelled; it is again the index of x that carries this information as boundary conditions with respect to that model. In a similar fashion, other simple models can be introduced for L and N, pushing again their information into the boundaries of the models. The models S L,N,T Gas and S L,N,<2 τ Gas provide upper bounds to the structure function. Yet, as presented in Figure 9, upper bounds to the structure function still leave one uncertain about the actual course of the structure function. In particular, in the simulation of the gas, the initial conditions were not algorithmically random, as they came from a pseudorandom number generator-a short program which, on a short inputted seed, outputs a sufficiently long string. Hence, the program in the random number generator, together with its seed, is shorter than the length of the initial conditions it generated. The actual structure function reflects this through one more drop at a higher level of complexity. The hypothetical witness of this drop, S RNG , is a model that explains the initial conditions as coming from the pseudorandom number generator. It is the set of all gas histories compatible with the dynamics previously described, and where the initial conditions have been generated with the pseudorandom program, with the seed being relegated to the data-to-model code. If the slope after S RNG remains in a slope −1 regime, the seed is typical. However, in reality, the seed has been produced by another physical system, for instance, by the programmer. Yet again, if the seed is long enough, the structure function could reveal more drops which capture more structures, for instance, the favourite keys of the programmer who entered the seed on his keyboard. This process will go on until all that can be explained has been explained. This example makes clear that the notion of boundary conditions really refers to a theory (or an algorithmic model), and they are fixed somewhat arbitrarily, when the users of the theory are satisfied with their notion of the system that is being modelled. In this case, if what we want to model is the gas, then S L,N,T Gas is good enough, and it is practical to declare that the initial state is typical. However, the reality may be quite different, and what we prescribe as a boundary condition to our theory may in fact be explained by a more complex, deeper theory.

Dynamical Systems
In this second application, we investigate how the notions introduced in this paper appear in the context of dynamical systems. We begin by documenting how the concept of integrability and chaos can be cast in the language of algorithmic information theory. This is followed by an account of how thermodynamics can be seen to emerge, under the proposed definition of emergence, from the application of statistical mechanics to complex dynamical systems.

From Integrability to Chaos
Consider a generic classical system with Hamiltonian H and where the state space M is indexed by a set of real coordinates X = {q i , p i } i∈{1,...,dimM/2} ∈ M. (More precisely, M is a symplectic manifold parametrized locally by real coordinates forming an atlas.) Solutions to the dynamics are curves in M describing the evolution of the state in time. Specifying M, H and an initial point X 0 ∈ M singles out a unique solution curve X t of the dynamics. As a rudimentary formalization of some observation of the system, consider a bounded observable represented by a function f with f : M → [0, 1]. A discrete sequence is constructed from its evaluation { f (X jτ )} j∈{1,...,N} at a regular time interval 0 < τ ∈ Q with negligible K(τ). As this sequence is to represent a series of measurements, one must restrict its resolution, since measured values are always constrained to a finite resolution. For a real number α, we denote by [α] k the truncation of its binary expansion after the first k bits beyond the decimal point such that |[α] k − α| ≤ 2 −k . This truncation effectively restricts the resolution to k bits as the measurement function is upper-bounded by 1. Denoting by f k j ≡ [ f (X jτ )] k the restricted measurements, the recorded observational data string x is then an encoding of the sequence of measurements: We now wish to characterize the complexity of the data string x and study its asymptotic behaviour when the length N of the measurement sequence is increasing. First, one must formulate a meaningful upper bound for K(x). To that end, we require f to preserve information, that is, A trivial bound is then given by the bit length of the encoded sequence of measurements; thus, However, the regularity provided by the laws of motion implies that this bound is not strict. Indeed, given the Hamiltonian H and the manifold M, the machinery of symplectic geometry specifies the dynamical evolution as a set of differential equations that we will denote as H, M . These equations can be integrated numerically from the initial conditions X 0 to obtain f j to a desired precision. These remarks, together with the stated condition on f , imply that The above can be further simplified in view of studying the asymptotic behaviour in N by observing that the dynamical laws M, H , the time interval τ and the resolution k are fixed and independent of N. Thus, as the length of x is scaled by increasing N, they can be taken to be constant. (To simplify the analysis, it is tacitly assumed that the dynamical laws are simple in the sense that the coefficients of the differential equations in H, M are of finite complexity.) Hence, one has Remembering that X 0 encodes the initial conditions, which are a set of real numbers that cannot be constructively specified in general, one is left with a conundrum. Indeed, if X 0 encodes typical real numbers, the upper bound (9) is trivial as the right-hand side is infinite. However, only a finite precision in the initial conditions is required in order to integrate the system to a given precision in the final result. Thus, the resolution in X 0 required is only as much as is needed to compute { f k j } j∈{0,1,...,N} . As such, the asymptotic behaviour of K(x) for N → ∞ is determined by the scaling in the required resolution. A chaotic dynamical system is often characterized by an exponential divergence in the evolution of nearby initial configurations, namely, where | · | denotes a metric on M and λ is known as the Lyapunov exponent. (It is here assumed that the Lyapunov exponent is constant and unique, which is not always the case.) In such a chaotic system, so k bits of precision on X jτ can be achieved by k + λ j bits of precision on X 0 . Therefore, the computation of X Nτ from the initial condition is more efficient-in terms of description length-than straightforward enumeration if k + λ N ≤ kN, or equivalently, if This means that for some values of the Lyapunov exponent λ, time interval τ and precision k, it could be more efficient to simply recite the observed data { f k j } j∈{1,...,N} as a genuinely random string. However, no matter how large the Lyapunov exponent is, there are time steps τ small enough to make it more efficient to calculate { f k j } j∈{1,...,N} from enough bits of initial conditions. Concretely, the precision on the initial conditions that can be obtained is bounded by the resolution of measurement devices. A more practical approach accounts for this with a fixed resolution k > k in the initial conditions and is thus limited to the truncation [X 0 ] k . This, together with the Lyapunov exponent of the system under consideration, determines an interval of predictability within which the observational data x can be compressed. To preserve predictability beyond this interval, one is forced to update one's knowledge of the state of the system with a measurement. The phenomenon is wellknown within chaos theory and shows up as a fundamental limitation to the predictability of such systems. Seen from the algorithmic information lens, chaos can be understood as follows: the bits of the state's description that are initially far away from the decimal point, and hence apparently irrelevant, make their way closer to the decimal point, becoming relevant.
Dynamical systems can generally be organized by considering the asymptotic of the string of measurements x with N → ∞. At one end of the spectrum lie integrable systems, where k bits of knowledge of X 0 can be used all the way through to compute the k bits of f k N . Those include systems where integration can be carried symbolically without an accumulation of errors. On the other side of this spectrum are chaotic systems, where k + λ N bits of X 0 are required to compute the k bits of f k N . Similar classification schemes for dynamical systems that account for integrability and the appearance of chaos based on computational complexity have been proposed previously [35]. An algorithmic perspective on dynamical systems brings the possibility of considering other types of systems, where k + g(N) bits of X 0 can be used to compute the k bits of f k N , with g(N) some a priori generic function.

Thermodynamics and Statistical Mechanics
Statistical mechanics posits the ergodicity of a complex dynamical system in order to obtain a partial, yet useful, description of its behaviour. This partial description is mostly understood to refer to the macroscopic description of a system displaying intractable microscopic descriptions. The generic approach is as follows. Starting again with a Hamiltonian and the associated phase space M, one first investigates the quantities conserved by the time evolution. By fixing those conserved quantities, one establishes constraints that restrict the dimensionality of the accessible phase space. Properly defined, those constraints effectively foliate the phase space into a family of submanifolds F ⊆ M that are each preserved by time evolution. The ergodic hypothesis now posits that the curves X t produced by an initial point X 0 ∈ F under the time evolution are dense in each submanifold F such that the time average value of an observable O : M → R, over such a curve is equal to the average of the same quantity over a uniform measure on each submanifold F, for µ the uniform measure over F. This uniform measure over the submanifolds F is often specified indirectly in terms of the Boltzmann weights of a state X ∈ M over the submanifolds. With the above in mind, thermodynamics can be seen as the study of the interrelation of a relevant collection of macroscopic observables {O i }, expressing the change in the value of some observables in terms of the change in the value of the others. Such a thermodynamic description of a complex system is partial yet useful and relevant to the scale at which one would like to investigate the system.
Let us now concentrate on how this very generic picture of statistical mechanics and its relation to thermodynamics fits under our proposed definition of emergence. We first define the truncation [F] k of a submanifold F ⊂ M to resolution k as the truncation of all coordinates of F to a k-bit resolution. (Technically this truncation depends on the chart, but we take an encoding of F to include an atlas and one for [F] k to include a prescription on the choice of the chart in which to truncate each points.) Then, positing the ergodicity of the system under study enables a direct reframing of statistical mechanics in terms of the ideas of this paper. Indeed, the postulated uniform measure on submanifolds of the phase space M amounts to postulating the corresponding microscopic states in a submanifold to be equally likely under time evolution. In other words, for some large enough finite time interval τ, the sequence nτ ] k } i∈{1,...,dimF}, n∈{1,...,N} , for X (i) i∈{1,...,dimF} coordinates on F, is a typical sample of the truncated submanifold [F] k . The lower bound on the time interval τ that needs to be satisfied for the above to hold is related to the Lyapunov exponent of the system. Indeed, such a bound corresponds to time intervals satisfying the converse of Equation (10). In such a case, x N is essentially an algorithmically random string. From this observation, it follows that for a sufficiently large time interval, which indicates that the model [F] k for the string x N is an algorithmic sufficient statistics. The above discussion emphasized how thermodynamics, together with the ergodic hypothesis, amount to postulating that the models in Equation (11) associated with the decomposition into invariant submanifolds are sufficient. Indeed, a thermodynamical description of the system at equilibrium is in correspondence with such a decomposition of the phase space, provided that the conserved quantities that define the submanifolds are taken to be the thermodynamical variables. Note that [F] k is not a sufficient statistics if and only if the ergodic hypothesis fails. In such a case, one may still consider [F] k as a model of the data, but then the time sampling of the system shall not be algorithmically random; more structures can be found and incorporated to thermodynamical model.

Conclusions
We proposed a mathematical and objective definition of emergence cast in the language of algorithmic information theory. Yielding clear and far-reaching incompleteness results [36,37], this field is rich enough to mathematize mathematics itself. In this paper, we used algorithmic information theory to mathematize epistemology, giving rise to a framework which encapsulates emergence.
Intuitively, emergence is the appearance of novel properties exhibited by a complex system. In most discussions about emergence, the criteria of novelty highly depend upon the field: the aerodynamicist may be stunned by new patterns in fluid dynamics; the biochemist, by new ways in which enzymatic networks interact. In our proposed definition, emergence occurs in "theory space": the thresholds of emergence are marked by the complexity of models that enable an overall shorter expression of the observed data. These models can be thought of as understanding new structures. Although the considered models seem very constrained (they are sets of finite bit strings), they are in fact as general as they can be since they are rooted in universal computation: any "new pattern in fluid dynamics" or "enzymatic networks interaction" that can be described is amenable to a computational process and thus an algorithmic model.
The development of our proposal was done through the minimal partial models. In Section 3, we proved that: 1.
The data specifies almost everything about the minimal partial models; 2.
The magnitude of the drop measures the amount of new understanding; 3.
More complex minimal partial models specify almost completely the simpler ones.
We also extended the notions of coarse-grainings and boundary conditions, freeing them from any specific theory. In Section 4, we considered some applications to dynamical systems and thermodynamics, and found that a new light can be shed on chaos and on the ergodic hypothesis.
The absolute generality of algorithmic-information-theoretic methods comes at the price of uncomputability. For instance, the shapes of Figure 9, in Section 4.1, are only conjectured. No program can return the structure function of a piece of data x. Nevertheless, the definition provides a precise framework to discuss the notion of emergence. A relaxation to the context of limited computational resources may be of interest in order to find concrete utility and applications in real-life computations; while some of the results obtained might no longer hold, the definition itself can still be applied within this limited computational context.
We recognize that the concepts involved in the proof of Theorem 3 in the upcoming Appendix A.3 challenge the reconciliation between our mathematical proposal and the youth of our universe. For a long string, the deep models, namely, those that occur at late drops of the structure function, are the result of programs that terminate after an unthinkably long computation. They have the largest finite running times among all programs no larger in size, so they solve the halting problem for shorter programs. This is the busy beaver regime. At a mere 14 billion years old, our universe seems too young to accommodate such computations, and this seems to hold even if we take into account the parallel computation that occurs in the observable universe. Indeed, as Bennett once put it, "the cube of Hubble's length over Plank's length is not even breakfast for the busy beaver!" However, assessing the actual computational capabilities of all physical phenomena in the entire universe is an open problem. Another way out of this conundrum is to leave the deep information in the initial conditions of the universe. One could then ask about the source of this information. Yet another possibility is that systems in nature are confined to relatively shallow models.
Facing the realization that models witnessing drops of the structure functions are made of halting information, Vereshchagin and Shen [20] wrote "This looks like a failure. [...] [I]f we start with two old recordings, we may get the same information [about their minimal sufficient statistic], which is not what we expect from a restoration procedure. Of course, there is still a chance that some Ω-number [halting information] was recorded and therefore the restoration process indeed should provide the information about it, but this looks like a very special case that hardly should happen for any practical situation." Facing this, they suggest considering models of more restricted classes or adding some additional conditions theirby looking for "strong models". On the contrary, we think that that the minimal sufficient statistics of two recordings should share information, as they inevitably share a very common origin, which the model aims to capture. That this shared information is about the halting problem simply reflects the fact that their plausible common origin is the fruit of a very long computation, and not that the recording has anything to do with an Ω-number, or any direct representation of the halting problem.
Let us conclude with philosophy of science. Consider the string x to be an encoding of all scientific data ever recorded. Scientific theories aim at explaining x by grasping patterns in the data in order to reduce its redundancy. They are expressed in terms of models that distillate the structures from the apparently noisy boundary conditions. In an effort to tell apart theories or through pure experimental curiosity, the data x always increases. Through conjectures, scientists suggest new models that can better explain the observations. Better theories arise; they find important structures in apparent noise, or unify different models under the same umbrella. These models can either be proven false by some eventually contradicting piece of data, or discarded when a simpler and fitter model is found, namely, one that sits closer to the structure function. Hence the process of scientific investigation may be identified with the upper semi-computation of the structure function of our observations. The computational impossibility of deriving the optimal models from the data is reminiscent of the non-existent scientific induction. Theories are conjectured as tentative best explanations: they can never be proven right, nor optimal, nor final, as mandated by fallibilism.
Proof. We first show the "≤" part of the above relation using a method similar to that which was used in the proof of Theorem 1, namely, we give a program q of length K(K(S i ), log |S i | | x * ) + O(1) that computes S i out of x * . q : Compute K(S i ) and log|S i | from x * Run all programs of length K(S i ) in parallel If p halts with U (p) = S : If log|S| ≤ log|S i | and U (x * ) = x ∈ S: Print S and halt.
To prove the "≥" part of the relation, we invoke the following property [38] (Theorem 69): and because log |S i | is computable from S i with a O(1) advice, Making use of the chain rule (1) and subtracting K(x) on both sides, The conclusion follows from

Appendix A.2. Precisions on the Second Theorem
The following precision of Theorem 2 leverages an alternative way of defining the randomness deficiency, which, because of its conditioning on S * rather than S, favours the precise chain rule Equation (1). Definition A1. (Strong randomness deficiency) The strong randomness deficiency of a string x, with respect to the program S * for the model S * , is Moreover, a very slow N → N function is invoked, the log * (n), defined as the number of times we can iterate taking the binary logarithm with a positive result, starting from n. For instance, log * (17) = 4 because log log log log log 17 < 0 < log log log log 17.
A very tight upper bound to where the sum is taken over all positive terms, and l(x) = |x| alternatively denotes the length of x (see [11], Equation (3.2)). Before revisiting Theorem 2, the following lemma quantifies with more precision the discrepancy between γ andī(γ) = max i {i : i + K(i) + c ≤ γ}, previously defined in Equation (5).
Since i and γ differ by a logarithmic term, they have the same length, plus or minus 1; hence, |i| = |γ| + O(1). Moreover, * (i) can be computed from |i|, so In a similar fashion, and iteratively, Once m is such that m = log * (i), log (m−1) (i) Theorem A2. The height of the i-th drop measures how much more S * i reduces the alternative randomness deficiency, compared to S * i−1 ; i.e., Proof. Using the chain rule in Equation (1) twice, which amounts to a Bayesian inversion, and Theorem A1, Lemma A1 then yields (a glance at Figure 4 can be helpful) Using Equation (A2) and the above, +O(log * (α i − α i−1 )) .

Appendix A.3. Precisions on the Third Theorem and Its Proof
The proof of Theorem 3 hinges on an unexpected relation between the structure function of the string x and an a priori completely different characterization of the patterns in x which invoke the running time of programs that compute it. This last notion is closely related to Bennett's logical depth [39], and the relation between it and the structure function has been established in [22] and further detailed in [20]. For the sake of completeness and precision, a self-contained proof is given here.
In computability theory, the unobtainable knowledge about the halting problem is usually represented as an infinite bit string which plays the role of an oracle. In algorithmic information theory, however, finite pieces of halting information can be represented finitely. One such representation is given by the first values of the busy beaver function. We denote the running time of a program p by RT(p) and the fact that the universal Turing machine halts on input p by U (p) . In words, it is the maximal finite running time of a program of i bits or less. The witness of B(i)-the last program of length i or less that halts-shall be denotedq i .
A program p is a (τ, )-program for x if U (p) = x, |p| = and B(τ − 1) < RT(p) ≤ B(τ). The latter condition means that there is a τ-bit halting program that runs for at least as long as p runs, but none of length τ − 1 or less. Out of all (τ, )-programs, of interest are the ones of minimal for a given τ, or vice versa. For a string x, the time profile is thus defined as the boundary of the region In the spirit of non-probabilistic statistics and their connection with two-part descriptions (see Section 2.2), the model S is a (α, m)-two-part description for x if x ∈ S, K(S) = α and K(S) + log |S| = m. Out of all two-part descriptions, the optimal ones are those of minimal m for a given α-they are recognized as the witnesses of the structure function. The description profile is thus defined as the graph (α, α + h x (α)), and analogously with the time profile, this graph corresponds to the boundary of all (α, m)-descriptions.
Proof idea. It shall first be established that the time profile and the description profile are close to one another. This means that the running time of the two-part descriptions of the minimal partial models is of the busy beaver magnitude. Since most of the computing time happens in the execution of the models-and not in the data-to-model code-their slow running time makes them full of halting information. The later minimal partial models then have most of the information of the prior ones.
Proof. The upcoming two propositions establish that for any data string x, what a priori looks like two different ways of characterizing the structure in x, namely, the time profile and the description profile, are in fact closely related to one another. As a set of points on the plane, they are close to one another. Again for this proof, the displayed equations hold up to an O(1) error term.

Proof.
A two-part description for x is almost a program for x. The first part is given by S * , the second part is i x , and the index of x in S. To make a well-defined program that computes x from the concatenation S * i x , one needs only a O(1) instruction as a preamble. Note that the second part of the code does not need any additional prefix for selfdelimitation, since its length |i x S | = log |S| can be computed (by the preamble) from S * . The resulting program has a length of m + O(1).
A bound on the running time is obtained on the two parts of the code. The first part runs for time ≤ B(α), the most conservative bound for an α-bit program. The second part is fast: its running time is linear in |S|, so for models that are more elaborated than S Babel (i.e., α ≥ K(n)) the running time is at most exponential in n, so smaller than B(α). The conclusion follows from 2B(α) ≤ B(α + O(1)).
Proof. From a (τ, )-program p for x, we can define the set of programs from which we define the model M = {U (q) : q ∈ M } x serving as the (α, m)-two-part description.
This last expression will prove useful later, but it can be further upper-bounded by using the fact that K(q τ ) = τ + O(1). Indeed, it should be first observed that K(q τ ) ≥ τ + O(1), since otherwise, executing the execution ofq * τ would be too short a way of computing for as long as B(τ). Second, K(q τ ) ≤ |q τ | + O(1) = τ + O(1), a statement which is stronger than the upper bound for prefix complexity given in Equation (A1). This stronger upper bound holds becauseq τ is not just any string, but a self-delimiting program, soq τ already contains within itself the information about its length. More precisely, the reference universal machine U could be transformed into an identity machine V defined on the same domain. Since U can then simulate V, the conclusion follows. Therefore, α ≤ K(q τ ) + K( |K(q τ )) ≤ τ + K( |τ) .
(Note that the chain rule as written in Equation (1) conditions on x * . Here, we condition on (x, K(x)), which is informationally equivalent to x * . In fact, (x, K(x)) is computed from x * and a O(1) advice which measures x * before executing it, while x * can be computed from (x, K(x)) by running in parallel all programs of length K(x) until one of them produces x. This program is x * .) The last line was obtained by noticing again that since q is a self-delimiting program, K(q) ≤ |q| + O(1) = + O(1) and by observing thatq τ can be computed from q, τ and an O(1) advice as there is only a constant number of τ-bit programs that halt after q-andq τ is the last one of them. See Proposition A4 for a proof of this claim. In general, the number of strings s with K(s|z) ≤ b is smaller than 2 b+1 − 1, because there are only this many programs that are short enough. Hence, log |M| ≤ log |M | ≤ − K(q τ , ) + K(τ| ) , and so m = K(M) + log |M| ≤ + K(τ| ).
For each point (α, m) of the boundary of the description profile can be drawn the course of (α + O(1), m + O(1)), as well as (α − δ(α), m − (α)). Note that a "drop" of the structure function is defined in Equation (6) precisely to ensure that the time profile has dropped of at least one. This ensures that the data x is logically deeper than expected from the minimal partial models of smaller complexity. The data thus contains more halting information that will be encompassed in the new minimal partial model. Indeed, as can be seen from Figure A1, there is a drop when the description profile drops by more than Q(α i−1 ) = (α i−1 ) + O(1), compared to the previous minimal partial model of complexity level α i−1 . More importantly for this proof, and also readable from Figure A1, the program S * i that computes the minimal partial model at complexity level α i corresponding to the drop has a running time of at least B(α i−1 − δ(α i−1 ) + O(1)), since otherwise, the program (determined by the two-part description) of length m + O(1) runs in a time that contradicts the bounds for the time profile. Hence, for i < j, S i contains at least ξ := α i−1 − δ(α i−1 ) + O(1) bits of irreducible halting information; namely, if ξ is given, the running time of S * i can be used to computeq ξ . Therefore, The model S j contains more halting information than S i , so if ξ is given, S * j can also be used to computeq ξ . Using Equation (A3), Since K(ξ) ≤ K(α i−1 , K(α i−1 ), 2 * * (n)) ≤ K(α i−1 ) + K(K(α i−1 )|α i−1 , K(α i−1 )) + K( * * (n))) ≤ K(α i−1 ) + * * * (n) , we find K(S i |S * j ) ≤ α i − α i−1 + 3K(α i−1 ) + 2( * * (n) + * * * (n)) ≤ α i − α i−1 + 3K(α i−1 ) + O(log log(n)) .

Lemma A2.
If h x (α i − 1) − h x (α i ) ≥ Q(α i − 1), then the minimal partial model S i that witnesses h x (α i ) is such that for any j > i, K(S i |S j ) ≤ 3K(α i ) + O(log log n) ≤ O(log n).
Proof. The proof is analogous to that of Theorem A3. The main difference resides in the bounds of the running time of S * i when the structure function follows a drastic drop, which in this case can be lower bounded by B(α i − δ(α i )). See Figure A2 in contrast to Figure A1. Proof. The proof is by contradiction, exhibiting a too short program forq τ−1 . Suppose that the number of programs of length τ that halt afterq τ−1 is between 2 ν and 2 ν+1 . Then, q τ−1 can be obtained by running all programs of length τ and grouping them as they halt in groups of 2 ν . By assumption, there will be a full group completed with its last halting program running for longer thanq τ−1 . If r denotes the number of this group, thenq τ−1 can be computed from τ, ν and r. The map µ : τ → 2 −τ |{p : |p| = τ & U (p) }| is lower semi-computable and it is a semi-measure, since ∑ τ µ(τ) ≤ 1. Thus, by the coding theorem [11], µ(τ) ≤ 2 −K(τ)+O (1) . Hence, and so |r| ≤ τ − K(τ) − ν + O(1). Therefore, the following relations hold up to O(1): The above holds only if ν − K(ν) = O(1), namely, only if ν = O(1).