The Algorithmic Regulator

Ruffini, Giulio

doi:10.3390/e28030257

by Giulio Ruffini ¹,²

¹ Brain Modeling Department, Neuroelectrics, 08035 Barcelona, Spain

² Barcelona Computational Foundation, 08022 Barcelona, Spain

Entropy 2026, 28(3), 257; https://doi.org/10.3390/e28030257

Submission received: 4 December 2025 / Revised: 15 February 2026 / Accepted: 20 February 2026 / Published: 26 February 2026

(This article belongs to the Special Issue Kolmogorov Complexity and Applications—Dedicated to Professor Paul Vitanyi on the Occasion of His 80th Birthday)

Abstract

The regulator theorem states that, under certain conditions, any optimal controller must embody a model of the system it regulates, grounding the idea that controllers embed, explicitly or implicitly, internal models of the controlled. This principle underpins neuroscience and predictive brain theories like the Free-Energy Principle or Kolmogorov/Algorithmic Agent theory. However, the theorem is only proven in limited settings. Here, we treat the deterministic, closed, coupled world-regulator system $(W, R)$ as a single self-delimiting program p via a constant-size wrapper that produces the world output string x fed to the regulator. We analyze regulation from the viewpoint of the algorithmic complexity of the output, $K (x)$ (regulation as compression). We define R to be a good algorithmic regulator if it reduces the algorithmic complexity of the readout relative to a null (unregulated) baseline ⌀, i.e.,

Δ = K (O_{W, ⌀}) - K (O_{W, R}) > 0 .

We then prove that the larger $Δ$ is, the more world-regulator pairs with high mutual algorithmic information are favored. More precisely, a complexity gap $Δ > 0$ yields $Pr ((W, R) ∣ x) \leq C 2^{M (W : R)} 2^{- Δ}$, making low $M (W : R)$ exponentially unlikely as $Δ$ grows. This is an AIT version of the idea that “the regulator contains a model of the world.” The framework is distribution-free, applies to individual sequences, and complements the Internal Model Principle. Beyond this necessity claim, the same coding-theorem calculus singles out a canonical scalar objective and implicates a planner. On the realized episode, a regulator behaves as if it minimized the conditional description length of the readout.

Keywords:

good regulator theorem; cybernetics; algorithmic information theory; Kolmogorov complexity; KT

1. Introduction

In the Kolmogorov Theory (KT) of consciousness, an algorithmic agent is a system that maintains (tele)homeostasis (persistence of self or kind) by learning and running succinct generative models of its world coupled to an objective function and an action planner [1,2,3]. Closely related, Active Inference (AIF) models biological agents as minimizing variational free energy under a generative model [4,5]. These frameworks suggest that “agents with world-modeling engines, objective functions, and planners” are natural minimal models of homeostasis (goal-conditioned setpoint control). But for the kinds of homeostatic systems we actually encounter in nature (cells, organisms, engineered servos), how can we tell, operationally, whether they are algorithmic agents in this sense?

Our results are explicitly distribution-free: we do not assume probability densities, and the relevant “model content” can be inferred from a single realized world–regulator instance via algorithmic mutual information. Closest in spirit are algorithmic statistics and MDL (two-part codes/individual-object sufficient statistics), which formalize “model-from-one-object” without an ensemble [6,7,8,9]. In control, “single-trajectory” data-driven methods can identify LTI dynamics from one sufficiently rich trajectory, but only under strong structural assumptions (e.g., LTI, controllability, persistency of excitation) [10]. To our knowledge, there is no analogous distribution-free result that derives a general necessity claim of internal-model content for arbitrary regulators from single-episode data without such structural assumptions; highlighting this gap makes the present AIT results all the more informative.

The classical cybernetics statement that “every good regulator of a system must be a model of that system” originates with Conant and Ashby’s 1970 paper (the Good Regulator Theorem, GRT) [11]. While influential, the GRT has been criticized for the looseness of its definitions of “model” and “goodness”, and for a proof that does not clearly deliver the headline claim [12]. In modern control theory, the rigorous statement that fills a similar conceptual niche is the Internal Model Principle (IMP): under appropriate hypotheses, perfect regulation or disturbance rejection for a given signal class requires that the controller embed a dynamical copy of the signal generator [13,14,15]. The IMP is precise (and falsifiable) within its scope, and is now a standard backbone for robust control; see [16] for a contemporary review across control, bioengineering, and neuroscience. However, the classical IMP is a linear result: for finite-dimensional LTI plants (linear time-invariant, meaning that the system matrices do not change with time) and exogenous signals generated by a finite-dimensional LTI exosystem, robust asymptotic tracking/disturbance rejection requires that the controller embed a copy of the exosystem dynamics [14]. For nonlinear systems, the appropriate generalization is the nonlinear output-regulation framework: if the regulator equations admit smooth solutions and the plant’s zero dynamics on the regulated manifold are (locally) stable, together with suitable immersion/detectability assumptions, then one can construct dynamic output-feedback regulators that embed a (possibly adaptive) internal model and achieve local or semiglobal robust regulation [17,18,19]. However, absent these structural hypotheses, a complete nonlinear analogue of the IMP with the same necessity/robustness guarantees as in the LTI case is not generally available. Table 1 compares the different regulator theorem statements with the one presented here.

In this paper, we recast the modeling requirement in a setting independent of linearity, probability, exact regulation or specific signal classes, by using algorithmic information theory (AIT). We model a world W and a regulator R as deterministic causal Turing machines that interact over interface tapes. We denote the world output by

x = O_{W}

(over some temporal horizon of length N). Our main technical claim is that regulation in the algorithmic sense, i.e., simplicity, forces algorithmic dependence between W and R.

1.1. Definition of Model

A model in the present context is a program capable of compressing (or generating) data. Similarly, “the regulator contains a model of the world” is interpreted in an algorithmic-information sense: the regulator R carries nontrivial information about W, quantified by positive mutual algorithmic information

M (W : R) > 0

(up to the standard

O (log)

slack). Equivalently, knowing R makes the shortest description of W strictly shorter,

K (W ∣ R) < K (W)

. This notion does not require R to embed a dynamical copy of W; rather, it formalizes “model content” as mutual algorithmic information.

We formalize this with the following definition:

Definition 1

(Algorithmic “internal model”). Given a fixed horizon N (implicitly conditioned), we say that R contains an internal model of W in the algorithmic sense if

M (W : R) > 0

(up to

O (log)

), equivalently

K (W ∣ R) < K (W)

. The magnitude of

M (W : R)

quantifies the amount of computable structure in W that R carries.

Here, “computable structure” means reusable regularities that admit short descriptions (rules, symmetries, constraints, mechanism parameters). Saying that R shares computable structure with W means that knowing R makes W cheaper to describe:

K (W ∣ R) < K (W)

, i.e.,

M (W : R) > 0

. This does not require that the regulator embed a dynamical copy of the exosystem (as in classical IMP); it only requires that the regulator carry algorithmic bits informative about the world’s generative mechanism (e.g., algorithmic elements, time constants, setpoints, disturbance classes, invariances).

The definition grounded on mutual algorithmic information M is further motivated by the following: (i) Machine invariance: M is invariant up to

O (1)

under changes of universal machine. (ii) Distribution-free: M is defined for individual objects (programs), not probabilistic models. (iii) Operational meaning:

M (W : R)

is precisely the codelength reduction in describing W when R is known, aligning with MDL/Occam reasoning via the Coding Theorem [9,20,21].

This is the appropriate lens for our contrastive results, and it complements the Internal Model Principle, where “model” means a dynamical replica of the Exosystem (a part of the World in our framework) under stated structural hypotheses [14,17,18,19]. Conceptually, our AIT result is complementary to the IMP: whereas the IMP states what structural content must be present in a controller to achieve perfect regulation for a given signal class [13,14,15], our results quantify how much algorithmic information the regulator must carry about the world whenever it succeeds in making the measured outcome compressible.

1.2. Regulation as Compression

We score regulation by how compressible a task-weighted error stream is. Let

x_{t}

be the weighted error and

x_{1 : T}

the T-sample string. Fix a prefix-free lossless code (e.g., a universal compressor) and define the per-sample codelength

L_{T} : = \frac{1}{T} L_{C} (x_{1 : T})

. A regulator R is better on horizon T when it makes

L_{T}

smaller than a null baseline ⌀, i.e., when the contrastive gap

Δ : = L_{T} (x; ⌀) - L_{T} (x; R)

is positive. This choice is natural: for stationary ergodic data, normalized universal codelengths converge almost surely (i.e., with probability one) to the Shannon entropy rate

h (x)

, and (under standard computability assumptions)

K (x_{1 : T}) / T = h (x) + o (1)

almost surely; thus the Kolmogorov-based criterion reduces to the Shannon criterion when those stochastic assumptions hold—while remaining meaningful outside them [9,22,23].
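The contrastive score above can be sketched in a few lines. This is only an illustration under stated assumptions: `zlib` stands in for the fixed lossless code, the two byte streams are synthetic placeholders for a quantized error signal, and the function names are hypothetical rather than taken from the paper.

```python
import random
import zlib


def per_sample_codelength(x: bytes) -> float:
    """Bits per sample under zlib, a computable stand-in for L_T."""
    if not x:
        raise ValueError("empty error stream")
    return 8 * len(zlib.compress(x, 9)) / len(x)


def contrastive_gap(x_off: bytes, x_on: bytes) -> float:
    """Delta = L_T(x; null) - L_T(x; R); a positive value means R helps."""
    return per_sample_codelength(x_off) - per_sample_codelength(x_on)


# Synthetic placeholder data (an assumption, not measured signals):
# the unregulated error spans the full byte range, the regulated one
# is confined to a few small values.
rng = random.Random(0)
T = 4096
x_off = bytes(rng.randrange(256) for _ in range(T))
x_on = bytes(rng.randrange(4) for _ in range(T))

gap = contrastive_gap(x_off, x_on)
print(f"per-sample gap: {gap:.2f} bits")
```

Any practical compressor only upper-bounds the ideal codelength, so a gap measured this way is an estimate of the contrast, not the Kolmogorov quantity itself.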

To see the connection between regulation and compression in more detail, let

h_{x_{1 : T}} (α) : = {min}_{S ∋ x_{1 : T}, K (S) \leq α} log | S |

denote the Kolmogorov structure function [24]. Regulation amounts to moving down this curve: as the regulator invests model bits (larger

α

), more regularity in

x_{1 : T}

is captured and the residual randomness

h_{x_{1 : T}} (α)

drops, approaching 0 at perfect regulation. The notion of probability emerges along this path. Replacing set-models by probabilistic models

{P_{M}}

turns the two-part description into the standard MDL form

L (x_{1 : T}; M) \approx K (M) - \sum_{t = 1}^{T} log P_{M} (x_{t}),

where the second term is the ideal codelength under

P_{M}

(Shannon coding:

- log P_{M}

) [8,9,23]. If the regulator must hedge over multiple models, the mixture/Bayes code with prior

π

uses

\bar{P} (x_{1 : T}) = \sum_{M} π (M) P_{M} (x_{1 : T})

and assigns

L_{mix} (x_{1 : T}) = - log \bar{P} (x_{1 : T}) = - log \sum_{M} π (M) P_{M} (x_{1 : T}),

a valid prefix code whose regret relative to the best single model

M^{⋆}

is bounded by

- log π (M^{⋆})

; with

π (M) \propto 2^{- K (M)}

(Solomonoff/Occam), the penalty matches the model description length [7,8,9,20,25]. Thus, probabilistic/Bayesian regulation is the coding-optimal way to descend

h_{x_{1 : T}} (α)

, aligning with the multi-model argument in [26].
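The regret bound for the mixture code can be checked numerically. The sketch below assumes a small hypothetical model class (three i.i.d. Bernoulli models with an arbitrary prior) and a made-up binary sequence; none of these choices come from the paper.

```python
import math


def codelength_bits(x, theta):
    """Ideal Shannon codelength -sum_t log2 P_theta(x_t), i.i.d. Bernoulli(theta)."""
    return -sum(math.log2(theta if bit else 1.0 - theta) for bit in x)


def mixture_codelength_bits(x, prior):
    """-log2 sum_M pi(M) P_M(x), evaluated in the log domain for stability."""
    terms = [math.log2(p) - codelength_bits(x, theta) for theta, p in prior.items()]
    top = max(terms)
    return -(top + math.log2(sum(2.0 ** (t - top) for t in terms)))


# Hypothetical data and model class (illustrative assumptions).
x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
prior = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}  # theta -> pi(theta)

best_theta = min(prior, key=lambda theta: codelength_bits(x, theta))
regret = mixture_codelength_bits(x, prior) - codelength_bits(x, best_theta)
regret_bound = -math.log2(prior[best_theta])
print(f"regret {regret:.3f} bits, bound {regret_bound:.3f} bits")
```

The mixture never does better than the best single model on a given sequence, and it never does worse by more than the prior codelength of that model, which is the sense in which hedging costs at most $- log π (M^{⋆})$ bits.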

Finally, we can treat the regulator input as an error signal quantized at fixed sensor resolution; the per-sample codelength of

x_{1 : T}

under a universal compressor converges to the entropy rate for stationary sources. For Gaussian processes,

h (x) = \frac{1}{4 π} \int_{- π}^{π} log (2 π e S_{x x} (ω)) d ω,

where

S_{x x}

is the input power spectral density [22,27] of the error signal x, so attenuating in-band sensitivity (reducing

S_{x x}

where it matters) reduces codelength [27]. In the scalar white-Gaussian case with variance

σ^{2}

,

h = \frac{1}{2} log (2 π e σ^{2})

, so smaller-amplitude fluctuations (smaller

σ

) mean lower entropy and better compressibility. In short, “compressible error” matches the classical view: good regulation removes variability/uncertainty in the task band and accords with the IMP [14,28].
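As a quick numerical check of the two Gaussian formulas, the sketch below evaluates the spectral integral with a midpoint rule and compares it with the scalar expression for a flat spectrum. The function names and grid size are illustrative assumptions, and natural logarithms are used, so values are in nats.

```python
import math


def white_gaussian_entropy(sigma: float) -> float:
    """h = 0.5 * log(2*pi*e*sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def spectral_entropy_rate(psd, n: int = 2000) -> float:
    """Midpoint-rule value of (1/(4*pi)) * integral_{-pi}^{pi} log(2*pi*e*S(w)) dw."""
    dw = 2 * math.pi / n
    total = sum(
        math.log(2 * math.pi * math.e * psd(-math.pi + (i + 0.5) * dw))
        for i in range(n)
    )
    return total * dw / (4 * math.pi)


sigma = 0.7
flat = spectral_entropy_rate(lambda w: sigma ** 2)   # white noise: S(w) = sigma^2
scalar = white_gaussian_entropy(sigma)

# Halving the error amplitude lowers the entropy by log(2) nats, i.e. one bit.
saving = white_gaussian_entropy(1.0) - white_gaussian_entropy(0.5)
```

For a flat spectrum the integral collapses to the scalar formula, and each halving of $σ$ saves one bit per sample at fixed quantization.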

In the following sections, we first describe the AIT setting and the coupled world-regulator system, and then analyze the single-episode scenario, giving a formal definition of the algorithmic regulator together with the corresponding theorems.

2. Setting

Unless stated otherwise, U is the standard three-tape universal prefix Turing machine: a read-only input tape holding a self-delimiting program p, a work tape (private scratch memory), and a write-only output tape. When we write

U (p) = x

we mean that, upon halting, the contents of the output tape equal x; the work tape is never part of the scored output. The domain of halting programs is prefix-free, so Kraft–McMillan applies and the universal a priori semimeasure

m (x) = \sum_{U (p) = x} 2^{- | p |}

is well defined. By the invariance theorem, replacing U by any other universal prefix machine (single- or multi-tape) changes all complexities only by an additive

O (1)

; all Coding-Theorem statements we use depend only on prefix-freeness and therefore remain valid up to these constants (see, e.g., [9]). Our use of prefix-free (self-delimiting) programs is a standard coding convention in AIT: it ensures that program lengths satisfy Kraft’s inequality and lets the coding theorem link description length and universal probability cleanly. Any finite description can be made self-delimiting with only

O (log n)

overhead for length n, so this is not a substantive restriction on real systems; it is a technical requirement of the description language (see Appendix A.3) [9].
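One standard way to make a description self-delimiting with logarithmic overhead is to prepend its length in an Elias gamma code. The sketch below is one such construction, chosen here for illustration; it is not claimed to be the encoding used in Appendix A.3, and the function names are hypothetical.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code of n >= 1: (len(bin(n)) - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def self_delimit(payload: str) -> str:
    """Prefix the payload with the gamma code of (length + 1)."""
    return elias_gamma(len(payload) + 1) + payload


def read_one(stream: str):
    """Decode one self-delimited payload; return (payload, rest of stream)."""
    zeros = 0
    while stream[zeros] == "0":
        zeros += 1
    length = int(stream[zeros:2 * zeros + 1], 2) - 1
    start = 2 * zeros + 1
    return stream[start:start + length], stream[start + length:]


# Two descriptions concatenated on one tape, recovered without separators.
first, rest = read_one(self_delimit("1011") + self_delimit("0"))
second, tail = read_one(rest)
overhead = len(self_delimit("1011")) - len("1011")
```

The overhead is $2 ⌊ {log}_{2} (n + 1) ⌋ + 1$ bits for a payload of length n, and the resulting codewords satisfy Kraft's inequality, which is what the coding theorem needs.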

The (prefix) Kolmogorov complexity of x is the length of its shortest description,

K (x) : = min {| p | : U (p) = x} .

Intuitively,

K (x)

is the best achievable compressed size of x on U. If

K (x) ≪ | x |

, then x has a short generative regularity; if

K (x) \approx | x |

, x is (algorithmically) random. By the invariance theorem, K is machine-independent up to an additive constant

O (1)

[9]. A fundamental limitation is that

K (\cdot)

is not computable: no algorithm can output

K (x)

for all x [9,29]. However, algorithms for upper bounds of

K (x)

exist, as we discuss below.

Given auxiliary data y on a read-only auxiliary tape, the conditional complexity

K (x ∣ y) : = min {| p | : U (p, y) = x}

is the shortest description of x given y. It operationalizes how much new information is needed to reconstruct x once y is known (e.g., “world given regulator,” or “output given model”).

The mutual algorithmic information (up to the usual

O (log)

slack) is

M (x : y) : = K (x) + K (y) - K (x, y) = K (x) - K (x ∣ y) = K (y) - K (y ∣ x) \pm O (log) .

M (x : y)

measures the algorithmically shared structure between x and y: how many bits we save when describing one with the help of the other. In our setting, “the regulator contains a model of the world” means

M (W : R) > 0

(information-theoretic dependence); this is a necessary information-theoretic analogue of “model” containment that complements—rather than replaces—the classical IMP notion of a dynamical replica.
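Since K is not computable, a practical stand-in for $M (x : y)$ replaces each complexity by a compressed length. The sketch below assumes `zlib` as the compressor and uses synthetic byte strings; it gives a rough, compressor-dependent estimate and is not a method stated in the paper.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Compressed size in bits: a computable upper-bound surrogate for K."""
    return 8 * len(zlib.compress(data, 9))


def mutual_info_estimate(x: bytes, y: bytes) -> int:
    """Surrogate for K(x) + K(y) - K(x, y), with concatenation as the pair."""
    return compressed_bits(x) + compressed_bits(y) - compressed_bits(x + y)


# Synthetic examples (assumptions for illustration only).
rng = random.Random(1)
x = bytes(rng.randrange(256) for _ in range(2000))
y_related = x                                              # shares all structure with x
y_unrelated = bytes(rng.randrange(256) for _ in range(2000))

related = mutual_info_estimate(x, y_related)
unrelated = mutual_info_estimate(x, y_unrelated)
```

The estimate is only as good as the compressor: shared structure that `zlib` cannot detect (for example, a common pseudo-random seed) will not show up, so a low value does not establish low algorithmic mutual information.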

Intuitively, strings produced by shorter programs are more likely. Solomonoff–Levin’s universal a priori semimeasure

m (x)

and the Coding Theorem link probability and description length,

- {log}_{2} m (x) = K (x) \pm O (1),

(1)

providing a universal Occam calculus over individual strings [9,20,21,25].

In what follows, a finite temporal horizon N is fixed throughout; unless stated otherwise, we implicitly condition on N (e.g., write

K (x)

for

K (x ∣ N)

). All

O (1)

constants depend only on the choice of U (and the fixed constant-overhead wrapper that decodes

(W, R)

and simulates their coupling to print the readout), never on particular strings; see Appendix A.2.

The Coupled World-Regulator System

We work with 3-tape Turing machines W and R (see Figure 1 and Appendix A.2). We identify each machine with its minimal self-delimiting program (

| W | = K (W)

,

| R | = K (R)

) [9]. A horizon

N \in \mathbb{N}

is fixed and all complexities are conditioned on N unless otherwise stated. W and R interact causally for N steps, producing a deterministic readout

O_{W, R}^{(N)} \in {0, 1}^{N}

. The dynamical equations are

O_{W} = W (O_{R}), O_{R} = R (O_{W}) .

(2)

The performance of the regulator is evaluated from the complexity of the output,

K (x)

. Intuitively, a good regulator produces outputs of lower complexity than the unregulated case. Since

x = O_{W, R}^{(N)}

is computable from

(W, R, N)

,

K (x) \leq K (W, R) + O (1) = K (W) + K (R) - M (W : R) + O (1) .

(3)

To disentangle the role of R from the coarse event “

K (O^{(N)})

is small,” we fix a null regulator ⌀ (where R’s output is set to zero). We compare the events

E_{a}^{R} : K (O_{W, R}^{(N)}) = a \quad \text{vs.} \quad E_{b}^{⌀} : K (O_{W, ⌀}^{(N)}) = b,

(4)

with

b > a

. Event

E_{b}^{⌀}

rules out worlds that produce a simple output without regulation; the intersection

E_{a}^{R} \land E_{b}^{⌀}

isolates R’s contribution.

For notational simplicity, the on-case and off-case readouts are also expressed as

x : = O_{W, R}^{(N)}, y : = O_{W, ⌀}^{(N)} .

For a fixed time horizon, we write

O_{W}

for the full output produced by W when coupled to R.
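The on/off comparison can be made concrete with a toy closed loop. Everything in this sketch is an illustrative assumption rather than the paper's construction: the world emits a seeded pseudo-random disturbance bit, the coupling is an XOR between disturbance and control, and the regulator "contains a model" simply by knowing the seed.

```python
import random
from typing import Callable, List

Regulator = Callable[[List[int]], int]


def make_disturbance(seed: int, n: int) -> List[int]:
    """World-side disturbance: n bits from a seeded generator."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]


def run_coupled(disturbance: List[int], regulator: Regulator) -> List[int]:
    """Causal loop: the control at step t depends only on the readout before t."""
    readout: List[int] = []
    for d in disturbance:
        u = regulator(readout)
        readout.append(d ^ u)
    return readout


def null_regulator(history: List[int]) -> int:
    """The OFF case: the regulator output is always zero."""
    return 0


def make_model_regulator(seed: int, n: int) -> Regulator:
    """A regulator carrying an internal copy of the disturbance generator."""
    predicted = make_disturbance(seed, n)
    return lambda history: predicted[len(history)]


N, SEED = 256, 42
disturbance = make_disturbance(SEED, N)
x = run_coupled(disturbance, make_model_regulator(SEED, N))  # on-case readout
y = run_coupled(disturbance, null_regulator)                 # off-case readout
```

Here the on-case readout is constant while the off-case readout reproduces the raw disturbance, and the only thing W and R share is the seed, which is the kind of shared algorithmic content that $M (W : R)$ is meant to measure.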

In the next sections, we provide our main results regarding mutual information between the world and regulator, and implications for inferring agent-like behavior in the regulator.

3. Probabilistic Regulator Theorems

3.1. Posterior Form, Given the Observed x

Lemma 1

(Program posterior given x). With prefix prior

P (p) = 2^{- | p |}

and deterministic likelihood

P (x ∣ p) = 1 {U (p) = x}

,

P (p ∣ x) = \frac{2^{- | p |}}{m (x)} .

Consequently, by (5),

\frac{1}{c_{2}} 2^{K (x) - | p |} \leq P (p ∣ x) \leq \frac{1}{c_{1}} 2^{K (x) - | p |} .

Proof.

For any finite string x,

K (x) : = min {| p | : U (p) = x}, m (x) : = \sum_{p : U (p) = x} 2^{- | p |} .

(recall Equation (1)). The Coding Theorem gives machine-dependent constants

c_{1}, c_{2} > 0

with

c_{1} 2^{- K (x)} \leq m (x) \leq c_{2} 2^{- K (x)} .

(5)

Now, briefly, Bayes’ rule yields

Pr {p ∣ x} = 2^{- | p |} / m (x)

; apply (5). In more detail, place the prefix prior

P (p) = 2^{- | p |}

on programs p and use the deterministic likelihood

P (x ∣ p) = 1 {U (p) = x}

. Then, the evidence is

P (x) = m (x)

and the posterior is

P (p ∣ x) = \frac{P (x ∣ p) P (p)}{P (x)} = \begin{cases} \frac{2^{- | p |}}{m (x)}, & U (p) = x, \\ 0, & \text{otherwise} . \end{cases}

Then, for any p with

U (p) = x

,

\frac{1}{c_{2}} 2^{K (x) - | p |} \leq P (p ∣ x) = \frac{2^{- | p |}}{m (x)} \leq \frac{1}{c_{1}} 2^{K (x) - | p |} .

The relation between

K (x)

and

m (x)

holds only up to an additive

O (1)

term in K, which becomes a multiplicative constant on

m (x)

. This

O (1)

ambiguity is unavoidable and depends on the choice of universal prefix machine U;

c_{1}, c_{2}

absorb exactly this machine-dependent slack. □
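Lemma 1 can be illustrated on a finite toy machine. The table below is a hypothetical, non-universal prefix-free set of programs invented for this example; it only shows how the posterior $2^{- | p |} / m (x)$ concentrates on the shortest program for x.

```python
# Hypothetical toy machine (an assumption): a prefix-free table mapping
# halting programs to their outputs. It is NOT a universal machine.
TOY_MACHINE = {
    "0": "aaaa",
    "10": "abab",
    "110": "aaaa",
    "1110": "abab",
    "11110": "aaaa",
}


def m(x: str) -> float:
    """Toy a priori weight of x: sum of 2^-|p| over programs printing x."""
    return sum(2.0 ** -len(p) for p, out in TOY_MACHINE.items() if out == x)


def K(x: str) -> int:
    """Toy complexity of x: length of the shortest program printing x."""
    return min(len(p) for p, out in TOY_MACHINE.items() if out == x)


def posterior(p: str, x: str) -> float:
    """P(p | x) = 2^-|p| / m(x) if p prints x, else 0."""
    return 2.0 ** -len(p) / m(x) if TOY_MACHINE.get(p) == x else 0.0


x = "aaaa"
explanations = [p for p, out in TOY_MACHINE.items() if out == x]
for p in explanations:
    print(p, round(posterior(p, x), 4), "<=", 2.0 ** (K(x) - len(p)))
```

On this table $m (x) \geq 2^{- K (x)}$ holds trivially, so the upper bound $P (p ∣ x) \leq 2^{K (x) - | p |}$ holds with constant 1; on a universal machine the same statement carries the machine-dependent constants of the lemma.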

Now, in our setting the world W and regulator R are programs that interact for N steps, producing the on-case readout

x : = O_{W, R}^{(N)}

. A fixed, constant-overhead wrapper decodes a shortest description of

(W, R)

and simulates the coupling to print x (decode + simulate); if

p_{W, R}

denotes this canonical code, then

| p_{W, R} | = K (W, R) + O (1), P ((W, R) ∣ x) \in [\frac{1}{{\tilde{c}}_{2}}, \frac{1}{{\tilde{c}}_{1}}] \cdot 2^{K (x) - K (W, R)},

(6)

for constants

{\tilde{c}}_{i} : = 2^{O (1)} c_{i}

.

Now we can use the definition of mutual algorithmic information (up to the usual

O (log)

slack) to write

M (W : R) = K (W) + K (R) - K (W, R)

and derive our first result:

Theorem 1.

P ((W, R) ∣ x) \in [\frac{1}{{\tilde{c}}_{2}}, \frac{1}{{\tilde{c}}_{1}}] \cdot 2^{K (x) - K (W) - K (R) + M (W : R)} < \frac{1}{\tilde{c}} 2^{M (W : R)}

(7)

3.2. The Good Algorithmic Regulator and Posterior with Contrast

For our second result, we first define the Good Algorithmic Regulator (GAR).

Definition 2

(Good Algorithmic Regulator, contrastive). Given the on/off complexities and gap

a : = K (O_{W, R}^{(N)}), b : = K (O_{W, ⌀}^{(N)}), Δ : = b - a .

we say that R is a good algorithmic regulator of gap

Δ

for W at horizon N if

Δ > 0

.

Lemma 2

(OFF run lower-bounds the world). There exists

c_{0} = O (1)

such that

K (O_{W, ⌀}^{(N)}) \leq K (W) + c_{0} \Rightarrow K (W) \geq b - c_{0} .

Proof.

Given

(W, ⌀, N)

, the wrapper simulates the OFF dynamics and prints

O_{W, ⌀}^{(N)}

with

O (1)

overhead. □

With this definition we can now state and prove our main theorem.

Theorem 2

(Probabilistic regulator theorem). Let

O_{W, R}^{(N)}

and

E_{b}^{R}

be observed and let

Δ : = K (O_{W, ⌀}^{(N)}) - K (O_{W, R}^{(N)})

. Then, there exists

C > 0

such that

P ((W, R) ∣ O_{W, R}^{(N)}, E_{b}^{R}) \leq C \cdot 2^{M (W : R)} 2^{- Δ} .

Equivalently, every bit by which

M (W : R)

falls short of Δ costs a factor

\approx 2^{- 1}

in posterior support.

Proof.

(i) Posterior via wrapper. From Equation (6),

{log}_{2} P ((W, R) ∣ x) \leq K (x) - K (W, R) + O (1) = a - K (W, R) + O (1) .

(ii) Decompose

K (W, R)

. We use the exact mutual information

M (W : R) : = K (W) + K (R) - K (W, R)

,

K (W, R) = K (W) + K (R) - M (W : R),

hence

K (x) - K (W, R) = a - K (W) - K (R) + M (W : R) .

(iii) Insert OFF bound (where b enters). By Lemma 2,

K (W) \geq b - c_{0}

, so

K (x) - K (W, R) \leq M (W : R) - (b - a) - K (R) + c_{0} = M (W : R) - Δ - K (R) + c_{0} .

(iv) Exponentiate and absorb constants. Exponentiating and using

2^{- K (R)} \leq 1

gives

P ((W, R) ∣ x, E_{b}^{R}) \leq C 2^{M (W : R)} 2^{- Δ}

for a constant C absorbing

2^{c_{0}}

and the wrapper Coding-Theorem constants. □

Clarifications:

(i) Where does b appear? Only via Lemma 2, which says the OFF run lower-bounds

K (W)

. We never need to compute b explicitly. (ii) Why can we drop

2^{- K (R)}

? A slightly sharper bound is

P ((W, R) ∣ x, E_{b}^{R}) \leq C 2^{M (W : R)} 2^{- Δ} 2^{- K (R)}

. Since

K (R) \geq 0

, dropping

2^{- K (R)} \leq 1

keeps the focus on the two interpretable scalars M and

Δ

without changing the exponential scaling. (iii) Architecture-agnostic. The proof only uses the computable wrapper

(W, R, N) \mapsto x

. Whether R is open- or closed-loop does not affect the posterior algebra. (iv) The posterior on the left of Theorem 2 is conditioned on the on-case observation x only. The off-case run is used solely to supply a numeric lower bound

b : = K (O_{W, ⌀}^{(N)})

, which implies

K (W) \geq b - O (1)

by simulation. Formally, we phrase the result as a bound on

Pr ((W, R) ∣ x, E_{b}^{R})

, where

E_{b}^{R}

is the side-event “

K (O_{W, ⌀}^{(N)}) = b

”, i.e., the event denoted

E_{b}^{⌀}

in Equation (4).

As a consequence of Theorem 2, one can bound individual posterior masses by

O (2^{K (x) - K (W, R)})

. This implies an exponential tail:

Pr \{M (W : R) \leq Δ - k\} = O (2^{- k})

. In other words,

M (W : R)

is concentrated within

O (1)

of its maximum

Δ

. I.e., there exists

C^{'} > 0

(machine/wrapper dependent only) such that for all integers

k \geq 0

,

Pr \{M (W : R) \leq Δ - k | x, E_{b}^{R}\} \leq C^{'} 2^{- k} .

How to read (and use) Theorem 2:

What we measure: compute the on/off complexities $a = K (O_{W, R}^{(N)})$ and $b = K (O_{W, ⌀}^{(N)})$ (in practice: fixed MDL code lengths); their difference $Δ = b - a$ is the compressibility advantage.
What the bound says: for any explanation $(W, R)$ of the observed x, the universal posterior weight is penalized as $2^{- Δ}$ unless the pair shares structure: larger $M (W : R)$ compensates the penalty.
Practical rule of thumb: sustained large $Δ$ across tasks makes low $M (W : R)$ exponentially unlikely. If off-case b is already small, $Δ$ will be small—choose a diagnostic readout so the null is not trivially simple.
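The reading guide above amounts to simple arithmetic on codelengths. The sketch below evaluates the logarithm of the Theorem 2 bound and of the tail bound for given numbers; the constants C and C' are unknown in practice, so they are exposed as parameters defaulting to 1 (an assumption), and all numeric inputs are made up.

```python
def log2_posterior_bound(m_wr: float, delta: float, log2_c: float = 0.0) -> float:
    """log2 of C * 2^M(W:R) * 2^-Delta, capped at 0 because it bounds a probability."""
    return min(0.0, log2_c + m_wr - delta)


def log2_tail_bound(k: float, log2_c_prime: float = 0.0) -> float:
    """log2 of C' * 2^-k, the bound on Pr{M(W:R) <= Delta - k}, capped at 0."""
    return min(0.0, log2_c_prime - k)


# Made-up codelengths in bits (e.g. from a fixed MDL code).
a_bits, b_bits = 120.0, 200.0
delta = b_bits - a_bits                            # compressibility advantage

weak = log2_posterior_bound(m_wr=30.0, delta=delta)    # little shared structure
strong = log2_posterior_bound(m_wr=80.0, delta=delta)  # M matches the gap
tail = log2_tail_bound(k=20.0)
```

With these numbers a pair sharing only 30 bits is bounded by $2^{- 50}$, while a pair with $M (W : R) = Δ$ is not penalized at all, which is the trade-off the rule of thumb describes.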

3.3. Inferring the Objective Function and Planner (As-If Agent)

We next provide a simple theorem regarding the role of complexity as an objective function.

Theorem 3

(On/Off evidence equals unconditioned complexity gap). Under the universal a priori semimeasure,

{log}_{2} \frac{m (O_{W, R}^{(N)})}{m (O_{W, ⌀}^{(N)})} = K (O_{W, ⌀}^{(N)}) - K (O_{W, R}^{(N)}) \pm O (1) .

(8)

Equivalently, writing the on/off gap as

Δ : = K (O_{W, ⌀}^{(N)}) - K (O_{W, R}^{(N)}),

we have

m (O_{W, R}^{(N)}) / m (O_{W, ⌀}^{(N)}) = Θ (2^{Δ}) .

Hence, on the realized pair

(O_{W, R}^{(N)}, O_{W, ⌀}^{(N)})

, maximizing the likelihood of “ON over OFF” is equivalent (up to a constant factor) to minimizing

K (O_{W, R}^{(N)})

or, equivalently, maximizing the gap Δ.

Proof.

By the Coding Theorem there exist machine-dependent constants

c_{1}, c_{2} > 0

such that

c_{1} 2^{- K (z)} \leq m (z) \leq c_{2} 2^{- K (z)}

for any string z. Apply this to x and

O_{W, ⌀}^{(N)}

, take base-2 logs, and subtract:

- {log}_{2} m (O_{W, R}^{(N)}) = K (O_{W, R}^{(N)}) \pm O (1), - {log}_{2} m (O_{W, ⌀}^{(N)}) = K (O_{W, ⌀}^{(N)}) \pm O (1),

so

{log}_{2} \frac{m (O_{W, R}^{(N)})}{m (O_{W, ⌀}^{(N)})} = K (O_{W, ⌀}^{(N)}) - K (O_{W, R}^{(N)}) \pm O (1) .

□
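The size of the $O (1)$ term in Theorem 3 can be made explicit from the Coding Theorem constants. Writing $x = O_{W, R}^{(N)}$ and $y = O_{W, ⌀}^{(N)}$ as in Section 2, and dividing the two-sided bound for x by the one for y, a short worked step gives:

```latex
\frac{c_{1}}{c_{2}}\, 2^{\Delta}
\;\le\; \frac{m(x)}{m(y)}
\;\le\; \frac{c_{2}}{c_{1}}\, 2^{\Delta}
\quad\Longrightarrow\quad
\left| \log_{2} \frac{m(x)}{m(y)} - \Delta \right|
\;\le\; \log_{2} \frac{c_{2}}{c_{1}},
\qquad \Delta = K(y) - K(x).
```

The slack therefore depends only on the ratio $c_{2} / c_{1}$ of the machine-dependent constants and not on the strings or on the horizon N.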

This statement compares two different strings (the realized ON and OFF outputs) and aligns with the contrastive quantities used elsewhere. The log universal Bayes factor for “ON vs. OFF” is seen to equal the complexity gap

Δ \pm O (1) .

Thus, on each episode, a regulator behaves as if it were maximizing the scalar

Δ,

equivalently minimizing

K (O_{W, R}^{(N)})

.

Thus, given a regulator R that persistently reduces the readout’s complexity relative to a null baseline ⌀ (the GAR setting of Definition 2), we can justify—on purely observational grounds—that R behaves as if it were minimizing a scalar objective. The objective should be canonical (not post hoc) and usable across episodes/tasks.

4. Discussion

Classical control theory—especially in the LTI case—provides powerful constructive synthesis methods yielding transparent regulator architectures under explicit model classes. Our results are different in scope: they give a distribution-free, single-instance necessity/diagnostic statement. If regulation induces a nontrivial contrastive compressibility gap, then the regulator R must carry algorithmic information about the world W.

However, this theoretical necessity does not by itself provide a controller design algorithm, and its practical application relies on computable surrogates for Kolmogorov complexity, such as Lempel-Ziv compression [22] or neural autoencoders [30,31]. Furthermore, to detect agency in a biological or social system, or in digital life systems—such as Conway’s Game of Life [32] or continuous cellular automata like Lenia [33,34]—one must first tentatively define a “membrane” (Markov blanket [5,35,36]) that separates the putative agent R from its environment W. By inspecting the input-output stream across various candidate boundaries, we can identify agents as those subsystems that maximize the compressibility gap

Δ

. In this sense, while probabilistic and model-class-based approaches remain indispensable for constructive designs and performance guarantees, our AIT framework acts as an “outer layer” diagnostic that characterizes when “having a model” (in the information-theoretic sense) is unavoidable.

We now summarize our results:

First regulator result: posterior form, given the observed x (Theorem 1).

By Solomonoff induction and the Coding Theorem [20,21,25,37,38], we showed that

Pr ((W, R) ∣ x) = \frac{2^{- K (W, R) + O (1)}}{m (x)} \sim 2^{K (x) - K (W, R)} < \frac{1}{\tilde{c}} 2^{M (W : R)}

(9)

Thus shorter joint generators are exponentially preferred; every extra bit in

K (W, R)

halves the posterior weight. Decomposing

K (W, R) = K (W) + K (R) - M (W : R) \pm O (log)

(10)

shows that, at fixed marginals

K (W), K (R)

, the posterior is exponentially tilted in the algorithmic mutual information

M (W : R)

: each extra bit of

M (W : R)

multiplies posterior odds by

\approx 2

.

Second regulator result: posterior with contrast (Theorem 2).

Without contrast, the story is pure Occam: (9) anchors the posterior near

K (W, R) \approx K (x)

with a geometric excess-length tail; for fixed

K (W), K (R)

, this yields a high-probability lower bound on

M (W : R)

roughly

K (W) + K (R) - K (x)

. With contrast, if turning the regulator on yields

K (O_{W, R}^{(N)}) = a

while the off case has

K (O_{W, ⌀}^{(N)}) = b

with

b > a

, then any explaining

(W, R)

obeys

Pr ((W, R) ∣ x) \leq C 2^{M (W : R)} 2^{- Δ},

so low mutual information is exponentially disfavored as the gap

Δ = b - a

grows. In both regimes, the operational slogan holds: see a simple string (

K (x)

small), suspect a simple generator (

K (W, R)

small), and at fixed marginals, this means suspect larger

M (W : R)

.

The intuition behind these results is that seeing a simple string suggests its generation by a simple program. Formally, for the coupled hypothesis

P = (W, R)

(wrapped as a single self-delimiting program), observing

x = O_{W}^{(N)}

yields the Solomonoff posterior

Pr (P ∣ x) \sim 2^{K (x) - K (P)},

by the Coding Theorem [20,21,25,37,38]. Every extra bit of joint description

K (P) = K (W, R)

halves posterior weight. This is the quantitative Occam tilt that operationalizes the slogan above.

The posterior mass of joint programs longer than

K (x) + k

decays geometrically:

Pr {K (W, R) \geq K (x) + k ∣ x} \leq 2 C 2^{- k} .

Hence, the typical joint length is near

K (x)

. If

K (W)

and

K (R)

are externally constrained (e.g., by design or prior knowledge), this tail translates directly into a lower posterior bound on

M (W : R)

of the form

M (W : R) ≳ K (W) + K (R) - K (x) - O (log (1 / δ))

with posterior confidence

1 - δ

.

Our results are most informative when the observed readout

O_{W}^{(N)}

is simple. If

K (O_{W}^{(N)})

is large, the posterior constraints on joint complexity and on mutual information are inherently weak. From the geometric tail, for any

δ \in (0, 1)

there exists

k = ⌈ {log}_{2} (2 C / δ) ⌉

such that, with posterior probability at least

1 - δ

,

K (W, R) \leq K (O_{W}^{(N)}) + k .

At fixed marginals

K (W)

and

K (R)

this yields

M (W : R) \geq K (W) + K (R) - K (O_{W}^{(N)}) - k - O (log) \quad \text{with probability} \geq 1 - δ .

Hence, if

K (O_{W}^{(N)})

is large (comparable to

K (W) + K (R)

), the lower bound on

M (W : R)

may be trivial (near 0 up to logs). Intuitively, a complex output does not force shared structure. It is compatible with a complex joint generator even when W and R share little algorithmic information.
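The confidence statement above is easy to evaluate for given numbers. In the sketch below the constant C defaults to 1 and the $O (log)$ term is ignored, both of which are simplifying assumptions; the complexities passed in are made-up values in bits.

```python
import math


def tail_cutoff(delta: float, c: float = 1.0) -> int:
    """k = ceil(log2(2C / delta)) for a posterior confidence of 1 - delta."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.ceil(math.log2(2.0 * c / delta))


def mutual_info_lower_bound(k_w: float, k_r: float, k_x: float,
                            delta: float, c: float = 1.0) -> float:
    """K(W) + K(R) - K(x) - k, dropping the O(log) slack (an assumption)."""
    return k_w + k_r - k_x - tail_cutoff(delta, c)


# Made-up values: a simple readout versus a complex one.
simple_readout = mutual_info_lower_bound(k_w=500, k_r=300, k_x=100, delta=0.01)
complex_readout = mutual_info_lower_bound(k_w=500, k_r=300, k_x=790, delta=0.01)
```

With a simple readout the bound is large, while a readout nearly as complex as the two marginals together leaves almost nothing, matching the remark that a complex output does not force shared structure.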

On the other hand, the strength of the conclusion depends on the gap

Δ = b - a

:

Pr ((W, R) ∣ O_{W}^{(N)}, E_{b}^{R}) \leq C 2^{M (W : R)} 2^{- Δ}, Pr \{M (W : R) \leq Δ - k | O_{W}^{(N)}, E_{b}^{R}\} \leq C^{'} 2^{- k} .

Thus even if

a = K (O_{W, R}^{(N)})

is not very small, a large off/on gap still enforces a large posterior

M (W : R)

. In other words, contrast rescues identifiability of shared structure: the evidence scales exponentially in

Δ

.

In the same universal calculus, regulation carries a canonical scalar interpretation: runtime behavior is as if minimizing

K (O_{W}^{(N)})

(i.e., maximizing the on/off gap

Δ

), and design-time comparison across explanations favors larger

M (W : R) - Δ

via the GAR posterior tilt. This supplies an MDL/Occam objective grounded in the coding theorem (not an ad hoc utility) and complements the IMP’s structural requirements.

We note that a low

K (O_{W}^{(N)})

alone does not prove high

M (W : R)

; it concentrates posterior mass on short joint generators P. High

M (W : R)

follows (i) when

K (W)

and

K (R)

are fixed/known, or (ii) when contrast pins

K (W)

high via the off case. Without such constraints, short P could also arise from individually simple W and R.

Third regulator result: as-if objective-function minimization (Theorem 3).

On the realized

O_{W}^{(N)}

, the conditional Coding Theorem gives

{log}_{2} (m (O_{W}^{(N)}) / m (O_{W, ⌀}^{(N)})) = K (O_{W, ⌀}^{(N)}) - K (O_{W}^{(N)}) \pm O (1)

. Thus, the runtime scalar to minimize is

K (O_{W}^{(N)})

. Together with the above, this implies that the regulator is acting (as-if) like an algorithmic agent (with a model of the world, objective function and planner).

Theorem 3 is a representation statement—not a mechanism: R need not compute K, but persistent large

Δ

is exactly what maximizes universal evidence for “ON”, and it simultaneously makes low

M (W : R)

exponentially unlikely. For a mechanistic objective beyond the Minimum Description Length (MDL) evidence, three constructive routes are standard and complementary. First, in Linear Time-Invariant (LTI) plants the Internal Model Principle makes a structural claim—perfect robust regulation for a specified signal class requires embedding a dynamical copy of the exosystem in the controller—and optimal stabilizing designs arise from explicit quadratic/convex costs (e.g., the Linear Quadratic Regulator, LQR); in the nonlinear case, output-regulation theory yields constructive regulators under solvable regulator equations together with immersion/detectability and (local) zero-dynamics stability [13,14,15,17,18,19,39]. Second, in inverse optimal control and inverse reinforcement learning (IRL), trajectories that satisfy Karush–Kuhn–Tucker (KKT) regularity allow identification of a cost J (up to equivalences) whose minimizers reproduce the behavior; in discrete settings, IRL recovers reward functions consistent with observed policies [40,41,42]. Third, in revealed-preference analysis, if cross-episode choices satisfy the Generalized Axiom of Revealed Preference (GARP), Afriat and Varian guarantee the existence of a strictly increasing, concave utility that rationalizes the data, while Debreu’s representation and the Savage/Karni–Schmeidler frameworks provide (state-dependent) expected-utility forms under their axioms [43,44,45,46,47].

Planner/policy representation (as-if agent).

Any deterministic causal regulator R induces a computable policy

π_{R} : H_{t} \to A

mapping the coupled history

h_{t}

(past interface I/O up to time t) to the next actuator symbol. This is simply the operational semantics of R viewed as a function of histories.

The coding-theorem Bayes-factor identity (Theorem 3) supplies a canonical scalar J (given explicitly in the triad below) such that, on the realized episode, the sequence of actions produced by

π_{R}

is as if chosen to maximize J subject to the world dynamics. Together with the algorithmic “internal model” conclusion

M (W : R) > 0

(i.e.,

K (W ∣ R) < K (W)

), this yields the standard agent triad:

(model) M (W : R) > 0, (objective) J (x) = K (y) - K (x), (policy / planner) π_{R} .

Interpretation. This is a representation statement, not a claim that R explicitly solves an optimization problem or contains a modular planner. The existence of

π_{R}

is tautological for any deterministic R; the “as-if” objective follows from the universal evidence identity above. Across tasks/episodes, if the induced choices satisfy standard consistency axioms (e.g., GARP), classical revealed-preference theorems guarantee the existence of a (monotone, concave) utility that rationalizes the behavior [43,44]; and in dynamical settings, inverse optimal control/inverse RL constructs a cost for which the observed policy is (near-)optimal [40,41]. Thus, given (i) algorithmic model content

M (W : R) > 0

and (ii) the canonical scalar J from the coding-theorem calculus, interpreting the regulator as carrying a policy/planner is both natural and technically justified.

4.1. Why AIT Is Needed

Our results are single-episode and distribution-free: they make statements about an individual realized readout x and about the pair

(W, R)

as concrete programs, without positing a stochastic source. Classical (Shannon) information theory quantifies expected code lengths and mutual information with respect to a specified probability law; entropy

H (X)

and mutual information

I (X; Y)

are undefined without a distribution, and asymptotic statements (AEP/typical sets) further require ergodicity/mixing assumptions [23]. In our setting, there is no given probabilistic model over worlds, regulators, or outputs—indeed, the point is to infer model content from a single realized x.

AIT supplies exactly the missing calculus. First, it provides a canonical, machine-invariant complexity for individual strings,

K (x)

, and a universal a priori semimeasure

m (x)

(Solomonoff–Levin), connected by the Coding Theorem:

- log m (x) = K (x) \pm O (1)

[20,21,25]. This yields a universal Occam posterior over programs,

Pr (p ∣ x) ≍ 2^{K (x) - | p |},

from which (i) the geometric excess-length tail and (ii) our contrastive tilt bounds follow. No analogue exists in Shannon’s framework without positing an external prior over programs; there is no “canonical”

Pr (p)

or

Pr (x)

in Shannon theory.

Second, AIT lets us formalize “the regulator contains a model of the world” as algorithmic dependence, i.e., positive mutual algorithmic information

M (W : R) > 0

(equivalently

K (W ∣ R) < K (W)

), a notion defined for individual objects and invariant up to

O (1)

[9]. By contrast, Shannon’s

I (W; R)

requires a joint distribution over

(W, R)

, which is neither given nor natural here.

Third, our key inequalities explicitly use

m (\cdot)

and prefix complexity: the posterior tilt

2^{K (x) - K (W, R)}

, the OFF-run lower bound on

K (W)

by simulation, and the contrastive penalty

2^{- Δ}

all rely on the Coding Theorem and Kraft–McMillan properties of prefix programs—again, objects absent from Shannon’s ensemble-level calculus.

Finally, while one can approximate

K (\cdot)

with MDL/codelengths in practice, MDL’s justification itself rests on the AIT view that shorter descriptions are better and on the coding-theorem linkage between description length and (universal) probability [8]. In short: AIT provides the universal prior (m), object-level complexities (K), and mutual algorithmic information (M) needed to turn the informal slogan “see a simple string, suspect a simple generator” into posterior and contrastive theorems—none of which can be stated in Shannon’s framework without ad hoc model classes and priors.

4.1.1. Relation to the Internal Model Principle (IMP)

In the IMP, the closed loop is

(E, C, P)

: an autonomous exosystem E (no inputs and no explicit time dependence, e.g.,

\dot{w} = S w

), a controller C (the regulator), and a plant P. The regulated error is

e = r - y

, where the reference r and disturbances are generated by E and y is measured from P [13,14]. In our notation, we group the World as

W = (E, P)

and take the Regulator as

R \equiv C

(see Figure 2 and Table 2 for the comparison of the two frameworks in the case of a thermostat).

The assumptions in the IMP theorems are: (i) Classical necessity is sharpest for finite-dimensional LTI plants (linear, time-invariant) with exogenous signals generated by a finite-dimensional, neutrally stable LTI E; stabilizability/detectability and robustness (one fixed C works for a plant neighborhood) are standard [13,14]. (ii) The structural conclusion is internal-model necessity: perfect robust regulation for the specified signal class requires that C embed a dynamical copy of E (e.g., integrators for steps, oscillators for sinusoids); in MIMO, a p-copy is needed. (iii) Nonlinear generalizations (output regulation) require solvability of the regulator equations, suitable immersion/detectability, and (local) stability of the zero dynamics; guarantees are typically local/semiglobal, and necessity is not universal [17,18,19]. (iv) Infinite-dimensional/distributed settings and periodic signals may require infinite-dimensional internal models; technicalities arise with unbounded I/O operators [16].

In the AIT formulation (here), we assume: (i) Architecture-agnostic: no required split into E vs. P, and no specified place where R enters the causal path; we only assume a computable wrapper mapping

(W, R, N) \mapsto O_{W}

for a fixed horizon N. (ii) Deterministic, closed coupling of world and regulator (no stochastic noise sources into W); statements are distribution-free and about the realized sequence. (iii) “Model” means algorithmic dependence:

M (W : R) > 0

(equivalently

K (W | R) < K (W)

), not a literal dynamical replica. (iv) The main necessity is probabilistic: a positive on/off complexity gap

Δ = K (O_{W, ⌀}) - K (O_{W, R})

exponentially tilts the universal posterior against explanations with small

M (W : R)

; no linearity, smoothness, or regulator-equation conditions are imposed. See Sections 2–5 and Appendix A of this work.

IMP yields a structural necessity (internal model in C of E) under explicit dynamical hypotheses; the AIT formulation yields an information-theoretic necessity (positive

M (W : R)

favored by the data) without assuming linearity, an

E / P

split, or a particular causal insertion point for R. The two are complementary: IMP is the backbone for constructive regulation in structured classes; the AIT view covers unstructured architectures and single episodes with a universal Occam calculus [13,14,15,16,17,18].

The home thermostat.

As an example, consider a home thermostat as a regulator/controller. Let P be the living room + heater dynamics (thermal capacitance, heat loss, delays) and E the exogenous processes (setpoint schedule, outdoor weather/solar, occupancy). The Internal Model Principle (IMP) states that exact output regulation for a specified signal class is possible only if the controller embeds a copy of the exosystem E that generates those signals (e.g., an integrator for steps, an oscillator for a fixed sinusoid); plant knowledge is used for stabilization/shaping, but the IMP necessity targets E itself [14,17]. In our AIT view, a regulator R is “good” when it makes the realized readout more compressible than a null baseline; a sustained compressibility gap implies that R shares computable structure with the whole world

W = (P, E)

:

M (W : R) = M ((P, E) : R) = M (P : R) + M (E : R ∣ P) \pm O (log) .

A simple on/off thermostat with a deadband tuned to the room time constant typically yields a bounded limit cycle (not zero steady-state error); under IMP, it lacks the needed internal model of constants (no embedded integrator), hence it does not achieve exact regulation of the “constant” class [14,28]. Nevertheless, in the AIT sense, it still qualifies as a regulator: its policy encodes a very compressed model spanning P (heating raises T, room inertia) and weak regularities in E (quasi-constant setpoint, slowly varying weather), giving

M (W : R) > 0

[11]. PI/PID or predictive thermostats remedy the IMP shortfall by embedding the appropriate internal model (and often explicit models of P and aspects of E) [28].
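To make the policy view of the on/off thermostat concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than part of the analysis above: the first-order room model, the parameter values (`loss`, `gain`, `t_out`), and the function names are hypothetical. It only illustrates that a deadband controller is a deterministic map from history to actuator symbol and that it settles into a bounded limit cycle instead of zero steady-state error; it makes no claim about the compressibility gap for this toy world.

```python
def thermostat_policy(history, setpoint=20.0, deadband=1.0):
    """pi_R: map the history of (temperature, action) pairs to the next action in {0, 1}."""
    temp, last_action = history[-1]
    if temp < setpoint - deadband:
        return 1  # too cold: heater on
    if temp > setpoint + deadband:
        return 0  # too warm: heater off
    return last_action  # inside the deadband: keep the previous action


def null_policy(history):
    """The null regulator: always emit the quiescent symbol 0."""
    return 0


def run_episode(policy, n_steps=200, t0=20.0, t_out=5.0, loss=0.1, gain=2.0):
    """Toy plant P (first-order heat loss plus heater) with a constant exosystem E."""
    temp, action = t0, 0
    history = [(temp, action)]
    readout = []
    for _ in range(n_steps):
        action = policy(history)
        temp = temp + loss * (t_out - temp) + gain * action
        history.append((temp, action))
        readout.append(temp)
    return readout


on = run_episode(thermostat_policy)   # closed loop with the thermostat
off = run_episode(null_policy)        # null baseline
```

In this toy setting the regulated temperature keeps oscillating around the setpoint within a band a few degrees wide, while the unregulated room relaxes to the outdoor temperature.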

The AIT regulator framework (as well as the original GRT) is therefore more general than IMP: the regulator must carry a model of the world

W = E \cup P

, where P is the plant (house/HVAC thermodynamics) and E the exogenous processes (setpoint schedule, weather/solar/occupancy), and IMP is recovered as a special case when the performance target is exact output regulation over a specified signal class. Under IMP, a controller qualifies for exact regulation only if it embeds a dynamical copy of the exosystem that generates the reference/disturbances (e.g., an integrator for steps, oscillators for sinusoids)—no model of P is required beyond stabilizability/detectability [14]; nonlinear output regulation extends this under additional immersion/detectability and regulator-equation solvability assumptions [17].

Our statements are thus complementary and distinct: in AIT, we work in a distribution-free, program-level setting and make no linearity or smoothness assumptions. We remain agnostic about what the regulator needs to model and do not demand exact regulation. We do not assert the existence of a dynamical replica inside R. Instead, we show that sustained contrastive compressibility (

Δ > 0

) tilts the universal posterior toward pairs

(W, R)

with larger mutual algorithmic information

M (W : R)

, i.e., R carries algorithmic structure about W. Thus, “the regulator contains a model” is made precise as

M (W : R) > 0

(information-theoretic dependence), not as an embedded exosystem. The IMP supplies structural necessity for perfect regulation within specified signal classes; our AIT results supply information-theoretic necessity for observed compressibility advantages, beyond linearity or probabilistic assumptions [15].

4.1.2. Practical Estimation of K and the Gap Δ

Our theorems are stated in terms of prefix Kolmogorov complexity, which is not computable. In practice, one can fix a reference prefix code C and estimate upper bounds,

\hat{a} : = L_{C} (O_{W, R}^{(N)}), \hat{b} : = L_{C} (O_{W, ⌀}^{(N)}), \hat{Δ} = \hat{b} - \hat{a},

with the same compressor C used across all conditions. Persistent

\hat{Δ} > 0

across tasks is cumulative evidence that explanations with low

M (W : R)

are exponentially unlikely; maximizing

\hat{Δ}

is the natural scalar objective that the regulator appears to optimize on the observed data.
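A minimal sketch of this estimate, assuming zlib (a DEFLATE/LZ77-type compressor from the Python standard library) as the fixed reference code C, is given next. The helper names and the two synthetic readouts are illustrative assumptions; the synthetic strings stand in for real ON and OFF recordings and carry no empirical meaning.

```python
import random
import zlib


def code_length_bits(data: bytes, level: int = 9) -> int:
    """Upper-bound proxy for K: length in bits of the zlib-compressed readout."""
    return 8 * len(zlib.compress(data, level))


def estimated_gap(readout_on: bytes, readout_off: bytes) -> int:
    """Delta-hat = b-hat - a-hat, using the same compressor for both conditions."""
    return code_length_bits(readout_off) - code_length_bits(readout_on)


# Synthetic stand-ins on the same horizon N (assumed data, for illustration only).
n = 4096
rng = random.Random(0)
readout_off = bytes(rng.getrandbits(8) for _ in range(n))  # irregular baseline
readout_on = bytes([0, 1] * (n // 2))                      # highly regular readout

gap = estimated_gap(readout_on, readout_off)
```

Because the compressor adds a fixed header and cannot reach K, the estimate is only meaningful as a paired comparison with identical settings across conditions.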

Standard choices for upper-bounding Kolmogorov complexity include Lempel–Ziv compressors (LZ77/LZ78/LZW). LZ-type compressors are universal in a weak sense for stationary ergodic sources and are widely available. Implementations (gzip, lz4, etc.) are practical proxies for

L_{C} (\cdot)

[22,48]. If both ON and OFF strings are available and a scale-free sanity check of contrast is needed, we can compute

NCD (x, y) : = \frac{C (x y) - min {C (x), C (y)}}{max {C (x), C (y)}},

where

C (\cdot)

is the chosen code length and

x y

is concatenation [49,50]. NCD is heuristic but can reveal whether x is “closer” to trivial baselines than y.
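The NCD formula above can be sketched in a few lines, again assuming zlib as the chosen code; the function names and the test strings are illustrative assumptions.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Chosen code length C(.): size in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with concatenation x y."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(1)
regular = b"ab" * 500                                   # trivial, periodic string
noise = bytes(rng.getrandbits(8) for _ in range(1000))  # irregular string

self_distance = ncd(regular, regular)   # small: the second copy adds little
cross_distance = ncd(regular, noise)    # close to 1: almost nothing is shared
```

Real compressors are neither idempotent nor symmetric, so NCD values are approximate and should be compared only within one compressor configuration.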

The Block Decomposition Method (BDM) estimates K by tiling a string (or array) into small blocks whose complexities are looked up from Coding-Theorem-Method (CTM) tables (exhaustive output frequency statistics of small machines), plus a logarithmic penalty for multiplicities,

{\hat{K}}_{BDM} (x) \approx \sum_{i} [K_{CTM} (b_{i}) + {log}_{2} m_{i}],

where

b_{i}

are distinct blocks and

m_{i}

their multiplicities (see [51,52]). This is sensitive to small-scale algorithmic regularities beyond LZ’s parse statistics; it works on 1D/2D data (but depends on the chosen CTM table—size and machine model—and it suffers from boundary/tiling effects and additive constants that can be sizable for short N).
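The structure of the BDM sum can be sketched as follows. This is a toy under loud assumptions: the lookup table `TOY_CTM` contains made-up placeholder numbers, not real CTM values, and the choice to drop a trailing partial block is just one of several possible boundary conventions. A real application would substitute published CTM tables.

```python
import math
from collections import Counter


def bdm_estimate(x: str, block_size: int, ctm_table: dict) -> float:
    """Sum over distinct blocks of K_CTM(block) + log2(multiplicity)."""
    # Non-overlapping tiling; a trailing partial block is dropped (assumed convention).
    blocks = [x[i:i + block_size] for i in range(0, len(x) - block_size + 1, block_size)]
    counts = Counter(blocks)
    return sum(ctm_table[block] + math.log2(mult) for block, mult in counts.items())


# Placeholder complexities for 2-bit blocks (hypothetical values, NOT real CTM data).
TOY_CTM = {"00": 2.0, "11": 2.0, "01": 3.0, "10": 3.0}

repetitive = bdm_estimate("0000000011", 2, TOY_CTM)  # blocks: 00 x4, 11 x1
varied = bdm_estimate("0001101100", 2, TOY_CTM)      # blocks: 00 x2, 01, 10, 11
```

With these placeholder values the repetitive string scores 6.0 and the varied one 11.0: repeated blocks are charged only logarithmically in their multiplicity.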

Finally, alternatives include learned compressors based on neural networks. Autoencoder and variational-autoencoder codecs optimize a rate–distortion (thus MDL) objective, with an explicit codelength view via ELBO and practical lossless coding through bits-back [8,30,53,54,55]. In images and video, end-to-end trained autoencoders, hyperpriors, and autoregressive priors are now standard [31,56,57]. More recently, diffusion models have emerged as a powerful paradigm for high-fidelity perceptual compression, outperforming GANs and VAEs in realism at low bitrates [58,59]. Transformer-based compressors are also rapidly improving, both for images via hybrid Transformer–CNN codecs [60,61] and for general lossless compression using language-model predictors [62,63]. For a comprehensive benchmark of neural lossless compressors, see [64]. See also cross-modal results reported with large models [65] and neural codecs for audio, which now leverage foundation model representations [66,67]. From an MDL perspective, these models implement universal codes whose lengths upper-bound the negative log-likelihood under the learned generative model.

To improve discrimination, we can (i) use paired ON/OFF measurements on the same horizon N; report

\hat{Δ}

and its sampling variability across repeats/seeds; (ii) include trivial controls (e.g., all-zero regulator and randomized regulator) to sanity-check that

\hat{Δ}

responds in the expected direction; (iii) for finite N, complement point estimates with nonparametric tests (paired permutations on

\hat{Δ}

across episodes); (iv) when outputs are multivariate/real-valued, discretize with a fixed, reported quantization and alphabet before compression.
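Items (iii) and (iv) can be sketched as below: a fixed uniform quantizer applied before compression, and an exact one-sided sign-flip (paired permutation) test on per-episode gaps. The function names, the quantizer design, and the example gap values are illustrative assumptions; the gaps listed are invented numbers used only to exercise the code. Exhaustive enumeration costs 2^n evaluations, so it suits a small number of episodes; a Monte Carlo variant would be needed beyond that.

```python
import itertools


def quantize(values, lo: float, hi: float, levels: int = 256) -> bytes:
    """Uniformly quantize real values in [lo, hi] to symbols 0..levels-1 (levels <= 256)."""
    span = hi - lo
    out = bytearray()
    for v in values:
        clipped = min(max(v, lo), hi)
        out.append(min(levels - 1, int((clipped - lo) / span * levels)))
    return bytes(out)


def sign_flip_p_value(gaps) -> float:
    """Exact one-sided paired permutation test for sum(gaps) > 0.

    Under the null, ON and OFF labels are exchangeable within each episode,
    so every gap is equally likely to carry either sign.
    """
    observed = sum(gaps)
    n_total = 0
    n_extreme = 0
    for signs in itertools.product((1, -1), repeat=len(gaps)):
        n_total += 1
        if sum(s * g for s, g in zip(signs, gaps)) >= observed:
            n_extreme += 1
    return n_extreme / n_total


symbols = quantize([0.0, 0.5, 1.0], lo=0.0, hi=1.0, levels=4)
# Invented per-episode gap estimates in bits (placeholder values, not measurements).
p_value = sign_flip_p_value([120, 96, 130, 88, 104, 110])
```

With six episodes that all show a positive gap, only the unflipped labeling reaches the observed sum, so the smallest attainable p-value 1/64 is returned.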

5. Conclusions

We developed a contrastive, algorithmic formulation of regulation: a regulator R is good for a world W at horizon N when it yields a compressible readout that is strictly more compressible than under a null baseline ⌀. This places the GRT claim (“good regulators are models”) on an AIT footing.

If switching a regulator on makes a system’s measured output much simpler to describe (i.e., more compressible) than when the regulator is off, then the regulator is very likely to carry non-trivial information about the world it controls—in the precise Algorithmic Information Theory sense of positive mutual algorithmic information between world and regulator. The strength of this evidence grows exponentially with the compressibility gap: large

Δ

makes explanations with little shared structure vanishingly likely. Practically, this turns the old cybernetics slogan “every good regulator is a model of the system” into a quantitative, testable claim that does not assume linearity, stochastic models, or specific architectures. On each run, the theorem also singles out a canonical scalar objective: the regulator behaves as if it were minimizing the description length of the realized readout (equivalently, maximizing

Δ

).

Probabilistically, if W and R are independently sampled minimal programs (no mutual information), then low readout complexity—and especially the contrastive event “low under R, high under ⌀”—is exponentially unlikely in

| W |

and

| R |

. Thus, sustained compressibility relative to baseline is strong evidence that R shares non-trivial algorithmic structure with W (

M (W : R) > 0

). This is the AIT face of the Good Regulator idea and complements the Internal Model Principle’s structural necessity results for classical regulation: the IMP identifies structural necessities for perfect/robust regulation in classical settings, whereas our AIT view applies beyond linearity and probability and turns regulation into a statement about description length. This bridge clarifies in what limited (yet precise) sense the cybernetics aphorism “good regulators must model” can be made rigorous [11,12]: successful regulation implies positive mutual algorithmic information between world and regulator.

The result supplies: (i) a distribution-free, single-episode diagnostic for “does the controller contain a model?”, (ii) a complement to the IMP (which requires embedding a copy of the signal generator under more restrictive and structured assumptions), and (iii) a simple experimental recipe—fix a lossless compressor, quantize the readout, compute two code lengths (ON vs. OFF), and use their difference

Δ

as evidence of model content in the controller.

Finally, the coding-theorem view identifies a canonical scalar and implicates a planner: runtime minimization of

K (x)

(equivalently, maximization of

Δ

).

Altogether, these results justify the following inference: if a system is seen to regulate another in the algorithmic sense (reducing the complexity of an output of the regulated system compared to no regulation), then the regulator likely uses a model of the regulated system and an associated scalar objective function.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The author thanks Francesca Castaldo for discussions and reviewing the manuscript. The author thanks David Wolpert (SFI) for highlighting open questions regarding the classical regulator theorem.

Conflicts of Interest

Author Giulio Ruffini was employed by the company Neuroelectrics. The author declares that there is no conflict of interest.

Abbreviations

KT	Kolmogorov Theory (of consciousness)
AIT	Algorithmic Information Theory
AIF	Active Inference
GRT	Good Regulator Theorem
IMP	Internal Model Principle
MDL	Minimum Description Length
LTI	Linear Time-Invariant
IOC	Inverse Optimal Control
IRL	Inverse Reinforcement Learning

Appendix A

Appendix A.1. Setting and Core Definitions

Universal machine and prefix complexity.

Fix a universal prefix Turing machine U. For any finite binary string x,

K (x) : = min {| p | : U (p) = x}, m (x) : = \sum_{p : U (p) = x} 2^{- | p |} .

By the Coding Theorem there exist machine–dependent

c_{1}, c_{2} > 0

with

c_{1} 2^{- K (x)} \leq m (x) \leq c_{2} 2^{- K (x)}

[9,21,25,37].

Conditioning convention.

A finite horizon

N \in \mathbb{N}

is fixed throughout; unless stated otherwise, all complexities are implicitly conditioned on N, e.g.,

K (x) : = K (x ∣ N)

and

m (x) : = m (x ∣ N)

.

Machines and transcripts.

A world W and regulator R are deterministic causal prefix programs that interact for N steps via interface tapes. Their closed-loop interaction produces a binary readout

x = O_{W, R}^{(N)} \in \{0, 1\}^{N}

. The off/null regulator, denoted ⌀, is the coupling where the regulator’s interface outputs a fixed quiescent symbol (e.g., 0) at all steps, yielding

y = O_{W, ⌀}^{(N)}

.

Joint description and wrapper.

A fixed constant-overhead wrapper decodes shortest descriptions of

(W, R)

and simulates the coupling to print

O_{W, R}^{(N)}

. Denote by

K (W, R)

the length of a shortest self-delimiting code for the pair. We use standard chain rules (e.g.,

K (W, R) = K (W) + K (R ∣ W) \pm O (1)

).

Mutual algorithmic information.

For finite strings

x, y

,

M (x : y) : = K (x) + K (y) - K (x, y) \pm O (log (K (x) + K (y))) .

Equivalently,

M (x : y) = K (x) - K (x ∣ y) \pm O (log)

[9].

Good Algorithmic Regulator (contrastive).

Let

a : = K (O_{W, R}^{(N)})

and

b : = K (O_{W, ⌀}^{(N)})

. Define the gap

Δ : = b - a .

We say that R is a good algorithmic regulator for W at horizon N if

Δ > 0

. (In practice, a and b are estimated by fixed MDL codelengths; see Section 4.1.2.)

Deterministic upper bound.

Since the wrapper simulates the coupling, one always has

K (O_{W, R}^{(N)}) \leq K (W, R) \leq K (W) + K (R) - M (W : R) + O (1) .

Appendix A.2. Three-Tape Turing Machine

Definition A1

(Three-Tape Turing Machine Algorithm). A three-tape Turing machine algorithm is represented by a Turing machine T with three tapes, and consists of the following components:

1. A finite set of states Q, including a designated start state $q_{0}$ and one or more halting states.
2. A finite alphabet Σ, including a blank symbol, used for the input, output, and private tapes.
3. Three finite tapes, divided into cells, where each cell can contain a symbol from Σ. These tapes are designated as the input tape, the output tape, and the non-erase private tape.
4. A transition function $δ : Q \times Σ^{3} \to Q \times Σ^{3} \times \{L, R\}^{3}$, defining how the machine moves between states, writes symbols on the three tapes, and moves the tape heads left (L) or right (R) on each tape.

We further identify the contents of the private and output tapes with a set of variables

V = {v_{1}, v_{2}, \dots, v_{n}}

, with subsets

V_{private}

and

V_{output}

. The time evolution of variables in V is governed by the operation of the Turing machine, as it processes the input, modifies the private tape

V_{private}

, and writes to the output tape

V_{output}

, according to δ. So we can also see an algorithm as a specification of the evolution of a set of variables.

The Turing machine begins in the start state with the input written on the input tape and the other tapes blank. It proceeds according to the transition function, writing into the output and private tapes. The private tape can be written to but not erased. When the machine reaches a halting state, the output is read from the output tape.
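Definition A1 can be exercised with a small simulator. The sketch below is one possible reading of the definition, with several assumptions: tapes are stored as dictionaries so they can grow as needed, the non-erase rule is enforced by refusing to overwrite a non-blank private cell with a different symbol, and the example transition table `COPY` (copy the input to the output while logging one mark per step on the private tape) is a hypothetical machine invented for illustration.

```python
BLANK = "_"


def run_three_tape(delta, start, halting, input_str, max_steps=10_000):
    """Simulate delta: Q x Sigma^3 -> Q x Sigma^3 x {L, R}^3 on (input, output, private)."""
    tapes = [dict(enumerate(input_str)), {}, {}]  # input, output, private
    heads = [0, 0, 0]
    state, steps = start, 0
    while state not in halting:
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted before halting")
        read = tuple(tape.get(head, BLANK) for tape, head in zip(tapes, heads))
        state, write, moves = delta[(state, read)]
        # Non-erase private tape: a written cell may not be changed afterwards.
        if read[2] != BLANK and write[2] != read[2]:
            raise ValueError("private tape is non-erase")
        for i in range(3):
            tapes[i][heads[i]] = write[i]
            heads[i] += 1 if moves[i] == "R" else -1
        steps += 1

    def contents(tape):
        return "".join(tape[i] for i in sorted(tape) if tape[i] != BLANK)

    return contents(tapes[1]), contents(tapes[2])


# Hypothetical example machine: copy each input bit and log a mark privately.
COPY = {
    ("copy", (s, BLANK, BLANK)): ("copy", (s, s, "1"), ("R", "R", "R"))
    for s in "01"
}
COPY[("copy", (BLANK, BLANK, BLANK))] = ("halt", (BLANK, BLANK, BLANK), ("R", "R", "R"))

output, private = run_three_tape(COPY, "copy", {"halt"}, "1011")
```

On input 1011 the machine halts with the input reproduced on the output tape and four marks on the private tape, one per processed symbol.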

Appendix A.3. Prefix-Free Programs vs. Stop-Symbol Delimiters (And Why It Matters)

Let U be a universal prefix machine: the domain of its halting programs is prefix-free, so no valid program is a prefix of another. The associated (prefix/self-delimiting) Kolmogorov complexity is

K_{U} (x) = min {| p | : U (p) = x and p is in a prefix-free domain} .

Working with prefix-free domains aligns program lengths with instantaneous (prefix) codes and invokes the Kraft–McMillan inequality, the coding-theoretic backbone that underlies many AIT results, including Levin’s universal distribution and the coding theorem [9,68,69]. (See also standard IT references for Kraft–McMillan and prefix codes [70,71].)

Why prefix-freeness is not a mere technicality.

Instantaneous decodability and Kraft sums. If the halting programs form a prefix code, then for the multiset of program lengths ${| p | : U (p) ↓}$ we have $\sum_{p} 2^{- | p |} \leq 1$ by Kraft–McMillan. This lets us interpret $2^{- | p |}$ as a valid “budget” of probability mass per description and leads to semimeasures like Levin’s universal distribution $m_{U} (x) = \sum_{U (p) = x} 2^{- | p |}$ with $\sum_{x} m_{U} (x) \leq 1$ . This construction is central to algorithmic probability and to the coding theorem (roughly $K (x) \approx - log m (x)$ ) [9,68,72,73].
Clean invariance and chaining inequalities. The invariance theorem (machine-independence of K up to $O (1)$ ) and standard chain rules (e.g., $K (x, y) \leq K (x) + K (y ∣ x) + O (1)$ ) are most naturally proved for prefix machines because self-delimitation removes end-of-program ambiguity in compositions and conditional encodings [9,69].

“Why not just add a stop symbol?”

Suppose we try to avoid the prefix constraint by allowing programs of the form

p #

, where # is an end marker.

If the interpreter ignores any trailing bits after #, then any extension $p # q$ yields the same computation as $p #$ . To keep the domain of halting programs unambiguous, you must reject all extensions $p # q \neq p #$ . But rejecting all such extensions is exactly the prefix-free condition in disguise: no valid codeword is a prefix of another. Thus, a well-implemented “stop-symbol” machine reduces to a prefix-free machine up to a fixed additive overhead for encoding #. Consequently, all asymptotic theorems (invariance, coding theorem, bounds using Kraft) remain unchanged up to $O (1)$ [9,68,69].
If extensions after # are allowed as distinct valid programs, then the set of halting inputs is not prefix-free, Kraft–McMillan can fail, and the sum $\sum_{U (p) = x} 2^{- | p |}$ need not be bounded by 1. This breaks the semimeasure property essential to Levin’s universal distribution and derails the clean link between probability and description length [72,73]. In short: allowing arbitrary padding after a nominal “stop” symbol undermines the probability calculus that AIT relies on.
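The contrast between the two cases can be checked numerically on toy program sets. In the sketch below, programs are plain binary strings (any end marker is assumed to be already encoded in the bits), and both example sets are invented for illustration: the first is self-delimiting, the second treats the word 0 and its extensions 00 and 01 as distinct valid programs.

```python
from fractions import Fraction


def is_prefix_free(codewords) -> bool:
    """True if no codeword is a proper prefix of another."""
    words = set(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)


def kraft_sum(codewords) -> Fraction:
    """Exact value of the Kraft sum: sum over codewords of 2^(-length)."""
    return sum(Fraction(1, 2 ** len(w)) for w in set(codewords))


self_delimiting = {"0", "10", "110"}   # no valid program extends another
padded = {"0", "00", "01", "1"}        # extensions of "0" also count as programs

ok_sum = kraft_sum(self_delimiting)    # stays within the unit budget
bad_sum = kraft_sum(padded)            # exceeds 1, so 2^(-|p|) is no longer a semimeasure
```

The first set has Kraft sum 7/8, while the second sums to 3/2, which is the failure mode described above.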

Implications for our results.

All conclusions in this paper that rely on (i) the coding-theoretic view of programs, (ii) semimeasures like

m_{U}

, or (iii) standard chain/invariance bounds continue to hold if one uses a stop-symbol formalism implemented so that descriptions are self-delimiting in the sense above. That formalism is equivalent to the prefix-free setting up to

O (1)

and thus does not change the substance of our arguments or their asymptotic constants. If, however, the stop-symbol scheme admits padded extensions as distinct valid programs, key lemmas using Kraft (and hence bounds derived via

m_{U}

or coding-theorem arguments) may fail or require nonstandard fixes.

Takeaway.

The “prefix business” is not a dispensable technicality; it encodes self-delimitation that makes programs behave like instantaneous codewords. You can implement self-delimitation via explicit markers, but only if you simultaneously forbid any valid extension after the marker—i.e., you recover a prefix-free domain. With that in place, none of the conclusions elsewhere in the paper need to change (beyond harmless

O (1)

shifts). Without it, several probability/complexity identifications break.

Appendix A.4. Coding Theorems (Unconditional and Conditional)

Setup and notation.

Fix a universal prefix Turing machine U. All logarithms are base 2. For a finite string x, let

K (x)

be its (prefix) Kolmogorov complexity:

K (x) : = min {| p | : U (p) = x}

. The universal a priori semimeasure is

m (x) : = \sum_{p : U (p) = x} 2^{- | p |} .

Since the halting programs of a prefix machine form a prefix code, Kraft–McMillan implies

\sum_{x} m (x) \leq 1

.

For conditional versions, we equip U with a read-only auxiliary input tape that holds side information y. Define

K (x ∣ y) : = min {| p | : U (p, y) = x}, m (x ∣ y) : = \sum_{p : U (p, y) = x} 2^{- | p |} .

All

O (1)

terms and constants below depend only on the choice of U, never on x or y.

Theorem A1

(Coding Theorem (unconditional)). There exist machine-dependent constants

c_{1}, c_{2} > 0

such that for all finite strings x,

c_{1} 2^{- K (x)} \leq m (x) \leq c_{2} 2^{- K (x)} .

Equivalently,

- log m (x) = K (x) \pm O (1) .

Proof sketch.

Lower bound. Let

p^{⋆}

be a shortest program for x, so

| p^{⋆} | = K (x)

and

U (p^{⋆}) = x

. Then

m (x) \geq 2^{- | p^{⋆} |} = 2^{- K (x)}

(the constant

c_{1}

absorbs harmless machine choices).

Upper bound. Because

m (\cdot)

is a lower semicomputable semimeasure, there exists a prefix code with lengths

ℓ (x) \leq ⌈ - log m (x) ⌉

(Shannon–Fano/Kraft–McMillan). A fixed decoder transforms the codeword for x into x, so

K (x) \leq ℓ (x) + O (1) \leq - log m (x) + O (1)

. Rearranging gives

m (x) \leq c_{2} 2^{- K (x)}

. □

Theorem A2

(Coding Theorem (conditional)). There exist machine-dependent constants

c_{1}^{'}, c_{2}^{'} > 0

such that for all finite strings

x, y

,

c_{1}^{'} 2^{- K (x ∣ y)} \leq m (x ∣ y) \leq c_{2}^{'} 2^{- K (x ∣ y)} .

Equivalently,

- log m (x ∣ y) = K (x ∣ y) \pm O (1) .

Proof sketch.

Lower bound. With

p^{⋆}

a shortest conditional program for x given y, we have

U (p^{⋆}, y) = x

, hence

m (x ∣ y) \geq 2^{- | p^{⋆} |} = 2^{- K (x ∣ y)}

.

Upper bound. For fixed y,

m (\cdot ∣ y)

is a semimeasure, so there is a prefix code (depending on y) with

ℓ (x ∣ y) \leq ⌈ - log m (x ∣ y) ⌉

and a fixed decoder (shared across all y) that maps codewords plus y to x. Therefore

K (x ∣ y) \leq - log m (x ∣ y) + O (1)

, which rearranges to the stated upper bound. □

Remarks.

The constants $c_{1}, c_{2}, c_{1}^{'}, c_{2}^{'}$ (and all $O (1)$ slacks) depend only on the choice of the universal prefix machine U; changing U shifts $K (\cdot)$ by at most an additive constant (invariance theorem), which becomes a multiplicative constant on $m (\cdot)$ .
Theorems A1 and A2 are often summarized as $m (x) ≍ 2^{- K (x)}$ and $m (x ∣ y) ≍ 2^{- K (x ∣ y)}$ , read “within constant factors”.
Immediate corollaries used in the main text include the posterior under the universal prior: for any program p with $U (p) = x$ ,

$Pr {p ∣ x} = \frac{2^{- | p |}}{m (x)} \in [\frac{1}{c_{2}}, \frac{1}{c_{1}}] \cdot 2^{K (x) - | p |},$

and the geometric excess-length tail: $Pr {| p | \geq K (x) + k ∣ x} \leq C 2^{- k}$ for some constant $C > 0$ .
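The definitions of K, m, and the posterior can be illustrated on a toy "machine" given as a finite lookup table with a prefix-free domain. This is only a sketch: `TOY_MACHINE` is a hypothetical table invented for illustration, it is not universal, and it therefore shows the definitions and the lower bound m(x) ≥ 2^{-K(x)} rather than the full Coding Theorem.

```python
from fractions import Fraction

# Hypothetical prefix-free program table: program -> output (not a universal machine).
TOY_MACHINE = {"0": "a", "10": "b", "110": "a", "1110": "a", "1111": "b"}


def K(x: str) -> int:
    """Length of a shortest program printing x on the toy machine."""
    return min(len(p) for p, out in TOY_MACHINE.items() if out == x)


def m(x: str) -> Fraction:
    """A priori weight of x: sum of 2^(-|p|) over programs printing x."""
    return sum(Fraction(1, 2 ** len(p)) for p, out in TOY_MACHINE.items() if out == x)


def posterior(p: str, x: str) -> Fraction:
    """Pr{p | x} = 2^(-|p|) / m(x) for programs that print x, else 0."""
    if TOY_MACHINE.get(p) != x:
        return Fraction(0)
    return Fraction(1, 2 ** len(p)) / m(x)


def excess_tail(x: str, k: int) -> Fraction:
    """Posterior mass on programs for x with length >= K(x) + k."""
    return sum(posterior(p, x) for p in TOY_MACHINE if len(p) >= K(x) + k)
```

For the output a, the shortest program has length 1, the weight is m(a) = 11/16, the shortest program carries posterior 8/11, and programs at least two bits longer than the minimum carry the remaining 3/11.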

References.

Original sources and standard expositions: [9,20,21,25,37,38].

Appendix A.5. Why Many Long Descriptions Imply Compressibility, and Why Long Generators Are Unlikely

Fix a universal prefix Turing machine U. For a finite binary string x,

K (x) : = min_{p : U (p) = x} | p |

is (prefix) Kolmogorov complexity, and the Solomonoff–Levin a priori semimeasure is

m (x) = \sum_{p : U (p) = x} 2^{- | p |} .

The coding theorem (a.k.a. Levin’s theorem) states that there exist machine-dependent constants

c_{1}, c_{2} > 0

such that

c_{1} 2^{- K (x)} \leq m (x) \leq c_{2} 2^{- K (x)} .

(A1)

(Background: Solomonoff (1964) [20,25]; Zvonkin–Levin (1970) [21]; Vitányi (2013) [37]; Hutter (2007) [38]; Cover & Thomas (2006) [23]).

Appendix A.5.1. Multiplicity ⇒ Compression (Indexing Among Outputs)

For

L \in N

let

N_{\leq L} (x)

be the number of programs of length

\leq L

that output x.

Lemma A1

(Multiplicity compression). If

N_{\leq L} (x) \geq 2^{r}

, then

K (x) \leq L - r + O (log L) .

Proof idea (pedagogical).

Enumerate all programs of length

\leq L

in dovetailing fashion and record each distinct output when first seen; this yields a computable list

A_{L} = (x_{1}, x_{2}, \dots)

. Define the high-multiplicity set

B_{L, r} : = {x \in A_{L} : N_{\leq L} (x) \geq 2^{r}}

. Each

x \in B_{L, r}

“uses” at least

2^{r}

programs, and the total number of prefix programs of length

\leq L

is

< 2^{L + 1}

(Kraft inequality). Hence

| B_{L, r} | \leq \frac{2^{L + 1}}{2^{r}} = 2^{L - r + 1} .

Therefore

x \in B_{L, r}

is specified by: (i) a self-delimiting code for

(L, r)

costing

O (log L)

bits, and (ii) its index in

B_{L, r}

costing

\leq L - r + 1

bits. A fixed decoder reconstructs x from these data, yielding the stated bound on

K (x)

. □
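The counting step in the proof can be checked by brute force on a small example. The sketch below assumes a toy, non-universal prefix-free machine (programs of the form 1^j 0 w with |w| = j, whose output is the number of 1s in w); this machine and all function names are illustrative assumptions, chosen only so that several programs share an output.

```python
from collections import Counter
from itertools import product

def toy_programs(max_len):
    """Yield (program, output) pairs of a toy prefix-free machine.

    A program is 1^j 0 w with |w| = j (unary length header, then payload);
    its output is the number of 1s in w. This machine is NOT universal.
    """
    for j in range((max_len - 1) // 2 + 1):
        for bits in product("01", repeat=j):
            w = "".join(bits)
            yield "1" * j + "0" + w, w.count("1")

def multiplicities(max_len):
    """N_{<=L}(x): number of programs of length <= L that output x."""
    return Counter(out for _, out in toy_programs(max_len))

def high_multiplicity_set(max_len, r):
    """B_{L,r}: outputs with at least 2^r programs of length <= L."""
    return sorted(x for x, n in multiplicities(max_len).items() if n >= 2 ** r)

L, r = 13, 4
B = high_multiplicity_set(L, r)
# The pigeonhole bound |B_{L,r}| <= 2^(L - r + 1) from the proof.
print(B, len(B) <= 2 ** (L - r + 1))
```

On this toy machine the bound is far from tight, but the mechanism is the same as in the lemma: an element of B_{L,r} is identified by (L, r) plus an index into a set that shrinks as r grows.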

One-line “weight counting” variant.

Since every program of length

\leq L

that outputs x contributes at least

2^{- L}

to

m (x)

,

m (x) \geq N_{\leq L} (x) 2^{- L} \Rightarrow N_{\leq L} (x) \leq m (x) 2^{L} \leq c_{2} 2^{L - K (x)} \quad \text{by (A1)} .

Substituting

N_{\leq L} (x) \geq 2^{r}

and taking logarithms gives

K (x) \leq L - r + log c_{2}

, i.e., Lemma A1 with an

O (1)

slack in place of

O (log L)

.

Appendix A.5.2. Consequences for Posterior over Program Lengths

Let

N_{b} (x)

be the number of exactly b-bit programs with output x. Under the universal prior over programs,

Pr {p} = 2^{- | p |}

, observing x induces the posterior

Pr {| p | = b ∣ x} = \frac{\sum_{p : U (p) = x, | p | = b} 2^{- | p |}}{m (x)} = \frac{N_{b} (x) 2^{- b}}{m (x)} .

Bounding

N_{b} (x)

via

m (x) \geq N_{b} (x) 2^{- b}

and (A1) gives

N_{b} (x) \leq c_{2} 2^{b - K (x)}

. Combining with the lower bound

m (x) \geq c_{1} 2^{- K (x)}

yields the geometric decay with excess length:

Theorem A3

(Excess-length posterior decay). For all

b \geq K (x)

,

Pr {| p | = b ∣ x} \leq \frac{c_{2}}{c_{1}} 2^{- (b - K (x))} .

Equivalently, writing

b = K (x) + k

with

k \geq 1

,

Pr {| p | = K (x) + k ∣ x} \leq C 2^{- k} \quad \text{and} \quad Pr {| p | \geq K (x) + k ∣ x} \leq 2 C 2^{- k},

for a machine-dependent constant

C > 0

.
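The factor 2 in the tail bound comes from a geometric sum. As a short worked step, taking the pointwise bound of Theorem A3 as given and summing over all excess lengths j ≥ k:

```latex
\Pr\{|p| \geq K(x)+k \mid x\}
  = \sum_{j \geq k} \Pr\{|p| = K(x)+j \mid x\}
  \leq \sum_{j \geq k} C\,2^{-j}
  = C\,2^{-k} \sum_{i \geq 0} 2^{-i}
  = 2C\,2^{-k}.
```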

Interpretation.

Every extra bit beyond

K (x)

halves the posterior mass (up to a constant factor). Thus an observed output O with

K (O) = a

is very unlikely, a posteriori, to have been produced by a program of length

b ≫ a

: the posterior probability falls like

2^{- (b - a)}

.

Appendix A.5.3. Why Indexing Becomes Shorter When There Are Many Programs

The key to Lemma A1 is that we index outputs with many descriptions, not the descriptions themselves. As the multiplicity

N_{\leq L} (x)

grows by a factor of

2^{r}

, the set of such outputs shrinks by the same factor, so the index shortens by r bits; this directly yields the

L - r

bound. (See also exercises and discussion in Li–Vitányi, 4th ed., Chs. 2–3 [9] and Vereshchagin (2008) [24]).

Appendix A.5.4. Remarks

(i) Prefix complexity is essential: the domain of U is prefix-free, giving Kraft’s inequality and the well-defined prior

m (\cdot)

. (ii) Conditional variants follow verbatim: replace

K (\cdot)

by

K (\cdot ∣ y)

and

m (\cdot)

by

m (\cdot ∣ y)

(see Vitányi (2013) [37]). (iii) There is no uniform lower bound in k: for some x there may be no programs of some intermediate lengths due to prefix-freeness; Theorem A3 gives an essentially tight upper bound on the posterior mass at/above length

K (x) + k

.

Appendix A.6. Single-Episode Compressibility Is Non-Diagnostic

Intuitively, knowing that the regulator-world coupled system produces a low-complexity world output x reduces the set of possible worlds to select from. In turn, this allows for a shorter description of the world using R and the complexity bound of the output. The program may say: “To specify W, run the dynamics for all possible W-R pairs and delete all world model candidates with complex outputs (above the set complexity bound

K (x) < a

). Then use a reduced index to identify W”. This means that

K (W | R, “ K (x) < a ”) < K (W)

, which implies

M (W : R ∣ “ K (x) < a ”) > 0

.

Theorem A4

(Low-complexity output ⇒ strict but tiny shrinkage). Fix a universal prefix machine. Let

m : = | W |

and

r : = | R |

denote minimal code lengths, and let

N \geq m

. For a fixed regulator R and horizon N, consider the class

P_{m}

of all minimal m-bit world programs. Assume we only know that the closed-loop transcript has low complexity,

E_{a} : K (O_{W, R}^{(N)} ∣ R, N) \leq a,

for some threshold

a < m - c

, where

c = O (1)

is a machine-dependent constant. Claim. The set of candidates consistent with

E_{a}

is a strict subset of

P_{m}

:

S_{R, N, a} (m) : = \{W \in P_{m} : K (O_{W, R}^{(N)} ∣ R, N) \leq a\} ⊊ P_{m} .

Consequently,

K (W ∣ R, E_{a}) \leq {log}_{2} (| P_{m} | - 1) < {log}_{2} | P_{m} | = m \pm O (1),

i.e., strictly

K (W ∣ R, E_{a}) < K (W)

(by a vanishingly small amount).

Proof.

By Kleene’s recursion theorem (quines), there exists a program

W^{⋆} \in P_{m}

that prints its own source as the first m output bits and then halts (or pads). Hence

K (O_{W^{⋆}, R}^{(N)} ∣ R, N) \geq K (W^{⋆}) - O (1) = m - O (1) > a

, so

W^{⋆} \notin S_{R, N, a} (m)

. Therefore

S_{R, N, a} (m) ⊊ P_{m}

, implying

log | S_{R, N, a} (m) | < log | P_{m} | = m \pm O (1)

. □

To see how small the information gained can be, consider a world program W whose last line is “print

u \times O_{R}

,” where u is some computed world variable. If R simply outputs 0, the world output becomes the all-zeros string, hence very compressible. Knowing that R outputs 0 and that the world output is

0^{N}

does restrict the structure of the world program (it must include the final multiplication by the regulator output, or something similar on the realized trace), but that restriction can be tiny—the calculation of u may still be arbitrarily complex.

Although we have shown that R and

E_{a}

together share information with W, the amount may be very small in any given case and, more importantly, this does not imply that R and W by themselves share information. The chain rule gives

\begin{aligned} M (W : (R, E)) & = K (W) + K (R, E) - K (W, R, E) \\ & = K (W) + [K (R) + K (E ∣ R)] - [K (R) + K (W, E ∣ R)] \pm O (log) \\ & = \underbrace{K (W) + K (R) - K (W, R)}_{M (W : R)} + \underbrace{K (W ∣ R) + K (E ∣ R) - K (W, E ∣ R)}_{M (W : E ∣ R)} \pm O (log) . \end{aligned}

Thus, knowing that the coupled

(W, R)

system produces a low-complexity readout x in a single run strictly prunes the set of candidate worlds, but in the worst case this shrinkage is only

O (1)

and—critically—does not by itself imply

M (W : R) > 0

; it certifies at most

M (W : (R, E_{a})) > 0

via the chain rule.

Does contrast fix the non-probabilistic identifiability? Let

E_{a, b}

be the (contrastive) event

E_{a, b} : K (O_{W, R}^{(N)}) \leq a \quad \text{and} \quad K (O_{W, ⌀}^{(N)}) \geq b \quad (b > a) .

The deterministic shrinkage equals

K (W) - K (W ∣ R, E_{a, b}) = M (W : (R, E_{a, b})) \pm O (log),

and by the chain rule this splits as

M (W : (R, E_{a, b})) = M (W : R) + M (W : E_{a, b} ∣ R) \pm O (log) .

(A2)

Thus, from single-episode ON/OFF facts we can certify at most

M (W : (R, E_{a, b})) > 0

; in general this does not imply

M (W : R) > 0

, because the conditional term

M (W : E_{a, b} ∣ R)

can carry (almost) all the gain or because of synergy.

Furthermore, even if the mutual algorithmic information between world and regulator is null, it may be the case that coupling them leads to a reduction in complexity in the world output by chance.

These caveats motivate the probabilistic analysis in the paper.

We discuss in more detail the case of synergy, and also show that a decrease in complexity cannot certify mutual information in a particular case.

Chain Rule and a Synergy Counterexample

By the chain rule for mutual information,

M (W : (R, E_{a, b})) = M (W : R) + M (W : E_{a, b} ∣ R^{*}) + O (log n),

(A3)

where

R^{*}

is a shortest description of R (drop

R^{*}

and the

O (log n)

term in the Shannon case). (Algorithmic version: Li & Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 4th ed., Springer, 2019 [9]. Shannon version: Cover & Thomas, Elements of Information Theory, 2nd ed., Wiley, 2006 [23]).

Thus, observing that

M (W : (R, E_{a, b})) > 0

does not imply

M (W : R) > 0

, because the conditional term

M (W : E_{a, b} ∣ R)

can carry (almost) all of the gain.

Example A1

(XOR/synergy). Let

R, E_{a, b} \in {0, 1}^{n}

be independent, incompressible strings, and set

W = R \oplus E_{a, b}

(bitwise XOR). Then:

\begin{aligned} M (W : R) & \overset{+}{=} K (W) + K (R) - K (W, R) \\ & \overset{+}{\leq} K (W) - K (W ∣ R) \\ & \overset{+}{=} K (W) - K (E_{a, b} ∣ R) \\ & \overset{+}{\leq} O (log n), \end{aligned}

(A4)

because

E_{a, b} \mapsto W

is a bijection given R and independence gives

K (E_{a, b} ∣ R) \overset{+}{\geq} n

. In contrast,

\begin{aligned} M (W : (R, E_{a, b})) & \overset{+}{=} K (W) - K (W ∣ R, E_{a, b}^{*}) \\ & \overset{+}{\geq} n - O (log n), \end{aligned}

(A5)

since

K (W ∣ R, E_{a, b}) = O (1)

and

K (W) \overset{+}{\geq} K (W ∣ R) \overset{+}{\geq} n

. Hence, the conditional term

M (W : E_{a, b} ∣ R)

carries essentially all the information. For the Shannon analogue, take

R, E_{a, b} \sim Ber (\frac{1}{2})

i.i.d.; then

I (W; R) = 0

,

I (W; E_{a, b} ∣ R) = H (W) = n

, so

I (W; (R, E_{a, b})) = n

. (XOR–synergy as a canonical case in multivariate information: Williams & Beer (2010) [74]. For the identity

M (x : y) = K (x) - K (x ∣ y^{*}) + O (log)

used above, see Bennett, Gács, Li, Vitányi & Zurek (1998) [75].)
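The Shannon analogue of Example A1 can be verified exactly for a single bit position; for n i.i.d. positions every quantity is multiplied by n. The sketch below computes the three information terms from the joint distribution of (W, R, E) with W = R ⊕ E. The helper names are illustrative, and E stands for the variable written E_{a,b} in the text.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(pmf):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, idx):
    """Marginal pmf over the coordinates listed in `idx`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def mutual_info(joint, a, b, cond=()):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), in bits."""
    h = lambda idx: entropy(marginal(joint, idx))
    return h(a + cond) + h(b + cond) - h(a + b + cond) - h(cond)

# One bit position: R and E are independent fair coins, W = R xor E.
# Outcomes are ordered as (W, R, E).
joint = {(r ^ e, r, e): 0.25 for r, e in product((0, 1), repeat=2)}
W, R, E = (0,), (1,), (2,)

i_w_r = mutual_info(joint, W, R)             # I(W; R)
i_w_e_given_r = mutual_info(joint, W, E, R)  # I(W; E | R)
i_w_re = mutual_info(joint, W, R + E)        # I(W; (R, E))
print(i_w_r, i_w_e_given_r, i_w_re)
```

The pairwise term vanishes while the conditional term carries the full bit, which is the synergy pattern the example describes.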

Appendix A.7. Chance Simplification with M(W:R)≈0 Is Possible

Fix a universal prefix Turing machine U and a finite horizon N. All complexities are implicitly conditioned on N (we write

K (\cdot)

for

K (\cdot | N)

). Identify Turing machines with their shortest prefix codes and write

| W | = K (W)

,

| R | = K (R)

. The coupled world–regulator system produces a deterministic readout

x : = O_{W, R}^{(N)} \in {0, 1}^{N} .

There is no auxiliary map: a fixed, constant-overhead wrapper decodes

(W, R)

and simulates the interaction to print x (decode + simulate). Consequently

K (x) \leq K (W, R) + O (1) = K (W) + K (R) - M (W : R) \pm O (log N),

(A6)

and we use the standard identity

M (X : Y) = K (X) - K (X ∣ Y^{*}) \pm O (log)

. (See Equation (1) and the chain-rule algebra in §2–3 of the WP). (For textbook background on prefix complexity, chain rules, and

M (x : y) = K (x) - K (x ∣ y^{*}) \pm O (log)

, see Li & Vitányi (2019) [9], and Bennett et al. (1998) [75]).

For concreteness in the examples below we take

| W | = | R | = n

and set

N = n

; this is only for clarity (all statements have the obvious adjustments if

N \neq n

).

Claim (It can happen that $K (x)$ is small while $M (W : R) \approx 0$ ).

There exist pairs

(W, R)

with

M (W : R) = O (log n)

such that the coupled output

x = O_{W, R}^{(N)}

has very small complexity (e.g.,

K (x) = O (log N)

).

Construction (existence, uses only the

W + R

coupling). Fix a threshold

Δ \in {1, \dots, N}

. Define a world program

W_{Δ}

that monitors the first

Δ

symbols emitted by the regulator on the interface and then latches:

\text{if } O_{R} [1 : Δ] = 0^{Δ} \text{ then output } x = 0^{N}; \text{ else output a fixed incompressible } z \in {0, 1}^{N} .

Here z is hard-coded in

W_{Δ}

(so

K (z) \overset{+}{=} N

and

K (W_{Δ}) \overset{+}{=} | W | = n

). Choose any regulator

R^{(Δ)}

whose first

Δ

interface outputs are

0^{Δ}

and whose remaining behavior is generated by a shortest program of length

\overset{+}{=} n

independent of

W_{Δ}

. Then

M (W_{Δ} : R^{(Δ)}) = O (log n) \quad \text{but} \quad x = 0^{N} \Rightarrow K (x) = O (log N) .

Thus, even with

M (W : R) \approx 0

(up to the usual

O (log)

slack), the coupled program can, on the realized episode, yield a low-complexity output.
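The latch construction can be simulated directly. The following is a minimal sketch under loud assumptions: zlib-compressed size is used as a crude, computable stand-in for K (it only upper-bounds description length), the hard-coded string z is a pseudorandom byte string rather than a truly incompressible one, and symbols are bytes rather than bits. The names `latch_world`, `code_length`, `r_good` and `r_bad` are hypothetical.

```python
import random
import zlib

def latch_world(regulator_out, delta, z):
    """W_Delta: emit 0^N if the regulator's first `delta` symbols are all 0, else the hard-coded z."""
    if all(s == 0 for s in regulator_out[:delta]):
        return bytes(len(z))
    return z

def code_length(x):
    """Crude stand-in for K(x): zlib-compressed size in bytes (an assumption, not K itself)."""
    return len(zlib.compress(x, 9))

N, DELTA = 4096, 16
rng = random.Random(0)
z = bytes(rng.getrandbits(8) for _ in range(N))        # stand-in for an incompressible string
tail = [rng.getrandbits(1) for _ in range(N - DELTA)]  # regulator behaviour generated independently of z
r_good = [0] * DELTA + tail                            # first DELTA interface symbols are 0
r_bad = [1] + [0] * (DELTA - 1) + tail                 # violates the latch condition

x_good = latch_world(r_good, DELTA, z)
x_bad = latch_world(r_bad, DELTA, z)
print(code_length(x_good), code_length(x_bad))
```

The regulator that happens to emit the zero prefix produces a highly compressible readout even though nothing in its construction refers to z, which is the "chance simplification" the claim describes.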

“Rare but possible” bound (balanced couplings).

Suppose the world implements a balanced dependence on the regulator’s interface in the sense that, for fixed W, the map

u \mapsto x

is a permutation of

{0, 1}^{N}

when we view

u : = O_{R} [1 : N]

as the regulator’s output sequence (e.g., the world computes

x = z \oplus u

with a fixed

z = z (W)

). If R is sampled independently and its interface sequence u is (close to) uniform on

{0, 1}^{N}

(e.g., drawn from a family with pseudorandom outputs), then by the standard Kolmogorov counting bound (at most

2^{k + 1}

N-bit strings have

K \leq k

),

Pr [K (x) \leq k] \leq 2^{k + 1 - N} .

Equivalently, the probability of a

Δ

-bit drop (

K (x) \leq N - Δ

) is

\leq 2^{1 - Δ}

. Thus, a very simple x can occur by chance, but only with exponentially small probability in the amount of simplification. (Counting bound: at most

2^{k + 1}

strings of length N have complexity

\leq k

; see Li & Vitányi (2019) [9]).
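As a worked instance of this bound, assuming x is uniform on the N-bit strings (which holds when u is uniform and u ↦ x is a permutation), the derivation and one numerical case (Δ = 20, chosen only for illustration) read:

```latex
\Pr[K(x) \leq N - \Delta]
  = \frac{\#\{x \in \{0,1\}^{N} : K(x) \leq N - \Delta\}}{2^{N}}
  \leq \frac{2^{N - \Delta + 1}}{2^{N}}
  = 2^{1 - \Delta},
\qquad
\Delta = 20 \;\Rightarrow\; 2^{-19} \approx 1.9 \times 10^{-6}.
```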

Ex-post constraint when R is invertible from $(W, x)$ .

If the coupled architecture allows recovery of R from

(W, x)

via a computable inverse (i.e., there exists a fixed decoder such that

R = G (W, x)

), then

K (x) \geq K (R ∣ W^{*}) - O (1) = K (R) - M (W : R) - O (log n) .

Hence, with

K (R) \overset{+}{=} n

and

M (W : R) \approx 0

, a large drop in

K (x)

cannot occur under such couplings (invertible in R). When a very low-complexity x is observed in this case, it forces

M (W : R)

to be large. (Identity used:

M (X : Y) = K (X) - K (X ∣ Y^{*}) \pm O (log)

).
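The inequality above can be unpacked in two steps. In this sketch W^{*} denotes a shortest program for W and G is the fixed decoder assumed in the text; the final line uses the assumptions K(R) = n up to O(1) and M(W : R) = O(log n).

```latex
\begin{aligned}
K(R \mid W^{*}) &\leq K(x \mid W^{*}) + O(1) \leq K(x) + O(1)
  && \text{(since } R = G(W, x) \text{ and } W \text{ is computable from } W^{*}\text{)} \\
K(R \mid W^{*}) &= K(R) - M(W : R) \pm O(\log n)
  && \text{(identity for } M\text{)} \\
\Rightarrow\quad K(x) &\geq K(R) - M(W : R) - O(\log n) \geq n - O(\log n).
\end{aligned}
```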

References

1. Ruffini, G. An Algorithmic Information Theory of Consciousness. Neurosci. Conscious. 2017, 2017, nix019.
2. Ruffini, G.; Lopez-Sola, E. AIT Foundations of Structured Experience. J. Artif. Intell. Conscious. 2022, 9, 153–191.
3. Ruffini, G.; Castaldo, F.; Vohryzek, J. Structured Dynamics in the Algorithmic Agent. Entropy 2025, 27, 90.
4. Friston, K. A Free Energy Principle for Biological Systems. Entropy 2012, 14, 2100–2121.
5. Parr, T. Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; The MIT Press: Cambridge, MA, USA, 2022.
6. Gács, P.; Tromp, J.; Vitányi, P.M.B. Algorithmic Statistics. IEEE Trans. Inf. Theory 2001, 47, 2443–2463.
7. Barron, A.R.; Rissanen, J.; Yu, B. The Minimum Description Length Principle in Coding and Modeling. IEEE Trans. Inf. Theory 1998, 44, 2743–2760.
8. Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007.
9. Li, M.; Vitányi, P.M.B. An Introduction to Kolmogorov Complexity and Its Applications, 4th ed.; Texts in Computer Science; Springer: Cham, Switzerland, 2019.
10. Willems, J.C.; Rapisarda, P.; Markovsky, I.; De Moor, B.L.R. A Note on Persistency of Excitation. Syst. Control Lett. 2005, 54, 325–338.
11. Conant, R.C.; Ashby, W.R. Every good regulator of a system must be a model of that system. Int. J. Syst. Sci. 1970, 1, 89–97.
12. Baez, J. The Internal Model Principle. Azimuth 2016. Available online: https://johncarlosbaez.wordpress.com/2016/01/27/the-good-regulator-theorem/ (accessed on 15 February 2026).
13. Francis, B.A.; Wonham, W.M. The internal model principle for linear multivariable regulators. Appl. Math. Optim. 1975, 2, 170–194.
14. Francis, B.A.; Wonham, W.M. The internal model principle of control theory. Automatica 1976, 12, 457–465.
15. Sontag, E.D. Adaptation and regulation with signal detection implies internal model. Syst. Control Lett. 2003, 50, 119–126.
16. Bin, M.; Huang, J.; Isidori, A.; Marconi, L.; Mischiati, M.; Sontag, E.D. Internal Models in Control, Bioengineering, and Neuroscience. Annu. Rev. Control Robot. Auton. Syst. 2022, 5, 55–79.
17. Isidori, A.; Byrnes, C. Output regulation of nonlinear systems. IEEE Trans. Autom. Control 1990, 35, 131–140.
18. Huang, J. Nonlinear Output Regulation: Theory and Applications; Number 8 in Advances in Design and Control; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2004.
19. Priscoli, F.D.; Marconi, L.; Isidori, A. Adaptive Observers as Nonlinear Internal Models. Syst. Control Lett. 2006, 55, 640–649.
20. Solomonoff, R.J. A Formal Theory of Inductive Inference. Part I. Inf. Control 1964, 7, 1–22.
21. Zvonkin, A.K.; Levin, L.A. The Complexity of Finite Objects and the Development of the Concepts of Information and Randomness by Means of the Theory of Algorithms. Russ. Math. Surv. 1970, 25, 83–124.
22. Ziv, J.; Lempel, A. A Universal Algorithm for Sequential Data Compression. IEEE Trans. Inf. Theory 1977, 23, 337–343.
23. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006.
24. Vereshchagin, N.; Vitányi, P.M.B. Kolmogorov’s Structure Functions and Model Selection. IEEE Trans. Inf. Theory 2004, 50, 3265–3290.
25. Solomonoff, R.J. A Formal Theory of Inductive Inference. Part II. Inf. Control 1964, 7, 224–254.
26. Ruffini, G. Navigating Complexity: How Resource-Limited Agents Derive Probability and Generate Emergence. OSF Preprints 2024.
27. Gray, R.M. Entropy and Information Theory, 2nd ed.; Springer: New York, NY, USA, 2011.
28. Åström, K.J.; Murray, R.M. Feedback Systems: An Introduction for Scientists and Engineers; Princeton University Press: Princeton, NJ, USA, 2008.
29. Chaitin, G.J. A Theory of Program Size Formally Identical to Information Theory. J. ACM 1975, 22, 329–340.
30. Hinton, G.E.; Zemel, R.S. Autoencoders, minimum description length and Helmholtz free energy. In Proceedings of the Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 1993; Volume 6, pp. 3–10.
31. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.J.; Johnston, N. Variational image compression with a scale hyperprior. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
32. Gardner, M. Mathematical Games: The fantastic combinations of John Conway’s new solitaire game “Life”. Sci. Am. 1970, 223, 120–123.
33. Chan, B.W.C. Lenia: Biology of artificial life. Complex Syst. 2019, 28, 251–286.
34. Plantec, E.; Hamon, G.; Etcheverry, M.; Oudeyer, P.Y.; Moulin-Frier, C.; Chan, B.W.C. Flow Lenia: Mass conservation for the study of virtual creatures in continuous cellular automata. arXiv 2023, arXiv:2212.07906.
35. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: San Mateo, CA, USA, 1988.
36. Kirchhoff, M.; Parr, T.; Palacios, E.; Friston, K.; Kiverstein, J. The Markov blankets of life: Autonomy, active inference and the free energy principle. J. R. Soc. Interface 2018, 15, 20170792.
37. Vitányi, P.M.B. Conditional Kolmogorov Complexity and Universal Probability. Theor. Comput. Sci. 2013, 501, 93–100.
38. Hutter, M. On Universal Prediction and Bayesian Confirmation. Theor. Comput. Sci. 2007, 384, 33–48.
39. Anderson, B.D.O.; Moore, J.B. Optimal Control: Linear Quadratic Methods; Prentice Hall: Englewood Cliffs, NJ, USA, 1990.
40. Ng, A.Y.; Russell, S.J. Algorithms for Inverse Reinforcement Learning. In Proceedings of the 17th International Conference on Machine Learning (ICML); Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2000; pp. 663–670.
41. Abbeel, P.; Ng, A.Y. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the 21st International Conference on Machine Learning (ICML); Association for Computing Machinery: New York, NY, USA, 2004; pp. 1–8.
42. Ziebart, B.D.; Maas, A.; Bagnell, J.A.; Dey, A.K. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2008; pp. 1433–1438.
43. Afriat, S.N. The Construction of Utility Functions from Expenditure Data. Int. Econ. Rev. 1967, 8, 67–77.
44. Varian, H.R. The Nonparametric Approach to Demand Analysis. Econometrica 1982, 50, 945–973.
45. Debreu, G. Representation of a Preference Ordering by a Numerical Function. In Decision Processes; Thrall, R.M., Coombs, C.H., Davis, R.L., Eds.; John Wiley & Sons: New York, NY, USA, 1954; pp. 159–165.
46. Savage, L.J. The Foundations of Statistics; John Wiley & Sons: New York, NY, USA, 1954.
47. Karni, E.; Schmeidler, D. An Expected Utility Theory for State-Dependent Preferences. Theory Decis. 2016, 81, 467–478.
48. Ruffini, G. Lempel-Zip Complexity Reference. arXiv 2017, arXiv:1707.09848.
49. Li, M.; Chen, X.; Li, X.; Ma, B.; Vitanyi, P. The Similarity Metric. arXiv 2004, arXiv:cs/0111054.
50. Cilibrasi, R.; Vitanyi, P. Clustering by Compression. arXiv 2004, arXiv:cs/0312044.
51. Soler-Toscano, F.; Zenil, H.; Delahaye, J.P.; Gauvrit, N. Calculating Kolmogorov Complexity from the Output Frequency Distributions of Small Turing Machines. PLoS ONE 2014, 9, e96223.
52. Zenil, H.; Hernández-Orozco, S.; Kiani, N.A.; Soler-Toscano, F.; Rueda-Toicen, A. A Decomposition Method for Global Evaluation of Shannon Entropy and Local Estimations of Algorithmic Complexity. arXiv 2016, arXiv:1609.00110.
53. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392.
54. Townsend, J.; Bird, T.; Barber, D. Practical Lossless Compression with Latent Variables using Bits Back Coding. arXiv 2019, arXiv:1901.04866.
55. Ho, J.; Lohn, E.; Abbeel, P. Compression with Flows via Local Bits-Back Coding. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019); Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2019; Available online: https://arxiv.org/abs/1905.08500.
56. Minnen, D.; Ballé, J.; Toderici, G. Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018); Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2018.
57. Habibian, A.; van Rozendaal, T.; Tomczak, J.M.; Cohen, T.S. Video Compression With Rate-Distortion Autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October 2019–2 November 2019; pp. 7032–7041.
58. Relic, L.; Azevedo, R.; Gross, M.; Schroers, C. Lossy Image Compression with Foundation Diffusion Models. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September 2024–4 October 2024.
59. Yang, R.; Mandt, S. High-Fidelity Image Compression with Score-Based Generative Models. arXiv 2023, arXiv:2305.18231.
60. Lu, M.; Guo, P.; Shi, H.; Cao, C.; Ma, Z. Transformer-based Image Compression. arXiv 2021, arXiv:2111.06707.
61. Liu, J.; Sun, H.; Katto, J. Learned Image Compression with Mixed Transformer-CNN Architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14388–14397.
62. Delétang, G.; Ruoss, A.; Duquenne, P.A.; Catt, E.; Genewein, T.; Mattern, C.; Grau-Moya, J.; Wenliang, L.K.; Aitchison, M.; Orseau, L.; et al. Language Modeling Is Compression. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024.
63. Valmeekam, C.S.K.; Narayanan, K.; Kalathil, D.; Chamberland, J.F.; Shakkottai, S. LLMZip: Lossless Text Compression Using Large Language Models. arXiv 2023, arXiv:2306.04050.
64. Sun, H.; Ma, H.; Ling, F.; Xie, H.; Sun, Y.; Yi, L.; Yan, M.; Zhong, C.; Liu, X.; Wang, G. A survey and benchmark evaluation for neural-network-based lossless universal compressors toward multi-source data. Front. Comput. Sci. 2025, 19, 197360.
65. Li, Z.; Huang, C.; Wang, X.; Hu, H.; Wyeth, C.; Bu, D.; Yu, Q.; Gao, W.; Liu, X.; Li, M. Lossless Data Compression by Large Models. Nat. Mach. Intell. 2025, 7, 794–799.
66. Zeghidour, N.; Luebs, A.; Omran, A.; Skoglund, J.; Tagliasacchi, M. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 495–507.
67. Ma, Y.; Øland, A.; Ragni, A.; Sette, B.M.D.; Saitis, C.; Donahue, C.; Lin, C.; Plachouras, C.; Benetos, E.; Shatri, E.; et al. Foundation Models for Music: A Survey. arXiv 2024, arXiv:2408.14340.
68. Tadaki, K. A Statistical Mechanical Interpretation of Algorithmic Information Theory. J. Phys. Conf. Ser. 2010, 201, 012006.
69. Fortnow, L. Kolmogorov Complexity. In Aspects of Complexity: Minicourses in Algorithmics, Complexity and Computational Algebra; Downey, R., Hirschfeldt, D., Eds.; de Gruyter Series in Logic and Its Applications; de Gruyter: Berlin, Germany; New York, NY, USA, 2001; Volume 4, pp. 73–86.
70. Xie, Y. Source Coding and Kraft Inequality. Lecture Notes, ECE 587. 2012. Available online: https://www2.isye.gatech.edu/~yxie77/ece587/Lecture7.pdf (accessed on 15 February 2026).
71. Singh, A. Lecture 7: Prefix Codes, Kraft–McMillan Inequality. Course Notes, 10-704 Machine Learning. 2016. Available online: https://www.cs.cmu.edu/~aarti/Class/10704/lec7-kraft.pdf (accessed on 15 February 2026).
72. Sterkenburg, T.F. Solomonoff Prediction and Occam’s Razor. Philos. Sci. 2017, 84, 459–479.
73. Hutter, M.; Legg, S.; Vitányi, P.M.B. Algorithmic Probability. Scholarpedia 2007, 2, 2572.
74. Williams, P.L.; Beer, R.D. Nonnegative Decomposition of Multivariate Information. arXiv 2010, arXiv:1004.2515.
75. Bennett, C.H.; Gács, P.; Li, M.; Vitányi, P.M.B.; Zurek, W.H. Information Distance. IEEE Trans. Inf. Theory 1998, 44, 1407–1423.

Figure 1. Regulation scenario. (A) A good regulator R interacts with the world W so that the readout

x = O_{W}

of the world’s output is clamped to a simple, highly compressible sequence (e.g., almost all zeros). (B) When the regulator is turned off, the output is more complex.

Figure 2. To connect the IMP and the AIT formulation used here, we view the World W as a box containing E and P; the Regulator/Controller R (or C) is a separate box. Arrows depict Forcing (

E \to P

), Ref (

E \to

sum), the Error path (sum ↓ to the world boundary and

\to R

), and Control (

R \to P

).

Table 1. Side-by-side comparison of the classical Good Regulator Theorem (GRT), the Internal Model Principle (IMP), and an Algorithmic-Information-Theoretic Regulator Theorem (ART). Primary sources: Conant & Ashby (1970) [11], Francis & Wonham (1975) [13], Francis & Wonham (1976) [14], Sontag (2003) [15], and Li & Vitányi (2019) [9].

Aspect	GRT (Conant–Ashby, 1970)	IMP (Francis–Wonham, 1975/76; Sontag, 2003)	ART (Algorithmic, This Work)
Setting/Objects	System S, Regulator R, Disturbances/Inputs D, Outcomes Z. Mapping $ψ : (S, R) \mapsto Z$ ; compare regulators by entropy of Z.	Plant P in feedback with Controller C; exogenous signals from an exosystem E; regulated output y and error $e = r - y$ .	World W and Regulator R are deterministic causal prefix programs (3-tape UTM) that interact over interface tapes for horizon N; readout $x = O_{W, R}^{(N)}$ .
Symbols (explicit)	S (system), R (regulator), D (disturbance/input), Z (outcome), $H (\cdot)$ (Shannon entropy).	P (plant), E (exosystem/signal generator), C (controller), y (regulated output), signal class $U$ (e.g., steps/sinusoids/polynomials).	W (shortest world program), R (shortest regulator program), $x : = O_{W, R}^{(N)}$ (ON readout), $y : = O_{W, Ø}^{(N)}$ (OFF readout), $K (\cdot)$ (prefix complexity), $M (\cdot : \cdot)$ (mutual algorithmic information).
Definition of “model”	Deterministic mapping/homomorphism $h : S \to R$ that preserves task-relevant structure so outcomes have low entropy.	Internal model: a dynamical subsystem embedded in C that reproduces E (controller contains a copy of E’s dynamics; in LTI, matching poles such as integrators/resonators).	Algorithmic model (program): R shares computable structure with W—formally $M (W : R) > 0$ (equivalently $K (W ∣ R) < K (W)$ ); no need for a literal dynamical replica.
Notion of “goodness”	“Maximally successful and simple”: minimize $H (Z)$ and avoid unnecessary regulator randomness/complexity.	Perfect regulation for a specified class $U$ (exact asymptotic tracking/disturbance rejection, robustness in class).	Compressibility of realized readout: good if $K (x)$ is small at the chosen N; use contrastive gap $Δ : = K (O_{W, Ø}^{(N)}) - K (O_{W, R}^{(N)}) > 0$ .
Core Theorem Statement	Among regulators that minimize $H (Z)$ and are simplest, there is a deterministic $h : S \to R$ ; informally: “every good regulator is (contains) a model of the system.”	Necessity: perfect regulation for class $U$ requires C to embed a copy of E (an internal model).	Algorithmic necessity: with ON x and OFF complexity $K (O_{W, Ø}^{(N)}) = b$ , the universal posterior obeys $Pr ((W, R) ∣ x, E_{b}^{R}) \leq C 2^{M (W : R)} 2^{- Δ}$ . Thus sustained $Δ > 0$ makes low $M (W : R)$ exponentially unlikely; on the realized episode, maximizing ON over OFF likelihood is equivalent (up to $O (1)$ ) to minimizing $K (x)$ (i.e., maximizing $Δ$ ).
Assumptions	Z is well-defined from $(S, R)$ and disturbances; regulators compared by $H (Z)$ and simplicity [11].	Typically finite-dimensional LTI; stabilizable/detectable; E autonomous and neutrally stable; exact asymptotic tracking/rejection for $U$ ; robustness in a plant neighborhood [13,14,15].	Deterministic closed coupling; fixed universal prefix machine and horizon N; $W, R$ are minimal self-delimiting programs; constant-overhead wrapper for $(W, R, N) \mapsto O_{W, R}^{(N)}$ ; diagnostic readout (contrast usable). In practice, estimate $K (\cdot)$ with fixed MDL codelengths.
Restrictions/Limitations	“Model” notion is weak (mapping); success tied to entropy of Z (can reward trivial predictable outcomes); no explicit stability claims.	Sharpest for LTI; nonlinear/output-regulation extensions add local solvability/detectability/zero-dynamics stability; necessity generally local/structural.	Information-theoretic (not structural) necessity; strength depends on diagnostic $Δ$ ; $K (\cdot)$ uncomputable (use fixed compressor/MDL); single-episode statements (with probabilistic tilt).
Scope/Use	Conceptual cybernetics link: regulation ⇒ representation (model-building is compulsory).	Design backbone for robust regulation (integral action, embedded oscillators); concrete synthesis constraints.	Distribution-free, single-episode diagnostics; empirical recipe: fix a lossless compressor, quantize the readout, compute ON/OFF code lengths, and use $Δ$ as evidence of model content; complements IMP with the universal Occam calculus of AIT [9].
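
The "exponentially unlikely" reading in the AIT column can be made explicit by taking base-2 logarithms of the posterior bound quoted in the Core Theorem Statement row. The lines below are only a rearrangement of that bound; the threshold $m$ is a symbol introduced here for illustration and does not appear in the original statement.

```latex
\begin{align}
\Pr\big((W,R) \mid x, E_b^R\big) &\le C\, 2^{M(W:R)}\, 2^{-\Delta} \\
\Longrightarrow\quad
M(W:R) &\ge \Delta - \log_2 C - \log_2 \frac{1}{\Pr\big((W,R) \mid x, E_b^R\big)} \\
M(W:R) \le m
\quad&\Longrightarrow\quad
\Pr\big((W,R) \mid x, E_b^R\big) \le C\, 2^{\,m - \Delta}
\end{align}
```

For example, any single pair with $M (W : R) \leq Δ - 20$ has posterior at most $C 2^{- 20}$, roughly $10^{- 6} C$. This is a per-pair bound and says nothing by itself about the total mass of all such pairs.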

Table 2. Mapping the IMP triple $(E, C, P)$ and the AIT $(W, R)$ view to a simple thermostat. IMP emphasizes an internal model of the exosystem E for exact regulation over a signal class; AIT treats $W = (P, E)$ jointly and assesses regulation by a compressibility advantage $Δ$.

Role	IMP Language	AIT Language (This Work)	Thermostat Instantiation
Exogenous generator	Exosystem E: autonomous generator of references/disturbances (no feedback from C); exact regulation is defined w.r.t. a signal class $U$.	Folded into the World W; no architectural split is required (though this subpart may still be identified conceptually).	Reference $r (t)$: setpoint schedule (often clock-driven). Disturbances: outdoor temperature, solar load, occupancy heat gains.
Plant	Plant P: room thermal dynamics + actuator/sensor; used for stabilization/shaping.	Also inside World W.	R–C (thermal) model, heater actuation, heat losses, sensor dynamics/delay.
Controller/Regulator	Controller C (the regulator in IMP).	Regulator R.	Thermostat logic: bang-bang with hysteresis, PI/TPI, or scheduled control.
Measured output	y.	World readout x extracted from the transcript (often $x = y$ or the error string $e_{1 : T}$).	Indoor temperature $T_{in}$ (or a weighted error signal).
Error/objective	$e = r - y$; IMP concerns asymptotic $e \to 0$ for all $r, d$ in the class $U$ (internal model must match E).	Score regulation by compressibility of the chosen readout x with R ON vs. an OFF baseline ($R = ⌀$). Define the gap $Δ = K (x_{off}) - K (x_{on})$ (practically, use a fixed MDL code $L_{C}$ in place of K).	Good thermostat ⇒ $x_{on}$ (e.g., temperature or error) stays near a regular deadband pattern ⇒ shorter code than the null/open-loop case (heater OFF or fixed duty).
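
The thermostat column, together with the empirical recipe in Table 1 (fix a lossless compressor, quantize the readout, compare ON/OFF code lengths), can be illustrated with a small simulation. The following is a minimal sketch under loudly stated assumptions, not the paper's procedure: the first-order room model, all numeric constants, the 1 °C quantization, and the use of zlib as the fixed code $L_{C}$ are hypothetical choices made here for illustration, and a zlib code length is only a crude stand-in for $K$.

```python
import math
import random
import zlib

# All constants below are illustrative assumptions, not values from the paper.
SETPOINT, BAND = 20.0, 0.1   # hysteresis thresholds at 19.9 and 20.1 degrees C
LEAK, HEAT = 0.01, 0.3       # per-step heat-loss coefficient and heater gain


def simulate(n_steps, regulator_on, seed=0):
    """Toy first-order room model; returns the indoor temperature trace."""
    rng = random.Random(seed)          # same seed => same disturbance ON and OFF
    t_in, heater = SETPOINT, False
    temps = []
    for k in range(n_steps):
        # Exogenous part of the world: slow outdoor cycle plus bounded noise.
        t_out = 8.0 + 6.0 * math.sin(2 * math.pi * k / 1000) + rng.uniform(-3.0, 3.0)
        if regulator_on:
            # Bang-bang thermostat with hysteresis (regulator R).
            if t_in < SETPOINT - BAND:
                heater = True
            elif t_in > SETPOINT + BAND:
                heater = False
        t_in += LEAK * (t_out - t_in) + (HEAT if heater else 0.0)
        temps.append(t_in)
    return temps


def quantize(temps, step=1.0, offset=50):
    """Map temperatures to one byte per sample (bin width `step`, clamped to 0..255)."""
    return bytes(max(0, min(255, round(t / step) + offset)) for t in temps)


def code_length_bits(data):
    """Fixed-compressor code length L_C(x) in bits, used as a proxy for K(x)."""
    return 8 * len(zlib.compress(data, 9))


x_off = quantize(simulate(4000, regulator_on=False))  # null baseline: heater never on
x_on = quantize(simulate(4000, regulator_on=True))    # thermostat active
delta_hat = code_length_bits(x_off) - code_length_bits(x_on)

print("L_C(x_off) bits:", code_length_bits(x_off))
print("L_C(x_on)  bits:", code_length_bits(x_on))
print("estimated gap  :", delta_hat)
```

With these particular constants the regulated temperature stays inside a single 1 °C bin, so the ON readout is a constant string, while the OFF readout drifts with the outdoor disturbance. The estimated gap depends on the compressor, the quantization step, and the chosen baseline, so it should be read as evidence relative to those fixed choices rather than as a measurement of $K$.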

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ruffini, G. The Algorithmic Regulator. Entropy 2026, 28, 257. https://doi.org/10.3390/e28030257

4.1.2. Practical Estimation of K and the Gap $Δ$