1. Introduction
Much of machine learning focuses on
inductive inference: fitting a function to past data and expecting it to generalize to similar inputs. While valuable, this perspective is incomplete. In an agentic setting, we want a pre-trained model to
reason at inference time about the structure of a
specific instance of a novel task, rather than simply apply a map fixed during training. From this alternative viewpoint, the role of learning shifts from building a fixed classifier to equipping a solver with the knowledge needed to efficiently tackle previously unseen problems. This perspective draws on what Vapnik termed
transductive inference [
1]: reasoning directly about specific instances rather than applying a mapping fixed at training time.
Many tasks in machine learning can be, and historically have been, approached inductively: classifying an object, for example, relies mainly on similarity with past observations rather than reasoning about the instance itself. However, AI is increasingly tackling problems where this framework falls short. Consider writing code that passes a given set of unit tests, proving a theorem with access to a proof checker, or finding a protein configuration that minimizes an energy function. In these cases, the task is not implicitly defined by past examples, as in inductive learning, but rather by a known (at least approximately) verifier or reward signal capable of evaluating any candidate solution. More precisely, each problem instance x is paired with a function $f(x, \cdot)$ that can score or verify any candidate solution y.
This setting poses a fundamentally different challenge from inductive learning. Finding a correct solution is, in principle, always achievable: one could enumerate candidates y until one satisfies $f(x, y) = 1$. This brute-force strategy guarantees success, provided we are willing to wait a time exponential in the length of the solution. Unlike in inductive learning, the core difficulty is therefore not accuracy or generalization, which can always be ensured given enough time, but computational efficiency: quickly finding a solution to a previously unseen task f on an instance x.
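The enumerate-and-verify strategy can be sketched in a few lines. This is a toy illustration of the idea only; the bit-string task and the verifier are ours, not from the text:

```python
from itertools import product

def brute_force_solve(verifier, max_len=12, alphabet="01"):
    """Enumerate candidate solutions y in order of increasing length
    until the verifier accepts one. Success is guaranteed whenever a
    solution exists, but the time is exponential in its length."""
    for n in range(max_len + 1):
        for candidate in product(alphabet, repeat=n):
            y = "".join(candidate)
            if verifier(y):
                return y
    return None  # no witness within the length budget

# Toy verifier: accept any bit string encoding the number 42.
solution = brute_force_solve(lambda y: y != "" and int(y, 2) == 42)
```

The enumeration is exhaustive over candidates of length at most `max_len`, mirroring the guaranteed-but-exponential search the text describes.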
For any fixed task, one could of course design or train a bespoke solver, carefully optimized for that specific problem, that finds solutions extremely fast. But could there exist a general solver that tackles any unseen task nearly as efficiently as the best task-specific solver?
That would seem too good to be true, and likely to violate some kind of “no free lunch theorem.” Yet, Levin [2] and Solomonoff [3] showed that a universal solver U can solve any instance x of a task essentially as well as a solver A that is optimal for that task. In particular, the universal solver finds a solution in time bounded by
$$t_U(x) \le 2^{\ell(A)}\, t_A(x),$$
where $\ell(A)$ is the description length of A and $t_A(x)$ is the optimal time to find a solution to that instance. Crucially, the universal solver requires only a constant factor longer than a task-specific solver, where the constant depends on the complexity of the optimal solver, not on the particular instance x. The catch is that such a constant factor $2^{\ell(A)}$ can be astronomically large.
This is where learning comes in: in [3], Solomonoff observed that, even if a task has never been faced before, prior experience lets us encode effective problem-solving programs more succinctly (e.g., by reusing components of the solution), thereby reducing the factor $2^{\ell(A)}$. Thus, in the transductive setting with no uncertainty on the reward, the value of learning is measured not by a reduction in error rate, as in induction, but by the reduction in the time it takes to find solutions to unforeseen tasks. This points to a foundational principle for transductive learning: rather than trying to capture the statistical structure (joint distribution) of past data in the hope that future data will respect it, as in induction, in transduction we want to capture the shared algorithmic structure of past data, given which an agent can reason to quickly find solutions to new computable tasks. Accordingly, to be able to solve general unseen tasks, an agent should learn to perform transductive inference; that is, it should “learn to reason”.

Learning transductive inference as envisioned by Solomonoff in 1984 [
3] has been mostly an academic concern for decades because it seems to require (meta-)learning a conditional distribution over
programs that, when run through a universal Turing machine (UTM), solve a given task. This has been impractical until recently. But modern large-scale models, such as large language models (LLMs) or reasoning models, can effectively encode distributions over programs. Moreover, they themselves serve, loosely speaking, as a powerful new type of
computation engine—planning, searching, calling tools, and coordinating multi-step reasoning—which, however, fundamentally differs from UTMs and does not cleanly fit either the theory of Solomonoff and Levin or the classical mold of inductive learning.
In this work, we embrace the transductive view and formalize learning in a way that fits modern reasoning models. We show in what sense an LLM, despite being stochastic and in many ways antithetical to Turing machines, is a valid model of universal computation. We then contextualize Levin’s guarantees and Solomonoff’s vision using LLMs instead of UTMs (which requires entirely different proof techniques), study resource-aware objectives, and ground these ideas in practical algorithms that trade inference-time computation for generalization to novel problems not seen during training.
Our analysis focuses on universal solvers in a verifiable setting, where an oracle is provided to compute the reward or verify a solution. This restriction yields a clean theoretical framework, but it is admittedly narrow: most real-world problems involve both an inductive component—reducing uncertainty from data—and a transductive component—reasoning efficiently about a specific instance. Classical statistical learning theory addresses the former; our theory aims to address the latter. We do not tackle the general case in which the two are intertwined, where reasoning is required and the reward, or an efficient approximation of it, must itself be learned. Nevertheless, we believe that the basic mechanisms we identify—enabling a model to become a universally efficient solver in the spirit of Solomonoff, and the tradeoffs they reveal among learning, algorithmic information, and time—extend beyond the purely verifiable setting.
1.1. Can an LLM-Powered AI Agent Be a Universal Solver?
Levin and Solomonoff showed that a universal solver exists, but the construction hinges on using (deterministic) universal Turing machines (UTMs). LLMs, by contrast, are neither Turing machines nor deterministic, nor do they execute code in the conventional sense. Their computational mechanism is chain-of-thought (CoT) reasoning, which does not map easily to any standard computational paradigm.
To study whether an LLM-powered AI agent can be a universal solver, we need more flexible foundations. (By
agent, we generally refer to an LLM performing chain-of-thought (CoT) in order to solve a task, potentially with access to tools or an environment. These tools could be a Python executor, a proof checker, a web search, or another action to retrieve and process data, making the LLM an agent even if only the final action affects the environment. For a given model and hardware, the computing cost of the agent roughly corresponds to the number of CoT tokens generated to find a solution.) In
Section 2, we extend universal solvers from programs to general stochastic dynamical systems, allowing us to map the theory directly to LLMs with CoT. A key challenge is defining the
time that an LLM-powered agent needs to find a solution to a task: naively using the expected length of CoT leads to degenerate values. We address this by introducing a new notion of
proper time (Definition 2). This notion allows us to generalize Levin’s and Solomonoff’s results to general dynamical systems, showing in particular that LLMs can indeed power universal task solvers despite being unlike any Turing machine.
1.2. Intelligence Is About Time
Once we have secured the foundations for reasoning and computation, we turn to learning. Universal solvers of verifiable tasks are peculiar in that
no information needs to be learned to achieve universally optimal accuracy on any task. For instance, to prove a theorem, one could simply iterate over all possible proofs until a correct one appears, without ever having to study any math. Indeed, if we were to measure the information that a training dataset provides about such a task using classical notions (Shannon’s [
4] or Kolmogorov’s [
5]), we would find it to be null [
6].
The role of learning in agentic AI is instead to identify statistical or algorithmic structures that make future inference
more time-efficient. This suggests a notion, complementary to Shannon’s, that
time plays a crucial role in learning information. In particular, we show that the optimal speed-up in finding a solution that a universal solver can achieve using past data is tightly related to the
algorithmic mutual information [
5] between the data and the solution:
Theorem 1 (Information is Speed). The maximum speed-up a task universal solver can achieve in finding an optimal solution h to a task from training on a dataset D is
$$\mathrm{Speed\text{-}up}(D) = 2^{I(D:h)},$$
where $I(D:h)$ is the algorithmic mutual information between the data and the solution.
We will give detailed definitions and proofs in the next sections; for now, we call attention to the fact that data can make a solver exponentially faster, consistent with our view that learning transduction can be equated to amortizing inference computation.
1.3. Scaling Laws for Speed
Having established that past data can speed up universal solvers, we examine how much speed-up is achievable as a function of the training dataset size. This requires making modeling assumptions about the data generating process or underlying mechanism.
A common assumption, including in Solomonoff’s work, is that real data, while complex on the surface, is generated through mechanisms or phenomena with low intrinsic complexity. This would intuitively suggest that there are ‘common reusable components’ we can learn from past data to help future reasoning. This intuition is, however, severely misleading, since the maximum speed-up obtainable by a solver is bounded by the complexity of the data generating distribution:
Theorem 2 (Maximum Speed-up Bound). The maximum speed-up in finding an optimal solution h of a task sampled from a data generating process q—from which the training dataset D∼q is sampled—is
$$\mathrm{Speed\text{-}up}(D) \le 2^{K(q)},$$
where the Kolmogorov complexity $K(q)$ is the length of the shortest program for q.
Notably, if the data was generated by the Universal Prior (as in Solomonoff Induction [
7,
8]), there would be precisely nothing to learn (zero information). This challenges the fundamental assumption in generalization theory that
simplicity aids learning [
9]. While simplicity benefits
explainability, it does not necessarily improve learning effectiveness. That simpler models generalize better is a consequence of how generalization is defined in the inductive setting; recent developments in LLMs have shown that simpler models are generally less effective at transduction.
From the results above, we see that the effectiveness of learning—and the asymptotic validity of scaling laws—hinges on the data generation process having effectively unbounded complexity. If the complexity were bounded at some finite $K(q)$, scaling laws would plateau there; yet, empirically, we observe non-saturating power-law scaling [
10,
11]. This characteristic power-law trend is captured by Hilberg’s law for human-generated data [
12,
13]:
Definition 1 (Hilberg Scaling). Let $D_n$ be a training dataset of n tokens and $T_n$ be a test set of n tokens; then, the algorithmic mutual information
$$I(D_n : T_n) = \Theta(n^\beta)$$
grows unbounded according to some distribution-specific rate $\beta$.
We introduce a generalization of this conjecture for arbitrarily sized $D_n$ and $T_m$ and prove the following scaling law for speed after learning from data:
Theorem 3 (Power Law of Inference vs. Training). Let h be a chain-of-thought trajectory solving a task and let $\ell(h)$ be its length. If the generalized Hilberg’s law holds, the log-speed-up from training on n tokens is
$$\log \mathrm{Speed\text{-}up}(n) = \Theta(n^\beta).$$
This result provides a
strong theoretical justification for the empirically observed power-law scaling of inference time versus training time in reasoning LLMs [
11] and can also be used to predict the scaling of space-bound LLMs (when the number of weights, rather than data, is the limit), thus providing guidance on how to scale resources when training universal solvers.
1.4. Inversion of Scaling Laws
The results described so far have dealt with the best model that could be learned from the data as we scale up. Empirically, models follow predicted power-law trends, suggesting near-optimal learning. But is this necessarily true? Is bigger always better?
Current scaling laws using prediction error (or perplexity) are sometimes used as a proxy for intelligence, arguing that more data, bigger models, and more computing will lead to “super-intelligence” [
14]. But, counterintuitively, as models become more powerful, learning becomes unnecessary since the model can rely more on exhaustive computation rather than insights from learned structure in the data. As ordinary scaling proceeds, better and better performance on benchmarks may come with less and less insight, all the way to the limit where infinite resources allow solving any task by brute force without any learning. More precisely, emergence of “intelligence” (in the sense of
inter legere, “to gather from among (the data)”) goes hand-in-hand with optimizing a solution under time constraints.
Theorem 4 (Learning and Time Optimization). Without time penalties, optimal inference can be achieved brute-force without any learning or insight. Conversely, any system that optimizes time must learn at least $I(D:h)$ bits from past data.
The results above reveal that plots of accuracy-versus-size, routinely used to predict progress towards “super-intelligence,” can be misleading. By ignoring the cost of time, they encourage savantry over intelligence. Intelligent behavior should instead be measured by success rate per unit time/computing. Properly accounting for the cost of time using the net reward, we see that an optimal agent balances time and accuracy rather than blindly maximizing reward through brute-force search. This optimal operating point depends on the task, the environment, and the particular user’s needs. Hence, it cannot be predetermined during training but rather must be calibrated and controlled by the user at deployment to fit the situation at hand.
The theorems above highlight a core tension: in inductive learning, achieving generalization relies on
minimizing the complexity of the trained model—that is, the amount of information it shares with the specific training set—in keeping with information theoretic generalization bounds [
15,
16]. In universal solvers, by contrast, we seek to
maximize the information the trained model acquires from the training data—within the limit of what the model can store and retrieve. Such information is beneficial because facts, patterns, or heuristics that were irrelevant to the training task may turn out to be handy when reasoning to solve future tasks.
1.5. Maximizing a Continuous Reward
Thus far we have focused on systems that need to reach a verifiable goal. More generally, a system may be trying to optimize a continuous reward function $f(x, y) \in \mathbb{R}$. This introduces two challenges: we do not know when we have found the maximum, and we do not know if the increase in reward from further exploration will compensate for the cost (which also depends on the user). Note that, while in reinforcement learning the solver is tasked with finding a policy for a particular environment (which may be partly unknown), here we seek solvers that can operate in any environment, each with its own reward; when the reward function itself is unknown, we must again face the problem of learning it inductively.
If we have no way to estimate the predicted improvement in reward as a function of additional effort, tree search algorithms are worst-case optimal [
17]. If, on the other hand, we can forecast future rewards [
18], we show that the problem can be formalized as a Pandora’s box problem [
19,
20], which provides both an optimal search strategy and an optimal stopping rule.
In the next sections, we expand on the summary above in greater detail.
2. Dynamical Systems as Computation
Let us consider an example of an LLM solving a task using Chain-of-Thought (CoT):
Example 1.
What is the largest zero of $f(x) = x^3 + 5x^2 - 74x + 120$?
I need to find the zeros of the polynomial $x^3 + 5x^2 - 74x + 120$. Let me try to factor this polynomial. First, I check if there are any rational roots using the rational root theorem.
The possible rational roots are the factors of 120 divided by the factors of 1, which means the possible rational roots are: $\pm 1$, $\pm 2$, $\pm 3$, $\pm 4$, $\pm 5$, $\pm 6$, $\pm 8$, $\pm 10$, $\pm 12$, $\pm 15$, $\pm 20$, $\pm 24$, $\pm 30$, $\pm 40$, $\pm 60$, $\pm 120$.
Let me test some of these: For $x = 1$: $f(1) = 52$. For $x = -1$: $f(-1) = 198$. For $x = 2$: $f(2) = 0$. ✓
So, $x = 2$ is a root! This means $(x - 2)$ is a factor.
Now I perform polynomial division to find the other factor: $x^3 + 5x^2 - 74x + 120 = (x - 2)(x^2 + ax + b)$.
Expanding the right side: $(x - 2)(x^2 + ax + b) = x^3 + (a - 2)x^2 + (b - 2a)x - 2b$.
Coefficient of $x^2$: $a - 2 = 5$, so $a = 7$.
Coefficient of x: $b - 2a = -74$, so $b - 14 = -74$; thus $b = -60$.
I continue factoring by finding two numbers that multiply to $-60$ and add to 7. The numbers 12 and $-5$ work perfectly, since $12 \cdot (-5) = -60$ and $12 + (-5) = 7$. This means I can split the quadratic into $(x + 12)(x - 5)$.
Combining all factors gives $x^3 + 5x^2 - 74x + 120 = (x - 2)(x + 12)(x - 5)$, revealing the zeros at $x = 2$, $x = -12$, and $x = 5$. The largest zero is at 5.
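The arithmetic in the example is easy to verify numerically. A quick check, assuming the polynomial is $f(x) = x^3 + 5x^2 - 74x + 120$ (a reconstruction consistent with the coefficient equations shown, since the original formula is not machine-readable in this copy):

```python
# Sanity check of the chain-of-thought arithmetic, assuming the
# polynomial is f(x) = x^3 + 5x^2 - 74x + 120 (reconstructed).
def f(x):
    return x**3 + 5 * x**2 - 74 * x + 120

# The rational root theorem confines integer roots to divisors of 120,
# so scanning this range finds all of them.
roots = [r for r in range(-120, 121) if f(r) == 0]
largest = max(roots)
```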
Clearly, this chain of thought is acting as some kind of computational device. However, it is nothing resembling a program, since the results of the computation are intermixed with instructions (code) and comments.
It is also not the trace of execution of an underlying algorithm, since the code is generated dynamically based on the output of previous operations. Also, unlike standard programs, the particular CoT tokens are not necessarily meaningful; just outputting dots “…” may still lead to the correct result [
21].
A more sound view is that an autoregressive LLM performing CoT is a stochastic dynamical system, where the prompt represents the initial state, the trained model backbone represents the transition probability, and each CoT trace is a sample trajectory. CoT performs computation in the sense that starting from an initial state, the system evolves until it reaches a terminating state (one where the network is confident it can answer), at which point it outputs a final answer.
Of course, it is well understood that a (deterministic or stochastic) dynamical system can solve computational tasks (Deterministic Finite Automata, Turing machines, Game of Life, etc.). But LLM systems are quite peculiar. They were not designed to solve a specific task but rather to be universal solvers: given a description of a task, the system should be able to find a solution. Moreover, rather than brute-forcing a solution, we expect it to find the fastest way to solve the problem, as well as access past information (e.g., the “rational root theorem” in the example above) to significantly speed up the solution.
2.1. Notation
In this section we introduce the notation used throughout this paper. (For those with a background in dynamical system theory, in
Appendix B, we map the notation to one more reminiscent of continuous dynamical systems). Let $x \in \mathcal{X}$ be a state in a potentially infinite state space $\mathcal{X}$, and let $t \in \mathbb{N}$ be a time index. A sequence of states $h = (x_0, x_1, \ldots, x_T)$ is called a trajectory or path. Its length is the time $\ell(h) = T$. A stochastic dynamical system is defined by the transition probability $p(x_{t+1} \mid x_t)$. We say that h is a trajectory between two states $u, v \in \mathcal{X}$ if $x_0 = u$ and $x_T = v$. The probability $P(h) = \prod_{t=0}^{T-1} p(x_{t+1} \mid x_t)$ of a trajectory is the product of the transition probabilities along the path.
The system should be able to read inputs and output answers. Let $\Sigma$ be the input/output alphabet. We assume that the system has a set of terminating states $\mathcal{F} \subset \mathcal{X}$ and a function $\mathrm{out}: \mathcal{F} \to \Sigma^*$ that, given a terminal state, generates an output. We also assume that there is a function $\mathrm{enc}: \Sigma^* \to \mathcal{X}$ that encodes the input into a state of the dynamical system. Let $\mathcal{H}$ be the set of all possible finite trajectories. We say that a trajectory $h \in \mathcal{H}$ terminates with output a—which we write as $h \Rightarrow a$—if $x_T \in \mathcal{F}$ and $\mathrm{out}(x_T) = a$.
Let $x \in \Sigma^*$ be an input. For simplicity, we assume that all trajectories starting from $\mathrm{enc}(x)$ and ending in a terminating state terminate with the same output, or with a special <error> token. This allows us to write $\nu(x) = a$, meaning the dynamical system ν starting from $\mathrm{enc}(x)$ terminates with a. While this is generally restrictive, we mainly study settings where the answer is verifiable, in which case we can trivially return an error if the output is not correct. (A more standard and less restrictive definition is to ask that the answer is correct at better than chance level; that is, we would say $\nu(x) = a$ if $P(h \Rightarrow a) > 1/2$.)
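The interface just described—an input encoder, a stochastic transition, terminating states, and an output decoder—can be made concrete with a toy system. The class and method names below (`encode`, `step`, `is_final`, `decode`) are illustrative, not part of the paper's formalism:

```python
import random

class CoinFlipSystem:
    """Toy stochastic dynamical system: states are strings of coin
    flips; a state terminates once it holds three flips; the output
    is the majority symbol. All names here are illustrative."""

    def encode(self, x):        # input encoding: Sigma* -> initial state
        return x

    def step(self, state):      # sample the next state from p(. | state)
        return state + random.choice("HT")

    def is_final(self, state):  # terminating-state predicate
        return len(state) >= 3

    def decode(self, state):    # map a terminal state to an output
        return "H" if state.count("H") > state.count("T") else "T"

def run(system, x, seed=0):
    """Sample one trajectory from encode(x) to a terminating state."""
    random.seed(seed)
    state = system.encode(x)
    while not system.is_final(state):
        state = system.step(state)
    return system.decode(state)

answer = run(CoinFlipSystem(), "")
```

Each call to `run` samples one trajectory; different seeds can yield different outputs, which is exactly why the text restricts attention to verifiable answers.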
Example 2.
Two key systems we are interested in are as follows:
LLMs. The state is the set of activations of the LLM after reading some tokens. This is the set of activations (known as the key-value (KV) cache) for an autoregressive Transformer, or more generally the hidden state for a State Space Model. The transition function generates the next token given the state $x_t$ and uses it as input to generate the state $x_{t+1}$. The final states are the states at which the LLM outputs an
<end_of_thought> token. The decoding function consists of letting the network generate the answer after
<end_of_thought>. The encoding function simply lets the LLM read the input tokens to update its initial state.
Turing Machines.
The state of a Turing machine is the content of its tape at a given time, plus its internal state. The transition function updates the tape and its internal state as usual, either deterministically in a standard Turing machine or randomly in a probabilistic machine.
We also make use of several notions from algorithmic information theory [5]. Let x be a string; its Kolmogorov complexity $K(x)$ is defined as the length (in bits) of the shortest program that terminates outputting x. Given two strings x and y, their algorithmic mutual information is $I(x : y) = K(x) - K(x \mid y)$ (up to logarithmic additive terms). This can be interpreted as how much more x can be compressed if we have already observed y. Recall that, by Shannon’s Coding Theorem, given any probability distribution P over binary strings, there is a corresponding encoding algorithm that encodes a string x in $-\log_2 P(x)$ bits.
2.2. Proper Time
As we have anticipated, transductive learning is about solving generic tasks quickly. But how do we define the time that a stochastic system needs to find a solution to the task? The question is subtle, since if we look at the length of a particular sampled trajectory, randomness can make an algorithm look arbitrarily faster or slower, without changing what it effectively computes.
Let us first consider a motivating example. Let $f: \{0,1\}^n \to \{0,1\}^n$ be a function that is easy to evaluate, but can be inverted only through brute-force search (i.e., a ‘one-way function’). Given y, the task is to find a binary string x of length n such that $f(x) = y$. A deterministic Turing machine must try all $2^n$ candidates for x, so the total expected time is $O(2^n)$. On the other hand, a stochastic machine can guess the first k bits of x and brute-force the remaining $n - k$, so every terminating trajectory has length $O(2^{n-k})$, but occurs with probability only $2^{-k}$.
The probabilistic machine can then be arbitrarily faster than the deterministic machine as measured by the trajectory length, even if, effectively, both are doing the same brute-force search. If, however, we consider the ratio $\ell(h)/P(h)$ between the length of a trajectory and its probability, we see that it remains constant at $O(2^n)$, no matter how we branch the computation path: randomness shortens paths but also makes them rarer. This invariance suggests the following definition of “proper” time for single-trajectory targets.
Definition 2 (Proper Time). Let $p(x_{t+1} \mid x_t)$ be a stochastic dynamical system. Define the proper time $\tau(u \to v)$ to reach v from u as
$$\tau(u \to v) = \min_{h: u \to v} \frac{\ell(h)}{P(h)},$$
where the minimum is over all trajectories h from u to v, or $\tau(u \to v) = \infty$ if no trajectories exist. For an input–output specification, the proper time to terminate from input x with output a is
$$\tau(x \Rightarrow a) = \min_{h: \mathrm{enc}(x) \Rightarrow a} \frac{\ell(h)}{P(h)},$$
where the minimum is over trajectories starting from $\mathrm{enc}(x)$ and terminating with output a.
This definition is closely related to Levin complexity [22] and its extension to tree search [17]. If the system is deterministic, then $P(h) = 1$ for any path, and $\tau$ reduces to standard running time. Conversely, we now show that $\tau$ indeed captures the actual computational effort required by a stochastic dynamical system when simulated deterministically.
Theorem 5 (Dynamical Systems ⇒ Turing Machines with Same τ). Let ν be a dynamical system. There is a deterministic Turing machine $M_\nu$, with access to an oracle to compute the transition probability p, such that
$$\mathrm{time}_{M_\nu}(u \to v) \le \tau(u \to v).$$
The theorem follows directly by taking $M_\nu$ to be the Turing machine that implements the algorithm in the following key lemma:
Lemma 1 (Levin Tree Search [17]). Let $u, v$ be two states. There is a deterministic algorithm A that discovers a path between them (if it exists) while visiting at most T nodes, where
$$T \le \min_{h: u \to v} \frac{\ell(h)}{P(h)} = \tau(u \to v).$$
Proof sketch. We sketch the argument underlying Levin Tree Search [17]. For a partial trajectory (search-tree node) h, we define its Levin cost:
$$L(h) = \frac{\ell(h)}{P(h)}.$$
The algorithm maintains a frontier of unexpanded prefixes and repeatedly expands the prefix having minimum $L(h)$. Along any root-to-leaf branch, $L(h)$ is nondecreasing (depth increases while probability only decreases), which implies a best-first property: all prefixes with $L(h) < L^*$ are expanded before any prefix with $L(h) > L^*$, where $L^* = \tau(u \to v)$ is the optimal goal cost. Let $\mathcal{T}$ be the (finite) search tree consisting of all expanded nodes at the moment the first goal node is expanded. Every leaf h of $\mathcal{T}$ satisfies $L(h) \le L^*$; hence, $\ell(h) \le L^* P(h)$. Moreover, the number of expanded nodes is at most the sum of the depths of leaves, $\sum_h \ell(h)$, since each leaf contributes at most one count to each of its ancestors. Therefore,
$$T \le \sum_h \ell(h) \le L^* \sum_h P(h) \le L^*,$$
assuming that leaf probabilities in a prefix tree sum to at most 1. Thus, Levin Tree Search reaches v after at most $\tau(u \to v)$ node expansions. □
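The best-first procedure sketched in the proof can be written down directly. The following is a minimal, illustrative implementation; the transition-table format `{state: [(next_state, prob), ...]}` is our own, not from the text:

```python
import heapq

def levin_tree_search(transitions, start, is_goal):
    """Best-first search that always expands the frontier prefix with
    minimal Levin cost len(h)/P(h), as in the proof sketch of Lemma 1.
    Returns (goal_state, number_of_expansions)."""
    # Frontier entries: (levin_cost, path_length, state, path_probability)
    frontier = [(0.0, 0, start, 1.0)]
    expanded = 0
    while frontier:
        cost, length, state, prob = heapq.heappop(frontier)
        expanded += 1
        if is_goal(state):
            return state, expanded
        for nxt, p in transitions.get(state, []):
            q = prob * p
            heapq.heappush(frontier, ((length + 1) / q, length + 1, nxt, q))
    return None, expanded  # goal unreachable from start

# Two-step chain with a distracting low-probability branch.
chain = {"a": [("b", 0.9), ("x", 0.1)], "b": [("c", 0.9), ("x", 0.1)]}
goal, n_expanded = levin_tree_search(chain, "a", lambda s: s == "c")
```

Note how the high-probability prefixes a→b→c are all expanded before the cheap-to-reach but improbable branch to "x", exactly the best-first property used in the proof.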
Since all computation today is executed on deterministic logic hardware, Theorem 5 validates $\tau$ as the “proper” way to measure time for a stochastic dynamical system. (The name has a loose analogy with relativistic proper time: like its relativistic counterpart, our $\tau$ mixes temporal and probabilistic ‘distances,’ providing a representation-invariant clock.) It also frames $\tau$ as a fundamental property of the algorithm we are executing, rather than a function of the stochasticity of its implementation.
Remark 1.
It is useful to compare proper time with other candidate measures of computational cost. Consider a system that, with high probability $1 - \epsilon$, terminates in a short number of steps $\ell_0$ but, with small probability ϵ, enters an infinite, non-terminating chain where every transition has probability one. If we sample a random trajectory, the expected time is infinite: even though the non-terminating trajectory is entered with low probability, its infinite length dominates the expectation. (The same conclusion holds even when all trajectories are forced to terminate. For example, suppose a system has probability $2^{-t}$ of entering a trajectory of length $2^t$. Then, $\mathbb{E}[\ell] = \sum_t 2^{-t} \cdot 2^t = \infty$.) Likewise, the Levin-style quantity $\mathbb{E}_h[\ell(h)/P(h)]$ diverges: along the non-terminating chain, the trajectory probability stays bounded away from zero (since each transition has probability one), yet the time grows without bound, so their ratio diverges.
The proper time $\tau$, by contrast, remains bounded and close to $\ell_0$: the terminating trajectory has length $\ell_0$ and probability $1 - \epsilon$, giving $\tau \le \ell_0/(1 - \epsilon)$. Intuitively, proper time discounts trajectories by their probability of being reached, rather than conditioning on having entered them, and so it is not held hostage by low-probability pathological paths.
This distinction has a practical consequence: to harness the computational power of a stochastic system, it is not enough to sample trajectories from it. Instead, a dovetailing strategy, as in Lemma 1, should be used (in the next section, we also show that a sampling approach with a suitable restart schedule can achieve close to optimal time). More broadly, this highlights that a standalone LLM is not itself an optimal solver since its time to terminate can be arbitrarily large. Rather, an optimal solver agent is constructed from the LLM by wrapping it in an appropriate search procedure. (This construction requires the ability to restart or roll back trajectories, a property naturally satisfied by today’s dominant agentic settings—code execution with unit tests, theorem proving with proof checkers, and tool-augmented reasoning in sandboxed environments—where the agent’s actions do not have irreversible side-effects on the external world).
Note that using Levin Tree Search increases the memory requirements compared to the original stochastic system since we have to explore multiple states in parallel. The total amount of states to keep in memory is bounded by the proper time τ itself (Lemma 1), and the bound is tight for a shallow tree with a high branching factor. However, note that time guarantees similar to Levin Tree Search can be obtained without increased memory usage by, for example, using sampling with a restart schedule.
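As a concrete numeric illustration of Definition 2, proper time can be computed exactly on a small finite system by enumerating paths. This is a sketch only, practical for tiny systems, and the transition table is our own toy example:

```python
def proper_time(transitions, start, goal, max_len=10):
    """Compute tau(start -> goal) = min over paths h of len(h)/P(h)
    by exhaustively enumerating paths of at most max_len steps."""
    best = float("inf")
    stack = [(start, 0, 1.0)]  # (state, path length, path probability)
    while stack:
        state, length, prob = stack.pop()
        if state == goal and length > 0:
            best = min(best, length / prob)
            continue
        if length >= max_len:
            continue
        for nxt, p in transitions.get(state, []):
            stack.append((nxt, length + 1, prob * p))
    return best

# a -> b with prob 0.5 (else stay at a); b -> c with prob 0.25 (else stay).
chain = {"a": [("b", 0.5), ("a", 0.5)], "b": [("c", 0.25), ("b", 0.75)]}
tau = proper_time(chain, "a", "c")  # direct path: length 2, probability 0.125
```

Here the shortest path a→b→c has length 2 and probability 0.125, so τ = 16; longer, loopier paths only increase the ratio, matching the minimization in Definition 2.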
In a deterministic system, the time (path length) between states is a distance. Similarly, the following lemma establishes that, for states $u, y, z$, the proper time to go from u to z cannot be greater than the product of the proper times to first go from u to y and then from y to z. It will play an important role in multiple proofs.
Lemma 2 (Proper Time is Submultiplicative). Let $u, y, z$ be three states. Then,
$$\tau(u \to z) \le \tau(u \to y)\, \tau(y \to z).$$
Proof. Let $h_1$ and $h_2$ be paths that realize the minimum in the definition of $\tau(u \to y)$ and $\tau(y \to z)$, respectively. We can construct the path $h = h_1 h_2$ composing the two paths. By Definition 2, we have
$$\tau(u \to z) \le \frac{\ell(h)}{P(h)} = \frac{\ell(h_1) + \ell(h_2)}{P(h_1)\, P(h_2)} \le \frac{\ell(h_1)}{P(h_1)} \cdot \frac{\ell(h_2)}{P(h_2)} = \tau(u \to y)\, \tau(y \to z),$$
where in the second inequality, we used the fact that $\ell(h_1) + \ell(h_2) \le \ell(h_1)\, \ell(h_2)$ whenever $\ell(h_1) \ge 2$ and $\ell(h_2) \ge 2$, which is automatically satisfied when the states are distinct. If two or more states are the same, the property can be easily checked by hand. □
This also implies that $\log \tau(u \to z) \le \log \tau(u \to y) + \log \tau(y \to z)$, which makes $\log \tau$ an asymmetric distance between states. Note that $\tau$ is sub-multiplicative, while deterministic time is sub-additive. This is because, in a stochastic system, time may be dominated by the time to find a suitable combination of paths, and the probability of the composition of two paths is a product.
These two lemmas suffice to prove the key theorems in the rest of this work, including providing a straightforward construction for a generalization of Solomonoff–Levin Universal Search Algorithm [
2,
3].
2.3. Multiple Successful Paths
The quantity $\tau$ in Definition 2 measures the cost to uncover a particular trajectory. Many tasks, however, accept any trajectory leading to one of the final states $\mathcal{F}$. In that setting, multiple distinct paths can succeed, and the right notion aggregates their probabilities. Let
$$p_t(x) = P(\text{a trajectory from } \mathrm{enc}(x) \text{ terminates successfully within } t \text{ steps})$$
be the success-by-time curve. If we run independent trials of length t (restarting after t steps), we need in expectation $1/p_t(x)$ trials for one success, for total expected work $t/p_t(x)$. Optimizing over the cutoff gives a canonical baseline:
$$\tau^*(x) = \min_t \frac{t}{p_t(x)}.$$
This general notion (i) strictly improves over any single-path bound when many solutions exist, and (ii) collapses to the proper time $\tau$ when there is effectively one successful path. In principle, to simulate the stochastic system in total time $\tau^*(x)$, we would need the unknown optimal cutoff t. However, universal Luby-type restart schedules [17,23] achieve expected work within a logarithmic factor of the optimum fixed-cutoff policy:
$$\mathbb{E}[\text{work}] = O(\tau^*(x) \log \tau^*(x)).$$
Thus, $\tau^*$ characterizes intrinsic difficulty ‘up to logs.’ For clarity of exposition, in the rest of this paper, we focus on $\tau$, but all results extend naturally to $\tau^*$.
2.4. Universal Dynamical Systems
A key part of defining a computation system is the existence of universal systems. Generally, we want a system in the class to be able to simulate any other system in the same class. However, since time is a key quantity for us, we need to ensure the time to simulate is similar to the original time.
Definition 3 (Linear-Time Universal Dynamical System). Let ν be a dynamical system and let $\langle \nu \rangle$ be its encoding. We say that a dynamical system U is linear-time universal if, for any ν, we have
$$\tau_U(\langle \nu \rangle\, x \Rightarrow a) \le c_\nu\, \tau_\nu(x \Rightarrow a)$$
for some constant $c_\nu$ that depends on ν but not on the input x.
Since for any dynamical system ν there is a Turing machine $M_\nu$ which simulates ν in the same proper time, to satisfy the definition it is enough to verify that, for any Turing machine M, we have
$$\tau_U(\langle M \rangle\, x \Rightarrow a) \le c_M\, \tau_M(x \Rightarrow a),$$
which is generally easier to verify. This observation trivially gives us the following.
Corollary 1.
There exists a Linear-Time Universal Dynamical System.
Proof. Any linear-time universal Turing machine is a dynamical system and by definition can emulate other Turing machines in linear time. □
3. Universal Solvers
We are now finally ready to introduce universal solvers, which are our main focus. A universal solver is a dynamical system that can efficiently find a solution to an arbitrary problem, if one exists. We formalize it as follows.
Let
be a computable function. We say that
y is a witness of
x if
. A universal search program is any program
S that, provided with an oracle for
f and an input
x, terminates with outputting
y such that
(we generalize this to continuous rewards in
Section 6):
If
y does not exist, the program is allowed to terminate with an error or not terminate at all. Generally, together with the input
x, we may also pass a description of the objective
so that the search program is not blind. To keep the notation uncluttered, we leave this additional input implicit.
It is always possible to find a witness for any problem by enumerating all possible y in a dovetail fashion and checking each with the oracle for f. However, we are interested in search programs that are as efficient as possible.
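As a reference point, the dovetailed brute-force search can be sketched as follows; the verifier passed in is a stand-in for the oracle for f, and the toy verifier in the last line is purely illustrative:

```python
from itertools import count, product

def brute_force_search(f_oracle, alphabet="01"):
    """Enumerate candidate solutions y in order of increasing length and
    return the first one accepted by the verifier. Guaranteed to find a
    witness if one exists, but takes time exponential in its length."""
    for length in count(0):
        for chars in product(alphabet, repeat=length):
            y = "".join(chars)
            if f_oracle(y):
                return y

print(brute_force_search(lambda y: y == "101"))  # toy verifier; prints 101
```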
Definition 4
(Universal Solver)
. A dynamical system U is a universal solver
system if for any objective and any other system A that solves the problem—i.e., for all x, with —we have
That is, for any task, a universal solver is at most a constant factor slower than the best possible system
A that solves that particular task. The existence of a universal solver is non-trivial. Levin introduced the notion of universal search, as well as a sketch of the existence of such a system, in the same paper that introduced the notion of NP-complete problems [2]. Solomonoff later realized its importance for machine learning and provided a detailed proof [3]. With the formalism we introduced, the existence proof is straightforward and can be generalized to any stochastic system, with Turing machines as a special case.
Theorem 6
(Existence of Dynamical System Universal Solvers)
. Let m be any distribution over program encodings from which we can sample. Then, there is a dynamical system such that for any solver A,
In particular, is a universal solver with constant .
Proof. Let U be any linear-time universal system as in Definition 3. Construct a composite dynamical system as follows. First, given x, use m to sample a program encoding, call it , and append it to the input to get . Then, run the universal system U to execute on x.
Let
A be any algorithm that solves the search problem. Then,
, giving us a path
from the input
x to the solution
y. Applying the submultiplicativity of
to this path from Lemma 2, and using the definition of universal dynamical system (Definition 3), we have
Note that by definition of
we have
(where we assume that the entire program
is sampled in one step). Replacing this identity in the above, we get
which gives the desired time bound. □
The construction above instantiates a particular universal solver that first ‘guesses’ a program that may solve the task and then executes it. Of course, in general, a universal solver need not make a one-shot guess: human problem-solvers do not blindly guess an algorithm and execute it but rather interleave partial computations, observations, backtracking, and shortcuts. Our general stochastic dynamical-system view already subsumes such interactive behavior. Nonetheless, the search algorithm presented is universal (in the sense that no other can be significantly faster) and already allows us to make some general observations:
Speed of a universal solver. For any solver
A that succeeds on
x, the universal solver above solves the problem in time:
That is, the slowdown with respect to an arbitrary solver is the simulation constant
times the inverse prior weight of the right program. This bound highlights two levers for learning. The term
depends on the code length
: if tasks reuse a small set of subroutines, reshaping
m to give these short codes yields exponential gains (we return to this in
Section 4). The factor
reflects how many steps our base dynamics spend simulating a single step of
A; when particular transition patterns recur, we can ‘macro-step’ them—effectively adding shortcuts in the dynamics—to shrink
. These ideas extend beyond the guess–execute prototype to any universal search system. We will study them in detail in later sections.
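To make the guess-and-execute prototype concrete, here is a minimal Python sketch of Levin-style search over a finite program table. The table, its step-bounded call interface, and the per-phase budget 2^(k − ℓ(p)) are illustrative assumptions; a real implementation would sample codes from the prior m:

```python
def levin_search(programs, f_oracle, max_phase=20):
    """In phase k, run every program p with code length l(p) <= k for
    2**(k - l(p)) steps. A program of length l that finds a witness in t
    steps is discovered after roughly t * 2**l total work, matching the
    bound of Theorem 6 up to constants.
    `programs` maps code strings to callables run(steps=...) that return a
    candidate solution, or None if the step budget was insufficient."""
    for k in range(max_phase):
        for code, run in programs.items():
            if len(code) <= k:
                y = run(steps=2 ** (k - len(code)))
                if y is not None and f_oracle(y):
                    return y
    return None

toy_programs = {
    "0":  lambda steps: "101" if steps >= 4 else None,  # needs 4 steps
    "11": lambda steps: None,                           # never solves it
}
print(levin_search(toy_programs, lambda y: y == "101"))  # prints 101
```

Note how the short program "0" receives exponentially more budget per phase than the longer "11", mirroring the 2^{-ℓ(p)} prior weight in the theorem.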
Universal Search and Universal Computation. From the proof of Theorem 6, we see that starting from a universal system, we can easily construct a time-optimal universal search system. The following straightforward theorem shows that the converse also holds: for a system to be a universal solver, it needs to be a universal computation system.
Theorem 7
(Universal Search ⇒ Linear-Time Universal System). Let U be an optimal universal search program. Then, it is also a linear-time universal dynamical system.
Proof. Let
M be a Turing machine. Construct the function
if
and 0 otherwise. By the definition of optimal universal search, we have
, which implies
by construction. Moreover,
therefore, it is linear-time universal. □
The claim is straightforward but it has an important implication:
if we train a model to solve a sufficiently general set of tasks, the model will necessarily learn to simulate a universal Turing machine. Whether this has already happened in the current generation of LLMs pretrained on language has been hotly debated. Depending on the exact setting and on access to tools or external memory, one can argue one way [24,25,26] or the opposite [27,28]. Our result establishes beyond debate what is possible.
Optimality through meta-learning. An important property of a universal search system is that it is necessarily optimal, meaning that no other universal search system can be arbitrarily faster:
Lemma 3.
Let U and be two universal search systems. Then, for any task f, we have
where the constant does not depend on the task f or the input x.
Proof. Since by universality
finds the solution to the task
f, we can take
in the definition of universal search, giving us
Hence,
U is at most
times slower than
, where
does not depend on the task
f. □
The proof is a trivial manipulation of the definitions, but it underlies a key concept that in modern terminology would be called meta-learning. Let us use the particular universal system of Theorem 6 to make the point explicit. For this system, the time required to solve a task depends on , the encoding length of its optimal solution. It is a priori possible that a system may achieve a better time on some tasks by learning a better encoding specific to them. However, U can simply search for (meta-learn) the solver and use it to solve the task, leading to a slowdown of at most . In practice, the constant is too large, and we need to amortize it through learning, which is our focus for most of this work.
Universal Solvers and Sampling
By Theorem 5, we can convert a stochastic system to a deterministic program that finds a solution in time . However, this program is not obtained by naively sampling a random trajectory up to completion, as one may be tempted to do. In fact, doing that would have an infinite expected runtime:
Example 3.
Let ν be any computable prior that gives non-zero mass to all programs (e.g., the Universal Prior); then, , even assuming that we have an oracle preventing us from running algorithms that do not terminate. To see this, consider the program A that computes and runs for steps before terminating. Then, and there are infinitely many such programs contributing to the expectation.
This highlights an important principle: if we have a way to guess a possible solution, in general it is not time-optimal to keep generating guesses and testing them. For an LLM, this means that sampling CoT traces until one succeeds is not a good idea. Rather, we need to keep open multiple possibilities and smartly allocate a time budget between all of them. To add some color, imagine trying to prove a theorem. You will likely start with the most likely guess, but if it starts to take too long with no solution in sight, you will try spending some time on another approach and perhaps come back to the original approach later.
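The budget-allocation intuition above can be sketched as follows: keep a pool of sampled hypotheses alive and double the work given to each as it ages, rather than running any single guess to completion. The `sampler`/`step`/`is_done` interface is an assumption for illustration:

```python
def interleaved_search(sampler, step, is_done, rounds=30):
    """Each round draws one fresh hypothesis and advances every live
    hypothesis by a number of steps that doubles with its age. Old,
    promising hypotheses keep receiving budget while new guesses are
    continually injected, so no single expensive guess can block progress."""
    pool = []  # pairs (hypothesis_state, age)
    for _ in range(rounds):
        pool.append((sampler(), 0))
        for i, (state, age) in enumerate(pool):
            for _ in range(2 ** age):
                state = step(state)
                if is_done(state):
                    return state
            pool[i] = (state, age + 1)
    return None

# Toy dynamics: start at 0, add 1 per step, stop at 10.
print(interleaved_search(lambda: 0, lambda s: s + 1, lambda s: s >= 10))  # prints 10
```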
The construction in Theorem 5, which achieves on a deterministic system, can be made into a stochastic algorithm. That construction hinges on keeping multiple hypotheses alive at the same time and continuing to explore them with increasingly larger budgets. What prevents us from having a system that achieves the same expected time by sampling individual trajectories?
We have seen before that such a system cannot sample programs directly from as the expected time could easily be infinite. A good guess is that we need to sample from the following distribution (closely related to Schmidhuber’s Speed Prior [29] and to the prior of Filan et al. [30]):
where
Z is the normalization constant; the distribution prioritizes programs with short running times. This is indeed the case.
Theorem 8
(Time-Weighted Sampling)
. Let ν be a universal search system. If we sample trajectories from
and run them to completion, the total number of operations we need to perform before finding a solution is
Proof. Let
be a trajectory solving the task and let
denote the number of iterations before
is sampled. In expectation, we have
. We now need to compute how much time is spent validating each of the
samples. The expected time that we need to spend validating a single sample from
is
so the total time we need to spend validating the
is
which gives the desired result. □
Hence, a universal search algorithm that only wants to consider one guess at a time has to learn how to sample from , which means that in addition to estimating the probability that a solution is correct, it should also be able to predict the time it will take to run it.
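A crude way to approximate sampling from the time-weighted distribution, given trajectories already drawn from ν together with their observed runtimes, is self-normalized importance resampling with weights 1/t(h). This is a sketch of the idea, not an efficient implementation:

```python
import random
from collections import Counter

def time_weighted_resample(trajectories, runtimes, n_samples, seed=0):
    """Resample trajectories h ~ ν with weight w(h) = 1 / t(h), so the
    resampled distribution is proportional to ν(h) / t(h)."""
    rng = random.Random(seed)
    weights = [1.0 / t for t in runtimes]
    return rng.choices(trajectories, weights=weights, k=n_samples)

# Two guesses drawn equally often from ν, with very different runtimes:
draws = Counter(time_weighted_resample(["fast", "slow"], [1, 100], 1000))
# The fast trajectory now dominates (~99% of the resampled draws).
```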
The distribution
is actually computable (the time
may be undecidable, but to upper bound
within
, we just need to show that
and, hence, run for
steps). However, in [30], it is shown that takes double-exponential time in to approximate, and this essentially requires running multiple programs, which we want to avoid in the first place.
Hence, the only option left, if we want to avoid searching over trajectories, is to train a system to approximate both the likelihood of a solution and its time cost. While this will not be our focus, the following importance-weighted training scheme gives one way to do so.
Theorem 9
(Importance-Weighted Training)
. Let ν be a dynamical system. Let be a batch of trajectories sampled from ν. Then, the distribution minimizing
is exactly .
4. Scaling Laws for Speed
By the definition of a universal solver, given a function
and an input
x, there is a trajectory
h finding a witness
y if such a witness exists. Let
h be the shortest such trajectory, i.e., the one with minimal
. The total time required by the universal search system to find it is
where we define
to be the compression length of the trajectory using
. How do we reduce the search time
? We could reduce the thinking time
by learning to skip some steps to get directly to the solution. But the largest improvement will come from reducing the exponential factor
. This is the time needed to
guess the correct solution to the problem. Thanks to Shannon’s Coding Theorem, we can improve the probability of guessing the solution, thus speeding up the search, by instead finding a way to
reduce the compression length of
h. We can do this by learning from a dataset
D.
For example, suppose we have a list of programs that have worked well in the past. If we notice that some pieces of code tend to appear frequently (say, the code to compute an FFT), we could change the encoding to replace those pieces of code with a unique name. This reduces the length of those programs, making them more likely to be sampled. Not only that, but any program reusing those components is more likely to be guessed in the future.
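As a toy illustration of this re-encoding, replacing a recurring piece of code with a short name shrinks the code length by Δ bits and, by the argument above, makes the program 2^Δ times more likely to be guessed. The snippet, the `<FFT>` token, and the 8-bits-per-character cost are all illustrative assumptions:

```python
def code_length_bits(program, bits_per_char=8):
    """Naive code length: fixed-width encoding of the program text."""
    return len(program) * bits_per_char

def compress(program, library):
    """Rewrite a program using named library subroutines."""
    for name, body in library.items():
        program = program.replace(body, name)
    return program

library = {"<FFT>": "def fft(x): ..."}           # a learned, reusable subroutine
prog = "y = def fft(x): ...; z = def fft(x): ..."
short = compress(prog, library)                   # "y = <FFT>; z = <FFT>"
delta = code_length_bits(prog) - code_length_bits(short)  # 160 bits saved
# Sampling the compressed program is 2**delta times more likely.
```

Note that the saving is paid once per occurrence, so programs that reuse the subroutine many times benefit the most.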
Another example to add color: suppose that while proving theorems, we often use the same sequence of steps. We will probably want to turn it into a named theorem—e.g., “Cauchy–Schwarz inequality”—which will also make us more likely to try it in future problems. In this spirit, let us crystallize this in the following:
Theorem 10
(Better compression ⇔ Faster search). For a universal search system with model ν, improving the compression of a trajectory h by Δ bits accelerates its discovery by a factor of .
Let us now formalize what learning from data means. Given some data
D, we denote by
the negative log-likelihood given by the model to a trajectory after observing dataset
D. One possibility is that we
train on data
D. In this case, we assume that
is a parametrized family of distributions. Let
be the parameters obtained after training on
D. Then, we define
as the likelihood given to
h by the trained model. Alternatively, we can do
in-context learning (ICL) or
prompting where we feed data
D to the model to obtain state
, and then we set
the likelihood of the trajectory after having seen the data. It could also be that the model
has a way to retrieve information from
D, a process known as
retrieval-augmented generation (RAG). Any mix of these methods may be used (some data is used to train, some to prompt, and some for retrieval). While different in implementation, from a theoretical perspective, there is no fundamental difference between these ways of using past data, which is why we can write generically
.
In our setting, the benefit of learning is not measured by better accuracy—since we have a verifier, sooner or later, we
will find a correct solution—but rather by the reduction in search time. The speed-up factor achieved after training on the data is given by the ratio
where we define the
-algorithmic mutual information:
So, the speed-up of universal search is given by the algorithmic mutual information between inference-time trajectories and past data. We highlight this in the following.
Theorem 11
(Information is speed)
. The log-speedup of a search algorithm after seeing data D is
Remark 2.
Theorem 11 follows easily from the straightforward algebraic manipulation in Equation (17). This is possible thanks to the work we already did in defining and validating the notion of proper time, which connects the likelihood of trajectories in a dynamical system (which relates to information) with the time the system needs to find a solution.
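In practice, the quantities in Theorem 11 are differences of code lengths (negative log-likelihoods in bits), so the speed-up can be read off directly from model losses; the numbers below are illustrative assumptions:

```python
def log_speedup(nll_prior_bits, nll_after_data_bits):
    """Theorem 11 as arithmetic: the log2 speed-up of search equals the
    drop in the model's code length for the trajectory h after seeing the
    data D, i.e. the mutual information between h and D in bits."""
    return nll_prior_bits - nll_after_data_bits

bits_saved = log_speedup(nll_prior_bits=120.0, nll_after_data_bits=90.0)
speedup = 2.0 ** bits_saved  # 30 bits of shared information -> ~1e9x faster search
```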
We are interested in
representing very good compressors (since we want to minimize
). Asymptotically, the best compressor is the universal prior
, for which
becomes
the algorithmic mutual information [5]:
While we are interested in
, we can use
as a proxy of what is the best we could achieve asymptotically. The advantage is that
has a number of theoretical properties that make it easier to work with.
The key question now is as follows: What is the maximum possible log-speedup we can get from learning? As it turns out, the answer is not straightforward and depends on some key assumptions about how real data works. Let us get there step by step.
The trajectory h is the trajectory of an optimal solution to a task (e.g., the optimal CoT to get to a solution, or the shortest execution trace of a program that solves the problem). Meanwhile, is presumably created by collecting examples of trajectories that optimally solved tasks in the past.
A first guess (often made in the Minimum Description Length literature) is that solutions to real-world problems tend to have low complexity. It therefore may make sense to hypothesize that
is sampled from the universal prior itself, which favors low-complexity solutions. What would be
in this setting? Disappointingly, we can show that
so the probability that past data D share substantial information with the solution h to the present task, and can therefore lead to a substantial speedup through learning, is vanishingly small. This does not bode well for the possibility of learning a fast universal solver.
To see what happened, it is useful to abstract a bit. Suppose we have a mechanism generating trajectories. Let be the data seen in the past (our training set), and let be the new data we are trying to find at inference time. This forms a graphical model:
where
q acts as a separator between past and future data. By the Data Processing Inequality [4], this implies
That is, since
is sampled i.i.d. from
q, the only information that past data
D can provide about
is information about
q itself, and this cannot be larger than its description length
.
Theorem 12
(Maximum speedup is bounded by world complexity)
. The maximum speed-up we can obtain using data generated by process q is
Since the universal prior m has low Kolmogorov complexity , there is nothing we can learn from it. (This may be slightly confusing: m can generate programs of arbitrary complexity, but its own complexity is low; in fact, we need just a few lines to define it). More generally, whenever the data is generated by a low-complexity distribution, no matter how much data we observe, we will never obtain more than a constant-factor speed-up.
This gets to a key question about what is the scaling law of information for real world data. To study it further, it is useful to reframe the question a bit. We have been thinking of q as a mechanism that generates i.i.d. samples of trajectories. This may be restrictive. More generally, let q be a dynamical process generating a sequence of tokens. Let and be an initial sequence of length n and its continuation of length m. We can think of as our training set (past data) and as our test set (future data we are trying to predict). It may be useful to think of and as being natural language text or code.
We want to know how
scales when
. Let us suppose
q is a finite-dimensional Markov process with a discrete
D-dimensional hidden state
over some alphabet
S. What information can
provide about
? By the Markov hypothesis, the only information that
can provide to help predict
are the parameters
of the underlying process and the final state
, so we have
where
is the number of parameters of the process and
c is how many bits we need to encode the parameters. Again, we find that for a very large class of processes,
is bounded by a constant, and, asymptotically, there is nothing to learn as long as (i) the parameters of the process are finite-dimensional and (ii) the size of the state is bounded (or, equivalently, the process has finite or fading memory).
4.1. Hilberg’s Law for Scaling
Is this what happens on real data? A particularly well studied case is when the process
q generating the data is a human writing natural language text. In the special case that
,
Hilberg’s law [12,31,32,33,34,35] posits that
for some
. This is in sharp contrast with the results above. If Hilberg’s law holds (which, empirically, it does [36]), then the process generating real data is very unlike any standard dynamical process. (An unrelated consequence is that a pure LLM implemented by a state-space model or an attention model with finite context cannot possibly be a perfect model for natural language. Since its state is bounded, it satisfies Equation (24) and cannot asymptotically scale like natural text. However, RAG sidesteps the issue, so an agent with external memory can be a model of language, or of the world, in ways in which an ordinary Transformer cannot, no matter how many parameters it has and how much data it is trained on).
Since we care about real data, let us introduce the following generalized Hilberg’s law (GHL) scaling to arbitrary n and m and take it as our assumption of how physically generated data, including human-generated data, behave.
Definition 5
(Generalized Hilberg’s law)
. Let and be data generated by a stochastic process. We say that it has GHL scaling if
This expression reduces to the standard Hilberg’s law when
. It is symmetric and always positive. (Define
and
. The function
is convex, so
). To get an intuition of how a process may satisfy the GHL, in
Section 4.4, we will show one explicitly based on the
Santa Fe process [34,35]. The key intuition will be that the GHL is satisfied whenever the “world” (whatever is generating the data) contains an unbounded amount of unchanging
facts that are referenced in the data with a heavy-tailed distribution. For now, let us assume that our process satisfies the GHL and derive the scaling laws for the speed-up of a universal search agent.
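The GHL scaling can be written as a small helper. The functional form below is a natural reading of Definition 5, chosen so that it is symmetric and nonnegative, reduces to the standard Hilberg scaling when n = m, and saturates for n ≫ m; the parameters β and c are illustrative assumptions:

```python
def ghl_information(n, m, beta=0.5, c=1.0):
    """Generalized Hilberg's law sketch:
    I(n, m) = c * (n**beta + m**beta - (n + m)**beta).
    Since x**beta is subadditive for 0 < beta < 1, this is symmetric and
    nonnegative, and for n == m it scales like n**beta (standard Hilberg)."""
    return c * (n ** beta + m ** beta - (n + m) ** beta)

# For a huge training set (n >> m), the information saturates near c * m**beta:
# ghl_information(10**8, 100) is close to 100**0.5 == 10.
```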
4.2. Scaling Laws for Time
Assume the training data
and the inference data
are generated by a process satisfying the GHL:
We are interested in the case where
is the training set, so
, in which case, we can approximate
From Theorem 11, the log-speed-up we get from training is exactly
and
is the length of the inference time trajectory. Therefore, we conclude as below.
Theorem 13
(Time-Scaling Law)
. The log-speed-up we obtain by training on a large enough dataset D of n tokens is
This tells us a few interesting things. First, the speed-up is upper-bounded not by a constant (as we previously obtained for simple models) but by
. That is, the longer the trajectory is, the more it is sped up by learning. This makes intuitive sense: if finding a solution required just a few steps, even without any learning we could have brute-forced it quickly. Complex problems are the ones that benefit the most from learning. We also get
convergence to the optimal speed-up, so we want the number of training tokens
n to be
where
L is the maximum length of the trajectory we expect to need to solve a problem. That is, we need more training tokens if we expect to solve challenging problems.
The parameter relates to the complexity of the task distribution; in particular, it controls how long-tailed the distribution of ‘useful facts’ is, with implying that the distribution is very heavy-tailed. When is high, we need significantly more training tokens to achieve the optimal rate, since there are many more commonly used facts. But we also get a better payback, since the speed-up will be larger.
Remark 3.
For natural language, [36] reports , which gives . So, if we are going to generate trajectories of 10K tokens, we need ≈100K training tokens. This ratio of test to train data is realistic when fine-tuning a model for reasoning, but when training from scratch, it is a clear underestimate. There are a few factors to consider. First, the initial training data is needed to put the weights in a proper configuration, which depends more on the number of weights than on the information in the training data. Indeed, it is common to pretrain on lower-information-content data with size proportional to the number of weights. Second, we are assuming that the mechanism generating the test data is the same as the one generating the training data, which is not the case. Facts that are useful at test time may appear very rarely in the training set (e.g., if we ask PhD-level questions to a model trained on generic data). Third, the scaling laws are derived under the assumption that we can identify useful facts and memorize them the first time we see them. Realistically, however, we need to see a fact multiple times to identify it as useful, which inflates the number of required tokens.
4.3. Memory-Time Trade-Off
So far we have assumed that we can use all the information in X, but, in practice, the available memory M may be a bottleneck. On a dataset of length n, there are facts to memorize, requiring bits of memory. Replacing n with M through this relationship in Theorem 13, we get the scaling when memory (rather than n) is the bottleneck.
Corollary 2
(Time–Memory Scaling Law)
. Assuming memory is used optimally, the speed-up as a function of the memory used is given by
However, this assumes that we are somehow able to extract the most useful facts from the training data and store them (and only them) in memory. Since we are using an online learning algorithm, the memory also needs to store information about facts in the training data that we have not yet deemed useful, because we must see them again to confirm whether they are useful.
Proposition 1
(Online Memory Overhead)
. An online agent needs a constant factor of additional memory compared to an offline one to achieve the same performance. This reflects a realistic issue: it is easier (i.e., faster) to learn from a textbook that directly gives us the useful facts (offline learning) than to ‘connect the dots’ and guess the useful facts from online experience.
Prompting and RAG
So far we have focused on the speed-up provided by training on dataset
D of past data. What is instead the effect of adding prompt
p to the request? First, note that the key result,
where dataset
D is replaced by prompt
p, remains valid, so the speed-up is still determined by the
-algorithmic mutual information between the prompt and the trajectory.
If the prompt is an in-context learning prompt, which provides examples of the task, then the theory is identical to the case of a dataset (effectively, the prompt is a dataset). However, we expect it to provide much more algorithmic information per-sample than pre-training dataset D since presumably it will contain only examples directly relevant to the task.
The prompt could also contain information directly relevant to the trajectory, which does not follow a GHL scaling law. For example, if the prompt is a
plan describing exactly what to do, then
and we get the maximum possible speed-up, meaning that the time to execute the search becomes merely
, the minimum possible time required by a trajectory to solve the task.
Alternatively, a prompt p may not specify the whole trajectory, but all the information that it has may be relevant to the trajectory; that is, . In this case, we get a significant speed-up . For example, just 10 good bits of prompt (a few tokens) can reduce the time to find a solution by ∼1024 times. We can think of this as a useful hint (“Solve the following task. The following technique may be useful: …”) that brings down the time to solve a problem from hours to minutes.
4.4. Example of GHL Scaling: Santa Fe Process
So far we have assumed that our data generating process satisfies the Generalized Hilberg’s law scaling
, and we anticipated that this relates to having an unbounded collection of facts that appear in the data with a long-tailed distribution. Following [31], we now explicitly construct such a process, showing that the GHL scaling definition makes sense and how exactly ‘facts’ relate to scaling.
Let
be an infinite set of binary properties that are sampled
before any text is generated. We can think of them as facts about the world, which may be referenced in the text. Importantly, since the
are sampled only once at the very beginning and do not change over time, once a fact is first encountered, we know its value in any future text. (An alternative view is that the process has extremely long memory: after the first time it generates a value for
, it remembers it and reuses it at all later times). Some facts are referenced very often, others very rarely. Empirically, natural frequencies are well captured by a Zipf power law:
for some normalization factor
c.
To generate a sequence
X, we concatenate the index of a random fact and its value:
where
. We generate
Y similarly.
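A minimal simulation of this process can be sketched as follows; truncating the infinite fact set at `n_facts` and using Zipf exponent 1/β (so that roughly n^β distinct facts appear among n references) are implementation assumptions:

```python
import random

def santa_fe(n, beta=0.5, n_facts=10_000, seed=0):
    """Sample n (index, value) pairs of the Santa Fe process. Fact k is
    referenced with Zipf probability proportional to k**(-1/beta), and the
    binary value z[k] of each fact is fixed the first time it is referenced
    and reused forever after."""
    rng = random.Random(seed)
    weights = [k ** (-1.0 / beta) for k in range(1, n_facts + 1)]
    z = {}      # lazily sampled, immutable facts about the 'world'
    seq = []
    for _ in range(n):
        k = rng.choices(range(1, n_facts + 1), weights=weights)[0]
        if k not in z:
            z[k] = rng.randint(0, 1)  # first reference fixes the fact
        seq.append((k, z[k]))
    return seq
```

Counting the distinct indices in a sample of length n grows sublinearly, which is the source of the n^β term in the compression cost computed in the proof of Theorem 14.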
Theorem 14
(Santa Fe Process GHL Scaling)
. The Santa Fe process described above follows GHL scaling: Proof. We now want to show that for this process,
follows GHL scaling. Let us first rewrite
To compute the compression cost
of
X, the key observation is that we only need to encode the value of a fact the first time we see it (since it remains constant). How many unique facts appear in
X? Asymptotically, this is given by
Using this, the compression cost
is the cost of encoding the
n random indices (using
bits per index), plus the cost of encoding the
unique properties:
The cost of
and
are computed similarly. Putting all together, we get the desired result:
□
This construction is quite artificial—real-world data are not a stream of random facts, indices, and immutable binary values. But it does highlight, in a simple way, some key and more fundamental issues: First, the process has
infinite complexity since it needs to store the value of all the
. Recall that by Theorem 12, this is exactly the setting where learning can make the most difference, since we can obtain arbitrarily large speed-ups. Second, the distribution of facts has to follow a power law. Some facts (“
the color of the sky”) are more referenced than others (“
the house number of John Doe”) and power laws are abundant in real data and have several theoretical justifications (rich-get-richer effects [37], least-effort/max-entropy trade-offs [38], monkey-typing with intermittent silences [39,40], etc.).
In terms of reasoning traces, one may think of ‘facts’ as functions or theorems that one may invoke by calling their name (index). Since functions and theorems are reused and always remain constant, after memorizing them we can significantly compress future reasoning.
This intuition has motivated a significant volume of research on universal solvers [3,41,42,43], program synthesis [44,45,46,47], and reinforcement learning [48,49].
5. Inversion of Scaling Laws
So far we have established that learning from data leads to a speed-up in finding the solution to an unforeseen task. However, the equivalence
tells us something stronger: we learn to be faster
if and only if we learn from data. This suggests that we can learn something from data if and only if we train with a time optimization objective.
Let us work through an example. Suppose we want to train a universal solver, and (as is natural) we use as reward function the expected number of correct solutions, determined by a function
R, averaged over some distribution
of tasks:
Further suppose that our agent has unlimited computing power available, so that it has no need to economize on computation. What will such a system learn?
If the distribution of tasks is generic enough, we know that the system has to learn to perform universal computation (Theorem 7). But that is the only thing it needs to learn. Having universal computation, it can implement the basic Solomonoff–Levin universal search algorithm, which will always find the solution to the task, thus achieving maximum reward. It will take eons to find the solution, but since computing is free for this agent, that is not a problem.
To further clarify, suppose we want to teach the model to play chess. Training is not necessary to achieve a better reward, since a standard tree-search over all the possible moves will eventually find the best move to make. Training is required only to reduce the time that it takes to find the best move.
Remark 4.
Only time-bound systems learn. If a system is not penalized for the time it takes to find a solution to the task, it is optimal to always brute-force a solution without learning anything. Vice versa, any system that optimizes time has to learn at least bits of information from the data.
Going more in depth, we can look at how we expect
to behave as we scale the model. First, note that as we scale the maximum time allowed for a trajectory, we also usually want to jointly scale the number of weights of the model [10]. So, if
T is the maximum time for the trajectory, the number of weights will be some monotone function
. Note that the number of weights puts a constraint on the maximum amount of information
about the data that we can store in the model parameters. But since
, this also puts an upper-bound on the per-trajectory information
. Next, let us look at how much information the model is forced to capture if it wants to have perfect performance on the task. We need
, so we need to store enough information to speed up the search until it takes less than
T total time. This means that
so as expected from the discussion before, the amount of information we
need decreases as
T increases. Putting the two bounds together, we obtain the curve for
, shown in
Figure 1.
As we scale the model, when the inference time budget and the number of weights are small, the model learns as expected, acquiring information from the data and storing it in the weights. The expected reward it obtains steadily increases as more problems become solvable within the allotted time budget. At some point, the amount of information the weights can store is large enough that, thanks to the speed-up, all trajectories are solvable within the time budget. At that point the reward is always optimal and stabilizes. But, paradoxically, if we further increase the time budget, the model can rely more on brute-force search, and the information it needs to acquire starts decreasing until it reaches zero. This is the savant regime, where the model can default to expensive brute-force search with no learning whatsoever, yet still achieve the optimal reward. In this regime, optimal performance (orange curve) comes from excess capacity rather than “insight,” as measured by learned algorithmic information (blue curve).
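The qualitative shape of the two bounds can be illustrated with a toy model. Here the functions `capacity` and `needed` are hypothetical choices for illustration only: we assume storable bits grow logarithmically with the time budget, and that each stored bit halves the remaining brute-force search.

```python
import math

K = 30.0  # toy value: bits of algorithmic information needed to solve instantly

def capacity(T):
    """Toy capacity bound: weights (hence storable bits) grow with time budget T."""
    return 2.0 * math.log2(T)

def needed(T):
    """Toy requirement: brute force with budget T removes ~log2(T) bits
    from what must be learned to solve within the budget."""
    return max(0.0, K - math.log2(T))

for T in [2**5, 2**15, 2**25, 2**35]:
    stored = min(capacity(T), needed(T))  # the learned-information curve
    print(f"T=2^{int(math.log2(T))}: learned information = {stored:.1f} bits")
```

Running this shows the learned information first rising with the budget, then falling back to zero once brute force alone suffices, mirroring the rise-then-collapse into the savant regime described above.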
To avoid entering the savant regime, one option is to penalize time using a reward like
$$R_\gamma(y) = R(y) - \gamma\, t(y),$$
where $t(y)$ is the time spent producing the solution $y$.
This forces the model to actually learn algorithmic information. The training of many current large reasoning models [
50,
51] includes similar objectives that maximize reward while minimizing inference time. The results above suggest that this regularization not only reduces inference cost—a common motivation for penalizing time—but also improves the quality of what the model learns.
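As a toy illustration of why such a penalty forces learning (the linear penalty form and the step counts below are assumptions for illustration, not taken from any cited system):

```python
def penalized_reward(reward, t, gamma=0.001):
    # Net reward after charging gamma per unit of inference time.
    return reward - gamma * t

# Both solvers eventually find a correct solution (reward 1.0), but the
# brute-force one takes ~2^20 steps while the trained one takes ~100.
brute_force = penalized_reward(1.0, t=2**20)
learned     = penalized_reward(1.0, t=100)
print(brute_force, learned)  # only the learned solver keeps most of the reward
```

Under the unpenalized objective both solvers tie at reward 1.0; under the time-penalized one, only the solver that has stored enough information to search quickly retains a positive net reward.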
Note the sharp contrast with information-theoretic regularizers. In a classic machine learning setting, we want to maximize [52]
$$\mathbb{E}\big[\log p(D \mid w)\big] - \beta\, I(w; D).$$
That is, we want to
minimize the amount of information in the weights [
6]. This ensures generalization, and is the basis of the Minimum Description Length (MDL) principle. For reasoning, however, we want to
maximize the information in the weights in order to minimize time. In transduction in a verifiable setting, there is no issue of generalization, since we have access to all relevant data and a verifier that tells us whether the task has been solved. More generally, the trade-off is not trivial.
Consider a real biological agent that has to model a physical environment in order to act. The agent may learn the correct, generalizable, underlying physical rules that would be optimal from an MDL perspective. But if applying those rules requires a large amount of computation, the agent may not be able to use them in a reasonable time for survival. It may instead be optimal for the agent to memorize several case-by-case rules (e.g., a feather falls with a certain speed) even if they do not generalize (a feather does not fall with the same speed in a vacuum), if they allow it to quickly come up with approximate, but timely, estimates that are better for survival than the best estimate rendered belatedly.
It is common to refer to transduction as
System-2 behavior and induction (restricted to a single forward pass) as
System-1 behavior [
53]. Encoding reasoning traces into the weights can then be thought of as
automatization: the common thinking patterns are made faster by moving them to
System-1. Whether this is advantageous depends on environmental stability. In stationary environments, automatization allows faster reaction and better energy usage. However, in time-varying environments, the ability to reason at inference time cannot be fully replaced by a set of fixed learned behaviors.
6. Maximizing a Continuous Reward
So far we have studied universal solvers under the assumption that the reward function has a binary “success/failure” value. In the more general case, the universal solver is required to maximize a continuous reward function. If we know the maximum achievable reward $R_{\max}$, and we know that this reward can be achieved, we can define a new binary reward function, “has achieved the maximum,” thus reducing to the binary case. However, generally, we do not know what the achievable maximum is and, more importantly, we do not know whether the search cost of finding a better solution will be worth the value. That is, a practical universal solver has to decide when to stop.
The objective underlying this decision is optimization of the user’s net gain. Consider a universal solver U that over time outputs various candidate solutions $y_1, y_2, \ldots$, and let $R^* = \max_i R(y_i)$ be the maximum reward achieved by any solution up to now. The value obtained by the user stopping at this point is
$$J = R^* - c\, T,$$
where $R^*$ represents the best reward achieved across all explored traces, $c$ is the per-token cost, and $T$ is the total compute time consumed until now. The question is whether investing additional compute to search for a solution is likely to improve J.
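The net gain and the marginal continue-or-stop comparison can be written directly; the function names and numeric values below are illustrative:

```python
def net_gain(best_reward, cost_per_token, tokens_used):
    """User's net value if we stop now: best reward minus total compute cost."""
    return best_reward - cost_per_token * tokens_used

def worth_continuing(expected_gain, extra_tokens, cost_per_token):
    """Continue only if the expected reward improvement exceeds the
    additional compute cost of the next exploration step."""
    return expected_gain > cost_per_token * extra_tokens

J_now = net_gain(best_reward=0.8, cost_per_token=1e-4, tokens_used=2000)
print(J_now)                               # approximately 0.6
print(worth_continuing(0.15, 1000, 1e-4))  # gain 0.15 vs cost 0.1: continue
print(worth_continuing(0.05, 1000, 1e-4))  # gain 0.05 vs cost 0.1: stop
```

The hard part, addressed next, is estimating the expected gain of a continuation, which is where the forecasting model and the Pandora's box machinery come in.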
Deciding whether a reward function can be improved, let alone whether it is convenient to do so, is generally undecidable. Hence, we need to assume we have learned a
forecasting model to estimate the distribution of rewards for each potential continuation
h, enabling principled decisions about which reasoning paths to pursue [
18]. Under this assumption, we can formulate the search for optimal solutions as a
Pandora’s box problem with order constraints [
19,
20].
Universal search as Pandora’s box problem. We can visualize all possible computations to solve a task as a rooted tree $\mathcal{T}$. A node $n$ represents a partial reasoning trace; exploring an extension of a reasoning trace (i.e., a child of $n$) for $t$ tokens incurs a cost $c\,t$ and, if we terminate after the extension, yields a terminal reward $R(n)$. Nodes naturally obey order constraints: a child may be explored only after its parent.
Given the current best obtained reward $R^*$, the incremental value of exploring a new child node $n$ is
$$v(n) = \mathbb{E}\big[\max(R(n) - R^*, 0)\big] - c\, t_n.$$
While we could greedily pick the next computation to perform by maximizing $v(n)$, this generally leads to suboptimal solutions. Searching for the best possible strategy is, in principle, exponentially complex. However, Weitzman [19] showed that a simple greedy strategy does exist to optimally solve this problem. In particular, Weitzman defines a conditional reservation value for each candidate extension as the unique $\sigma(n)$ solving
$$\mathbb{E}\big[\max(R(n) - \sigma(n), 0)\big] = c\, t_n.$$
The resulting value of $\sigma(n)$ is called the Gittins index of the node. The provably optimal policy is then to visit at each step the unexplored node that has the currently highest Gittins index. It is instead optimal to terminate the search when
$$\max_n \sigma(n) \le R^*,$$
at which point there is no node that we can visit that is expected to improve the final objective J. That is, the expected improvement in reward does not compensate for the cost of exploring the node.
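Under the stated assumptions (known, independent reward distributions per box), Weitzman's reservation value can be computed numerically. The sketch below, with hypothetical helper names, uses a discrete forecast distribution per box and bisection, then applies the open-highest-index, stop-when-beaten rule:

```python
def reservation_value(rewards, probs, cost, lo=-10.0, hi=10.0, iters=60):
    """Solve E[max(R - sigma, 0)] = cost for sigma by bisection.
    `rewards`/`probs` give the discrete forecast reward distribution of
    the box; `cost` is the price of opening it."""
    def excess(sigma):
        return sum(p * max(r - sigma, 0.0) for r, p in zip(rewards, probs))
    for _ in range(iters):
        mid = (lo + hi) / 2
        if excess(mid) > cost:  # expected excess too high: sigma is too low
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def weitzman_search(boxes, best=0.0):
    """Open boxes in decreasing index order; stop once no index beats `best`.
    Each box is (rewards, probs, cost, realized_reward)."""
    indexed = sorted(boxes, key=lambda b: -reservation_value(b[0], b[1], b[2]))
    for rewards, probs, cost, realized in indexed:
        if reservation_value(rewards, probs, cost) <= best:
            break  # expected improvement no longer covers exploration cost
        best = max(best, realized)
    return best
```

For example, a box paying 1 with probability 0.5 at opening cost 0.1 has reservation value 0.8, so it is worth opening whenever the best reward so far is below 0.8.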
This framework can be extended [
20] to the case when there are constraints on the order of opening the boxes (e.g., testing a solution obtained after 1024 thinking tokens requires first reaching 512 tokens), in which case the value of a box also needs to take into account the value of the boxes it allows access to.
Making decisions with Gittins Indexes. Once we compute the Gittins index of each node using our forecasting model, we have a simple criterion to make several key decisions during our search. For example, it allows us to decide: (i) when to continue extending a reasoning trace (if the Gittins index of a child is the highest); (ii) when to branch a trace (if the child of an earlier parent node achieves the highest index), in particular when it is optimal to restart from scratch by sampling a new reasoning trace; and (iii) when it is optimal to stop attempting to improve the solution of a problem and return the current best (when no node has a better index than the current reward).
Need for a Forecasting Model. Exploring greedily based on Gittins indexes remains optimal as long as: (i) the reward distribution is known in advance and (ii) the distributions of different boxes are independent (the reward observed for one box does not affect the predicted reward of others). When this is not the case, using Gittins indices may not be optimal, but generally remains a strong policy [
54].
In general, however, we do not know
a priori the distribution of possible rewards that we may obtain. For example, we cannot know in advance that thinking for 1000 or 10,000 tokens will lead to finding a correct solution. In order to efficiently optimize the net reward
J, an agent also has to learn to
forecast both the cost of an exploration attempt and the probability of it improving over the current best solution [
18].
7. Discussion and Related Prior Work
The subject matter of this paper falls within the scope of statistical machine learning. Traditionally, one starts with instantiated data that implicitly define the task and arrives at a model that performs the same inference computation on all future instances of the same task. However, once trained models are used as generative distributions, as is customary in generative AI, they exhibit behaviors that were not explicitly encoded in the training data or the chosen loss. A key such behavior is the ability to perform variable-length inference computation, which increasingly often leads to solving previously unseen tasks. In classical (inductive) machine learning, there is no feedback mechanism at inference time, so one can only evaluate the quality of a model post-hoc, typically on data other than the data at hand. Agents, on the other hand, interact with the environment, which provides feedback, and/or can call tools to solicit feedback or spawn simulations at inference time prior to rendering an outcome. Inference computation can therefore adapt depending on the resulting feedback. This mode of interaction calls for a different approach to learning, which aims to empower transductive inference. The power of LLMs stems from the fact that, despite being trained inductively (despite the name, so-called unsupervised pre-training is simply next-token prediction, a standard multi-class supervised classification problem), they operate transductively, leveraging their chain-of-thought. In this paper, we explore the foundational principles of such transductive learning, and its limits, including bounds and power laws.
Transduction, In-context Learning, and Solomonoff Inference. Transduction in the form of learning jointly from labeled training examples and unlabeled test samples was championed by Vapnik [
1,
55,
56].
The dichotomy we present between transductive and inductive inference has long been considered in machine learning under various guises, such as the distinction between “learning to generalize” and “learning to reason”, between System 1 and System 2 [
53,
57], between
fluid intelligence and
crystallized intelligence [
58], or between learners and solvers [
57]. Our objective is not to further justify the need for transduction or reasoning, which has amply been done, but rather to contextualize it through the lens of LLMs as computational engines and provide scaling laws connecting time-to-solve with the learning of algorithmic information.
An early observation was that language models can learn multiple tasks implicitly through the unsupervised language modelling objective [
59] and exhibit diverse behaviors when adequately prompted. In-context learning [
60], which is a form of transductive inference, introduces demonstration examples into the model context to elicit desired behavior. It has been shown that LLMs can perform optimization algorithms such as gradient descent and ridge regression transductively at inference time from in-context examples [
61]. Ref. [
62] demonstrates that sparse linear functions, decision trees, and two-layer networks can be learned in-context. Ref. [
63] investigates what minimal pretraining is necessary to induce in-context learning, showing that a small pretrained model can achieve close to the Bayes optimal algorithm. Ref. [
64] introduces a theory of in-context learning where a hypothesis is formed at inference time, obtaining generalization bounds. Ref. [
65] demonstrates that a single self-attention layer trained by gradient flow to perform in-context learning converges to a global minimum that is competitive with the best predictor on the test distribution. The connection between in-context learning and Solomonoff inference was identified in [
66], where the authors attempted to learn the Solomonoff semimeasure directly by sampling programs and training on their outputs. In [
67], motivated by the inductive theory of Solomonoff [
7,
8,
68], the authors demonstrate that LLMs can outperform general purpose compressors, even for audio and visual data. The connection between Solomonoff induction and neural network optimization as a form of program search was mentioned earlier in [
69].
As we noted in the introduction, time plays no role in Solomonoff inference, nor in in-context learning. Neither involves actual “learning” in the classical inductive sense: the weights are fixed, and the same task, presented in-context multiple times, requires repeating the same effort to yield the same outcome each time. However, time plays a key role in learning transduction, which is the core motivation of this work.
Systematicity of Dichotomies. Several conceptual pairs recur throughout this work (
Table 1), which can, among other interpretations, be seen as projections of the distinction between inductive and transductive inference onto different aspects of a learning system.
In inductive inference, the goal is to compress past observations into a fixed map that generalizes to similar future inputs. The bottleneck is representational capacity (space/parameters); success is measured by accuracy on held-out data; simplicity aids generalization; and memorization of training-specific detail is harmful. Computation is a single feedforward pass (System 1), interpolating among previously seen patterns. Space-pressure (amount of information stored) is important to learn generalizable features rather than memorizing past results. A prototypical task is language modeling: reducing uncertainty on the next token given the statistical structure of past text.
In transductive inference, the goal is to reason about a specific instance at inference time, guided by a verifier or reward signal. The bottleneck is computational budget (time/tokens); success is measured by speed to a verified solution; and memorization of reusable algorithmic structure is beneficial (Theorem 11). Computation is variable-length chain-of-thought (System 2), extrapolating beyond the training distribution to solve novel tasks. Time pressure is essential for learning to occur (Theorem 4) and avoid savantry. A prototypical task is theorem proving: search for a proof that satisfies a checker.
The two regimes are not mutually exclusive. Automatization (
Appendix A) moves frequent reasoning patterns from the transductive to the inductive regime, trading space for time. Conversely, when the environment changes, previously automatized behaviors must be re-derived transductively. A complete agent must balance both, and the optimal operating point depends on the stability of the environment and the cost of computing.
LRMs, SLMs, VLMs, VLAs, World Models, etc. (nomenclature). The term LLM originally referred to large-scale Transformer-based models (pre-)trained as next-token predictors using large-scale corpora of natural language, then co-opted as probability distributions to sample new natural language text. Optionally, these models could be fine-tuned by scoring such generated expressions using human preference, an external reward mechanism, or the model itself through self-assessment. It is also common to call the outcome of the same process a ‘World Model’ (WM) if trained on sensory data such as video or audio instead of natural language, a ‘vision-language model’ (VLM) if trained on both, or a ‘vision-language-action’ model if the output expression is used to issue commands to an actuator, or ‘large reasoning models’ (LRMs) if they are used to generate variable-length trajectories prior to rendering the decision or action. In our nomenclature, any large-scale predictor trained on sequential data with latent logical/linguistic structure (with objects, relations, functions, etc.) develops an inner state space with an internal “Neuralese language” [
70]. Sensory data are replete with latent discrete entities [
71], their relations (topological, geometric, dynamic, semantic), and (de)composition into parts (meronomies) or abstract concepts (taxonomies). In our definition of LLM, therefore, where ‘language’ is not restricted to natural language and instead refers to any form of Neuralese, VLMs, WMs, VLAs, LRMs, and other variants are also LLMs. We also include in the term LLM models that use different architectures, so long as they have a ‘state’ (memory), whether explicit (as in state-space models) or implicit by co-opting a sliding window of data, as in autoregressive Transformers [
72]. Since the largest LLMs at this point comprise trillions of parameters, some now refer to models with merely billions of parameters as ‘small language models’ or SLMs. Obviously, ‘small’ is subjective, and these models have no architectural, structural, functional, or conceptual difference from their ‘large’ counterparts, so they too are just LLMs. Empirically, some emergent phenomena are only observed at scale, but this does not mean that there is a clear divider between ‘large’ and ‘small’, even phenomenologically since smaller models can still be distilled from larger ones and maintain their behavior even if it would not have emerged from cold-start using the same training protocol [
50].
Embodied AI. The results described in this paper pertain to both software agents that exist within the world of bits and embodied agents that interact with the physical environment. While this may seem counter to the forced dichotomy between LLMs and so-called World Models, once the sensory data is tokenized, the two become the same mathematically. Regardless of how a model is trained inductively, once it acquires the ability to perform transductive inference, it can act as an agent. This could be in the world of bits, where interaction with the surrounding environment is through APIs and function calls, or in the world of atoms, where sensors provide measurements that the agent turns into a representation of the environment (which is an abstract concept finitely encoded in the state of the agent [
73]) and operates on it (i.e., reasons) to command physical actuators that affect the environment. Such an environment then provides a feedback signal, ultimately in the form of “verification” (e.g., survival or rewards). The reasoning agents exist in a finitely encoded world and interface with the physical world through encoders and decoders. The core of all these agents is the ability to perform transductive inference within the discrete/discretized representation, which requires computation as described in this document. While evolution proves that processing sensory data is sufficient to foster the emergence of reasoning, language is already conveniently distilled (symbolized and compressed), making the traversal of the evolutionary path unnecessary for the emergence of reasoning. In this sense, agentic AI subsumes embodied AI, where the latter focuses on the source of the data (sampled physical sensory measurements) and on the commands issued to physical actuators (physical action).
Universal Computation, Universal Search, Algorithmic Information. Theoretical Computer Science has devoted decades to the development of universal algorithms; indeed, Levin’s paper that introduced his universal search algorithm seeded a large portion of the subsequent literature on computational complexity theory. Since we only use the concepts driving universal search, we do not review this sizable body of work here and refer the reader to any textbook on complexity theory. One exception we make is to comment on the literature of Kolmogorov Complexity and Algorithmic Information Theory, which is thoroughly covered in textbooks [
5].
While Kolmogorov’s theory is useful for certain asymptotic analyses, and we make heavy use of it in this manuscript, it is worth pointing out that it fails its original intent of canonically separating “information” from “randomness” with the concept of the Structure Function. In our approach, the information in a trained model is specific to the model and, hence, subjective.
Reinforcement Learning. Our formulation—searching for a solution y given a score function f—superficially resembles reinforcement learning (RL), where an agent maximizes cumulative reward through interaction with an environment. However, the goals differ in a key way. Standard RL seeks to learn a fixed policy for a particular environment. We instead want a universal solver to craft a new policy for tasks no agent has previously encountered. Put differently, a particular RL problem is one possible task f that our universal solver could be asked to address, but the solver itself is not specialized to any single such task. While RL considers time-penalized objectives, the time RL tries to minimize is the number of actions or observations, while we try to minimize the computing time used by the agent to decide which actions to take or to process the observation. The theoretical questions we focus on—how learning compresses the multiplicative constant of universal search and how scaling laws govern the resulting speed-up—are orthogonal to the design of any particular RL algorithm, which is why we do not review that extensive literature here. As an anonymous reviewer pointed out, the term “agent” has traditionally referred to algorithms whose every action affects the state of the world, whereas algorithms that perform computation without affecting the world ought to be called “reasoners.” However, modern AI agents can, and do, perform actions such as retrieving data, executing code, or running a verifier that do not change the state of the world (thus fitting within the search paradigm), yet nonetheless represent actions triggered by the chain of thought, which justifies the customary use of the term “agent.”
Markov Chains. We use general dynamical systems as a model of computation, but arguably the most important piece is the transition probability, which defines a Markov Chain. One may wonder why we need to introduce such general machinery, including ‘proper time,’ when standard concepts from Markov Chains, such as the expected hitting time, would suffice. As noted, however, hitting time could be made arbitrarily small or large without fundamentally changing the computations performed. We also note that since AI agents interact with the unknown environment, they are not closed systems describable with a Markov chain, but can still be described as a dynamical system. We also note that our use of dynamical systems or Markov structure is restricted to modeling the computations of the agent solving the task, not the data itself. In fact, we steer clear of ever making any assumption on the data-generating distribution aside from it satisfying Hilberg’s law, which defies the Markov assumption. We also note that, while ultimately one could argue that any physical process has a latent finite-memory or Markov structure, there is a fundamental difference between a Markov process of a known order and one of a finite but unknown order. In the former case, one can just instantiate a model with sufficient capacity and know that it will learn the statistics of all orders. In the latter case, one can never know when enough has been observed, and must instead be ready to add memory—effectively treating the process as non-Markov. The theory described in this manuscript pertains to the latter case.
Memorization and Generalization. Information complexity-based generalization theory formalizes the notion that generalization occurs whenever the information the learned hypothesis contains about the training data is minimal (low memorization). Ref. [
74] demonstrated that a single information exponential inequality is sufficient to derive PAC Bayes bounds [
9], the mutual information bound [
75], and a version of the conditional mutual information bound [
76]. Even the classical finite-hypothesis and VC dimension bound [
77] can be viewed as primitive versions of such bounds. All the aforementioned results assume that the training and test data are drawn as i.i.d. samples from a common distribution. In contrast with the aforementioned theory, it has been demonstrated that there are learning tasks where memorization is provably necessary to achieve high accuracy [
78], and that mitigating memorization can cause the model to fail on long-tailed tasks [
79,
80]. There is evidence that natural language is akin to such long-tailed tasks that require memorization. Recent work demonstrates that language models that explicitly memorize the training data can perform well. Ref. [
81] introduces nearest-neighbor language models (kNN-LM) that predict the next token according to the
k nearest neighbors of the context embedding, which requires explicitly encoding all context embeddings and their subsequent tokens. Ref. [
82] demonstrates that augmenting parametric LLMs with a kNN-LM can significantly boost performance. Even stronger, ref. [
83] demonstrates that a generalization of an
n-gram model (dubbed ∞-gram) outperforms kNN-LM, while losslessly encoding the training data into a suffix-array data structure.
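A minimal caricature of the kNN-LM prediction step may help fix ideas; the embeddings, tokens, and majority vote below are toy assumptions (real kNN-LMs form a softmax-weighted distribution over neighbors and interpolate it with the parametric model's):

```python
import numpy as np
from collections import Counter

def knn_lm_next_token(context_vec, stored_keys, stored_next_tokens, k=3):
    """Predict the next token from the k stored contexts nearest to the
    current context embedding (memorization-based prediction)."""
    dists = np.linalg.norm(stored_keys - context_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(stored_next_tokens[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy memory: 2-d context embeddings and the token that followed each.
keys = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
nexts = ["cat", "cat", "dog", "dog"]
print(knn_lm_next_token(np.array([0.05, 0.0]), keys, nexts, k=2))  # prints cat
```

The point of the sketch is that prediction here rests entirely on explicitly memorized training contexts rather than on a compressed parametric map, illustrating the memorization-friendly end of the trade-off discussed above.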
Additional Related Work. Ref. [
84] championed the analysis of inductive learning from the lens of algorithmic complexity. Ref. [
85] provides a comprehensive view of the compression approach to learning algorithmic information, complementary to ours.
Ref. [
86] develops reinforcement learning algorithms with insights from algorithmic information theory. Ref. [
87] uses Levin search to find
low-complexity solutions to a classification task, whereas we use Levin search to define a notion of time for universal task solvers. Ref. [
41] notes that solutions generally share zero algorithmic information and highlights the importance of learning for decreasing time. They also use an adaptive Levin search with information from the past, although the final solution bears no similarity to our approach. Ref. [
42] applies ideas from active learning to general problem solving.
Finally, the study of dynamical systems as a universal computer was developed extensively in the context of hybrid (continuous/discrete) systems, as exemplified in [
88]. Although the dynamical systems we describe here evolve in discrete space and time, some of the themes are recurrent, and additional insights may be gleaned from revisiting that literature.