1. Introduction
A number of recent studies have pointed out mathematical equivalences between thermodynamic systems described by statistical mechanics and information-processing systems [
1,
2,
3,
4]. In particular, it has been suggested that decision-makers with constrained information-processing resources can be described in analogy to closed physical systems in contact with a heat bath that seek to minimize energy [
1]. In this analogy, decision-makers can be thought to act in a way that minimizes a cost function or, equivalently, that maximizes a utility function in lieu of an energy function. Classic decision theory [
5,
6] states that, given a set of actions
$\mathcal{X}$ and a set of observations
$\mathcal{O}$, the perfectly rational decision-maker should choose the best possible action
${x}^{*}\in \mathcal{X}$ that maximizes the expected utility
$U(x)$:
$${x}^{*}=\underset{x\in \mathcal{X}}{\mathrm{argmax}}\,U(x)=\underset{x\in \mathcal{X}}{\mathrm{argmax}}\sum _{o\in \mathcal{O}}p(o|x)V(o),\qquad (1)$$
where
$p(o|x)$ is the probability of the outcome
$o$ given action
$x$ and
$V(o)$ indicates the utility of this outcome. However, maximizing the expected utility is in general a costly computational operation that real decision-makers might not be able to perform.
Decision-makers that are unable to choose the best possible action
${x}^{*}$ due to a lack of computational resources have traditionally been studied in the field of bounded rationality. Originally proposed by Herbert Simon [
7,
8], bounded rationality comprises a medley of approaches ranging from optimization-based approaches like bounded optimality (searching for the program that achieves the best utility performance on a particular platform) [
9,
10,
11] and metareasoning (optimizing the cost of reasoning) [
12,
13,
14] to heuristic approaches that reject the notion of optimization [
15,
16,
17]. Recently, new impulses for the development of bounded rationality theory have come from information-theoretic and thermodynamic perspectives on the general organization of perception-action systems [
1,
3,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27]. In the economic and game-theoretic literature, these models have precursors that studied bounded rationality inspired by stochastic choice rules originally proposed by Luce, McFadden and others [
2,
28,
29,
30,
31,
32,
33,
34,
35,
36,
37,
38,
39]. In most of these models, decision-makers face a trade-off between the attainment of maximum utility and the required information-processing cost measured as an entropy or relative entropy. The optimal solution to this trade-off usually takes the form of a Boltzmann-like distribution analogous to equilibrium distributions in statistical physics. The decision-making process can then be conceptualized as a change from a prior strategy distribution to a posterior strategy distribution, where the change is triggered by a change in the utility landscape. However, studying changes in equilibrium distributions neglects not only the time required for this change, but also the adaptation process itself.
The main contribution of this paper is to show that the analogy between equilibrium thermodynamics and bounded rational decision-making [
1] can be extended to the non-equilibrium domain under the assumption that the temporal evolution of the utility function is externally driven and does not depend on the decision-maker’s actions. This allows for new predictions that can be tested in experimental setups investigating decision-makers that choose between multiple alternatives. When given sufficient time to adjust to the problem, such a decision-maker may achieve a bounded optimal performance given the available precision, which may be described by an equilibrium distribution; for example, a dart thrower who has reached their personal best performance after extensive training with prism glasses. However, if given insufficient time, the decision-maker may not achieve bounded optimal performance, but only an inferior performance biased by the specific information-processing mechanisms used by the decision-maker, which may in general be described by a non-equilibrium distribution; for example, a dart thrower who is wearing prism glasses for the first time and plays according to a non-adaptive strategy, thereby “dissipating” utility. The connection between the non-equilibrium and equilibrium domains is tied to the concept of dissipation and its role in fluctuation theorems, which are important recent results in non-equilibrium thermodynamics.
The paper is organized as follows. In
Section 2, we recapitulate the relation between bounded rational decision-making and equilibrium thermodynamics. In
Section 3, we relate decision-making processes to non-equilibrium thermodynamics. In
Section 4, we generalize concepts from non-equilibrium thermodynamics to make them applicable to a wider range of decision-making problems. In particular, we include a derivation of a generalized Jarzynski equality and a generalized Crooks’ theorem for decision-making. We provide simulations to illustrate the new relations in different decision-making scenarios. In
Section 5, we discuss our results.
2. Equilibrium Thermodynamics and Decision-Making
In thermodynamics, closed physical systems in thermal equilibrium with their environment are described by equilibrium distributions that do not change over time. For example, a gas in a box distributes its particles evenly over the entire space and will stay this way and not spontaneously concentrate in a corner of the box. When changing constraints of the physical system, equilibrium thermodynamics allows predicting the final state after the change has taken place. For example, when opening a divider between two boxes, the gas will expand further until it fills the entire space evenly. This way, equilibrium thermodynamics allows describing system behaviour as a change from a prior equilibrium distribution to a posterior equilibrium distribution triggered by a change in external constraints.
On an abstract level, one can think about changes in the distribution of a random variable from a prior to a posterior distribution as the basis of information processing. In Bayesian inference, for example, we update current prior beliefs
${p}_{0}(x)$ by means of a likelihood to obtain a posterior belief
${p}_{1}(x)$. Similarly, decision-making can be regarded as a process of changing a prior strategy
${p}_{0}(x)$ to a posterior strategy
${p}_{1}(x)$ through a process of deliberation [
1], thereby emphasizing the stochastic nature of choice [
40]. According to [
1], such transitions from prior to posterior with information constraints can be formalized by optimizing the variational problem:
$${p}_{1}^{\mathrm{eq}}=\underset{p}{\mathrm{argmax}}\,\Delta F[p],\qquad (2)$$
where:
$$\Delta F[p]=\sum _{x}p(x)\Delta U(x)-\frac{1}{\beta }{D}_{\mathrm{KL}}(p\Vert {p}_{0})\qquad (3)$$
is a free energy functional,
$\Delta U(x)$ is a change in utility (analogous to the notion of gains and losses in prospect theory [
15]),
${D}_{\mathrm{KL}}(\cdot \Vert \cdot )$ is the Kullback–Leibler divergence or relative entropy and
$\beta $ is a real-valued parameter that translates from informational units into utility units. Accordingly, Equation (
3) optimizes a trade-off between utility gains and information-processing resources quantified by the “information distance” between prior and posterior. In a physical system (where the energy function corresponds to a negative utility), Equation (
3) evaluated at the optimum
${p}_{1}^{\mathrm{eq}}$ quantifies the negative free energy difference
$\Delta F[{p}_{1}^{\mathrm{eq}}]$ between the final state 1 and the initial state 0 assuming an isothermal process with respect to the inverse temperature
$\beta $ and a negative energy difference of
$\Delta U={U}_{1}-{U}_{0}$.
For a given information cost parameter
$\beta $, the bounded rational decision-maker optimally trades off utility gain against informational resources according to Equation (
2), thereby following the strategy:
$${p}_{1}^{\mathrm{eq}}(x)=\frac{1}{{Z}_{\beta}}{p}_{0}(x){e}^{\beta \Delta U(x)}\qquad (4)$$
with partition function
${Z}_{\beta}={\sum}_{x}{p}_{0}(x){e}^{\beta \Delta U(x)}$. When inserting the optimal strategy
${p}_{1}^{\mathrm{eq}}(x)$ into Equation (
3), the certainty-equivalent value of strategy
${p}_{1}^{\mathrm{eq}}$ is determined by
$$\Delta F[{p}_{1}^{\mathrm{eq}}]=\frac{1}{\beta }\mathrm{log}{Z}_{\beta}.$$
For
$\beta \to 0$, the cost of computation dominates, and the optimal strategy is given by the prior strategy
${p}_{1}^{\mathrm{eq}}(x)={p}_{0}(x)$ with the value
${\mathrm{lim}}_{\beta \to 0}\Delta F[{p}_{1}^{\mathrm{eq}}]={\langle \Delta U(x)\rangle }_{{p}_{0}(x)}$. This models a decision-maker that cannot afford any information processing. When information costs are low (
$\beta \to \infty $), the optimal strategy
${p}_{1}^{\mathrm{eq}}(x)$ places all the probability mass on the maximum of
$\Delta U(x)$, and the value of the strategy is
${\mathrm{lim}}_{\beta \to \infty}\Delta F[{p}_{1}^{\mathrm{eq}}]={\mathrm{max}}_{x}\Delta U(x)$. This models a perfectly rational decision-maker that can hand-pick the best action. While this model includes maximum (expected) utility decision-making of Equation (
1) as a special case, note that conceptually, the formulation of the decision problem as a variational problem in the probability distribution is very different from traditional approaches that define an optimization problem directly in the space of actions.
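As a rough numerical illustration of the two limits discussed above, the following sketch evaluates the Boltzmann-like strategy and its certainty-equivalent value for a small discrete action set. The uniform prior, the utility values and the helper names `equilibrium_strategy` and `kl_divergence` are hypothetical choices made for this example, and the value is assumed to take the form $\frac{1}{\beta }\mathrm{log}{Z}_{\beta}$.

```python
import math

def equilibrium_strategy(prior, delta_u, beta):
    """Return (p1_eq, delta_f) with p1_eq(x) proportional to p0(x) * exp(beta * dU(x))."""
    # Shift utilities by their maximum to keep the exponentials numerically stable.
    u_max = max(delta_u)
    weights = [p * math.exp(beta * (u - u_max)) for p, u in zip(prior, delta_u)]
    z_shifted = sum(weights)
    posterior = [w / z_shifted for w in weights]
    # Certainty-equivalent value (1/beta) * log Z_beta, undoing the shift.
    delta_f = u_max + math.log(z_shifted) / beta
    return posterior, delta_f

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical toy problem: four actions with a uniform prior.
prior = [0.25, 0.25, 0.25, 0.25]
delta_u = [0.0, 1.0, 2.0, 3.0]

for beta in (1e-6, 1.0, 100.0):
    p1, d_f = equilibrium_strategy(prior, delta_u, beta)
    print(beta, [round(p, 3) for p in p1], round(d_f, 4))
```

For very small $\beta $ the printed value should approach the prior average of $\Delta U$, and for large $\beta $ it should approach the maximum of $\Delta U$, in line with the limits stated in the text.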
One possible objection to the strategy (
4) is that it requires computing the partition sum
${Z}_{\beta}$ over all possible actions, which is in general an intractable operation, even though Equation (
4) could still be of descriptive value. It should be noted, however, that the decision-maker is not required to explicitly compute
${p}_{1}^{\mathrm{eq}}(x)$; it suffices to produce a sample from
${p}_{1}^{\mathrm{eq}}(x)$ to generate a decision. This can be achieved, for example, by Markov Chain Monte Carlo (MCMC) methods that are specifically designed to avoid the explicit computation of partition sums [
41]. In the following, we recapitulate two simple sampling examples in the context of decision-making: a bounded rational decision-maker that uses a rejection sampling scheme and a bounded rational decision-maker that uses a variant of the Metropolis–Hastings scheme [
42].
Exemplary Bounded Rational Decision-Makers
The optimal distribution (
4) can be implemented, for example, by a decision-maker that follows a probabilistic satisficing strategy with aspiration level
$T\ge {\mathrm{max}}_{x}\Delta U(x)$. Such a decision-maker optimizes the utility
$\Delta U(x)$ by drawing samples from the prior distribution
${x}_{s}\sim {p}_{0}(x)$ and accepts with certainty the first sample
${x}_{s}$ with utility
$\Delta U({x}_{s})\ge T$ reaching the aspiration level
$T$ or any sample with utility below the aspiration level with acceptance probability
${p}_{\mathrm{accept}}=\mathrm{exp}(\beta (\Delta U({x}_{s})-T))$. The most efficient samplers use
$T={\mathrm{max}}_{x}\Delta U(x)$. For samplers with
$T>{\mathrm{max}}_{x}\Delta U(x)$, the probability distribution (
4) is still recovered, but more samples are required, as the acceptance probability
${p}_{\mathrm{accept}}$ is decreased in this case. This strategy is a particular version of the rejection sampling algorithm and is shown in pseudocode in Algorithm 1. We can see the direct connection between informational resources (“distance away from the prior”) and the average number of samples required until acceptance, as the expected number of required samples from
${p}_{0}$ to obtain one accepted sample from
${p}_{1}^{\mathrm{eq}}$ is given by
${\overline{n}}_{\beta}=\mathrm{exp}(\beta T)/{Z}_{\beta}\ge \mathrm{exp}{D}_{\mathrm{KL}}\left(p\Vert {p}_{0}\right)$ [
43]. In the limit of zero information processing with
${D}_{\mathrm{KL}}\left(p\Vert {p}_{0}\right)=0$ in the high-cost regime
$\beta \to 0$, the sampling complexity tends to its minimum
${\overline{n}}_{\beta \to 0}\to 1$.
Algorithm 1 Rejection sampling. 
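The satisficing scheme described in the text can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `rejection_sample`, the four-action toy problem with a uniform prior and the utility $\Delta U(x)=x$ are hypothetical and not taken from the original algorithm listing.

```python
import math
import random

def rejection_sample(sample_prior, delta_u, beta, aspiration, rng):
    """Draw one action from p1_eq(x) proportional to p0(x) * exp(beta * dU(x)).

    Returns the accepted action and the number of prior samples used.
    Assumes aspiration >= max_x dU(x), so the acceptance probability is at most 1.
    """
    n_tries = 0
    while True:
        n_tries += 1
        x = sample_prior(rng)                       # proposal from the prior
        accept_prob = math.exp(beta * (delta_u(x) - aspiration))
        if rng.random() <= accept_prob:             # accept with probability p_accept
            return x, n_tries

# Hypothetical toy problem: four actions, uniform prior, utility change dU(x) = x.
actions = [0, 1, 2, 3]

def sample_prior(rng):
    return rng.choice(actions)

def delta_u(x):
    return float(x)

rng = random.Random(0)
draws = [rejection_sample(sample_prior, delta_u, 1.0, 3.0, rng) for _ in range(20000)]
freq_best = sum(1 for x, _ in draws if x == 3) / len(draws)
mean_tries = sum(n for _, n in draws) / len(draws)
print(freq_best, mean_tries)
```

With these settings, the empirical frequency of the best action should be close to ${p}_{1}^{\mathrm{eq}}(3)$ and the average number of tries close to $\mathrm{exp}(\beta T)/{Z}_{\beta}$, up to sampling noise.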

In case we do not want to set an absolute aspiration level $T$, an incremental version of such a decision-maker can be realized by the Metropolis–Hastings scheme. Given a current action proposal $x$, the decision-maker generates a novel proposal ${x}^{\prime}$ from ${p}_{0}(x)$. If $\Delta U({x}^{\prime})\ge \Delta U(x)$, then the sample is accepted with certainty. An inferior sample is accepted with probability ${p}_{\mathrm{accept}}=\mathrm{exp}(\beta (\Delta U({x}^{\prime})-\Delta U(x)))$. The aspiration level in this case is variable and always given by the utility of the previous sample. This corresponds to a Markov chain with transition probability $p({x}^{\prime}|x)={p}_{0}({x}^{\prime})\mathrm{min}\{1,\mathrm{exp}\left(\beta \left(\Delta U({x}^{\prime})-\Delta U(x)\right)\right)\}$ and stationary distribution ${p}_{1}^{\mathrm{eq}}(x)$. This Markov chain fulfils detailed balance, i.e., ${p}_{1}^{\mathrm{eq}}(x)p({x}^{\prime}|x)={p}_{1}^{\mathrm{eq}}({x}^{\prime})p(x|{x}^{\prime})$, which implies that after infinitely many repetitions, the samples $x$ will follow the stationary distribution. This Markov chain is a particular version of the Metropolis–Hastings algorithm and is shown in pseudocode in Algorithm 2. The longer the chain runs, the further the distribution of $x$ will move away from the prior, i.e., the higher the informational resources will be. In the limit, the chain reaches the equilibrium distribution.
Algorithm 2 Metropolis–Hastings sampling. 
$x\sim {p}_{0}(x)$
repeat
  ${x}^{\prime}\sim {p}_{0}({x}^{\prime})$
  $u\sim \mathrm{Uniform}[0,1]$
  if $u\le \mathrm{exp}(\beta (\Delta U({x}^{\prime})-\Delta U(x)))$ then
    accept $x\leftarrow {x}^{\prime}$
until chain has converged to equilibrium
return $x$
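A runnable counterpart of Algorithm 2 might look like the sketch below. Because convergence cannot be tested directly, the loop is assumed here to run for a fixed number of steps `n_steps`; that parameter, the function name and the toy problem are hypothetical choices for illustration.

```python
import math
import random

def metropolis_hastings(sample_prior, delta_u, beta, n_steps, rng):
    """Run n_steps of the chain with proposals drawn from the prior p0."""
    x = sample_prior(rng)                           # initial action from the prior
    for _ in range(n_steps):
        x_new = sample_prior(rng)                   # independent proposal from p0
        log_ratio = beta * (delta_u(x_new) - delta_u(x))
        # Better proposals are always accepted; inferior ones with prob. exp(log_ratio).
        if log_ratio >= 0.0 or rng.random() <= math.exp(log_ratio):
            x = x_new
    return x

# Hypothetical toy problem: four actions, uniform prior, utility change dU(x) = x.
actions = [0, 1, 2, 3]

def sample_prior(rng):
    return rng.choice(actions)

def delta_u(x):
    return float(x)

rng = random.Random(1)

def freq_best(n_steps, n_runs=5000):
    """Empirical probability of the best action after n_steps chain steps."""
    hits = sum(metropolis_hastings(sample_prior, delta_u, 1.0, n_steps, rng) == 3
               for _ in range(n_runs))
    return hits / n_runs

print(freq_best(0), freq_best(1), freq_best(30))
```

With zero steps the output is a prior sample, and with more steps the frequency of the best action should move towards its equilibrium probability, which illustrates how longer chains correspond to larger informational resources.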

3. Non-Equilibrium Thermodynamics and Decision-Making
If decision-making is emulated by a Markov chain that converges to an equilibrium distribution and one wants to be absolutely certain that the chain has reached equilibrium, then one has to wait for an infinitely long time. For finite times, when considering only a limited number of samples from the chain, we are dealing in general with non-equilibrium anytime process models, i.e., computational processes that can be interrupted at any time to deliver an answer; a representative example is the Metropolis–Hastings dynamics when Algorithm 2 is run for
$k\in \mathbb{N}$ steps. The same holds true for a rejection sampling decision-maker. Even though Algorithm 1 generates equilibrium samples with a finite expected number of samples
${\overline{n}}_{\beta}$, before running the algorithm, it is unknown whether after a particular number of steps
k, a sample will be accepted or not; to have certainty, we would have to allow for an infinite amount of time (
$k\to \infty $). In an anytime version of rejection sampling, the probability of not accepting a sample after
$k$ tries is given by
${q}_{k}={\left[1-{Z}_{\beta}\mathrm{exp}(-\beta T)\right]}^{k}$, in which case the sample
${x}_{s}$ will be distributed according to the prior distribution
${p}_{0}(x)$. The probability of accepting a sample that is distributed according to
${p}_{1}^{\mathrm{eq}}(x)$ after
$k$ tries is given by
$1-{q}_{k}$. Accordingly, the action at time
$k$ follows a mixture distribution of the form:
$${p}_{k}^{\mathrm{neq}}(x)={q}_{k}{p}_{0}(x)+(1-{q}_{k}){p}_{1}^{\mathrm{eq}}(x).$$
The distribution ${p}_{k}^{\mathrm{neq}}(x)$ is a non-equilibrium distribution that reaches equilibrium ${p}_{k}^{\mathrm{neq}}(x)\to {p}_{1}^{\mathrm{eq}}(x)$ for $k\to \infty $. In the following, we ask how far the tools of non-equilibrium thermodynamics are applicable to such anytime decision-making processes.
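The anytime mixture can be evaluated directly for a small example. The sketch below assumes that the per-try acceptance probability is ${Z}_{\beta}\mathrm{exp}(-\beta T)$ and that the mixture weights are ${q}_{k}$ and $1-{q}_{k}$, as described in the text; the function name and the toy values are hypothetical.

```python
import math

def anytime_rejection_mixture(prior, delta_u, beta, aspiration, k):
    """Return (q_k, p_k_neq) for rejection sampling interrupted after k tries."""
    z = sum(p * math.exp(beta * u) for p, u in zip(prior, delta_u))
    p_eq = [p * math.exp(beta * u) / z for p, u in zip(prior, delta_u)]
    accept = z * math.exp(-beta * aspiration)       # per-try acceptance probability
    q_k = (1.0 - accept) ** k                       # probability of k rejections in a row
    p_neq = [q_k * p0 + (1.0 - q_k) * pe for p0, pe in zip(prior, p_eq)]
    return q_k, p_neq

# Hypothetical toy problem: four actions with a uniform prior.
prior = [0.25, 0.25, 0.25, 0.25]
delta_u = [0.0, 1.0, 2.0, 3.0]

for k in (0, 1, 5, 50):
    q_k, p_neq = anytime_rejection_mixture(prior, delta_u, 1.0, 3.0, k)
    print(k, round(q_k, 4), [round(p, 3) for p in p_neq])
```

For $k=0$ the mixture coincides with the prior, and for growing $k$ it approaches the equilibrium strategy.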
3.1. Non-Equilibrium Thermodynamics
In thermodynamics, non-equilibrium processes are often modelled in the presence of an external parameter
$\lambda (t)\in [0,1]$ that determines how the energy function
${E}_{\lambda}(x)$ changes over time; for example, when switching on a potential in a linear fashion, the energy would be
${E}_{\lambda}(x)={E}_{0}(x)+\lambda ({E}_{1}(x)-{E}_{0}(x))$. When the change in the parameter
$\lambda $ is done infinitely slowly (quasi-statically), the system’s probability distribution follows exactly the path of equilibrium distributions (for any
$\lambda $)
${p}_{\lambda}(x)=\frac{1}{{Z}_{\lambda}}{e}^{-\beta {E}_{\lambda}(x)}$. Importantly, when the switching of the external parameter
$\lambda $ is done in finite time, the trajectory in phase space of the evolving thermodynamic system can potentially be very different from the quasi-static case. In particular, the non-equilibrium path of probability distributions is going to be, in general, different from the equilibrium path. We define the trajectory of an evolving system as a finite sequence of states
$\mathbf{x}:=({x}_{0},{x}_{1},\cdots {x}_{N})$ at times
${t}_{0},{t}_{1},\dots ,{t}_{N}$, and the probability of the trajectory as
$p(\mathbf{x}):=p({x}_{0}|{t}_{0}){\prod}_{n=1}^{N}p({x}_{n}|{x}_{n-1},{t}_{n})$ that follows Markovian dynamics. Since
$\lambda $ is then a function of time
$\lambda ({t}_{n})$, we can effectively consider the energy as a function of state and time
$E({x}_{n},{t}_{n}):={E}_{\lambda ({t}_{n})}({x}_{n})$. Accordingly, the internal energy of the system can change in two ways depending on changes in the two variables
${t}_{n}$ and
${x}_{n}$. Assuming discrete time steps, an energy change due to a change in the external parameter is defined as the work [
24,
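44]:
$$w({x}_{n-1},{t}_{n-1}\to {t}_{n}):=E({x}_{n-1},{t}_{n})-E({x}_{n-1},{t}_{n-1})$$
and an energy change due to an internal state change is defined as the heat [
24,
44]:
$$q({x}_{n-1}\to {x}_{n},{t}_{n}):=E({x}_{n},{t}_{n})-E({x}_{n-1},{t}_{n}).$$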
For an entire process trajectory
${x}_{0},{x}_{1},\dots ,{x}_{N}$ measured at times
${t}_{0},{t}_{1},\dots ,{t}_{N}$, the extracted work is
$W(\mathbf{x})={\sum}_{n=1}^{N}w({x}_{n-1},{t}_{n-1}\to {t}_{n})$, and the heat transferred to the environment by relaxation steps is
$Q(\mathbf{x})={\sum}_{n=1}^{N}q({x}_{n-1}\to {x}_{n},{t}_{n})$. The sum of work and heat is the total energy difference
$\Delta E(\mathbf{x}):=E({x}_{N},{t}_{N})-E({x}_{0},{t}_{0})=W(\mathbf{x})+Q(\mathbf{x})$. In expectation with respect to
$p(\mathbf{x})$, we define the average work
$W:={\langle W(\mathbf{x})\rangle }_{p(\mathbf{x})}$, the average heat
$Q:={\langle Q(\mathbf{x})\rangle }_{p(\mathbf{x})}$ and the average energy change
$\Delta E:={\langle \Delta E(\mathbf{x})\rangle }_{p(\mathbf{x})}$. With these averaged quantities, we obtain the first law of thermodynamics in its usual form:
$$\Delta E=W+Q.$$
The heat
$Q$ can be decomposed into a reversible part, given by the entropy difference
$\Delta S=S({t}_{N})-S({t}_{0})$ multiplied by the temperature
$T$, and an irreversible part, given by the average dissipation
${W}^{\mathrm{diss}}$. The concept of dissipation will be particularly useful later to quantify inefficiencies in decision-making processes with limited time. By identifying the equilibrium free energy difference with
$\Delta F:=F({t}_{N})-F({t}_{0})=\Delta E-T\Delta S$, we can then write the first law as:
$$W=\Delta F-{W}^{\mathrm{diss}}.\qquad (8)$$
In case of a quasi-static process, the extracted work
$W$ exactly coincides with the equilibrium free energy difference (thus,
${W}^{\mathrm{diss}}=0$). In the case of a finite-time process, we can express the average dissipated work as [
45,
46,
47]:
$${W}^{\mathrm{diss}}=\frac{1}{\beta }{D}_{\mathrm{KL}}\left(p(\mathbf{x})\Vert {p}^{\dagger}(\mathbf{x})\right),\qquad (9)$$
where
${D}_{\mathrm{KL}}$ is the relative entropy that measures in bits the distinguishability between the probability of the forward-in-time trajectory
$p(\mathbf{x})$ and the probability of the backward-in-time trajectory
${p}^{\dagger}(\mathbf{x}):=p({x}_{N}|{t}_{N}){\prod}_{n=1}^{N}p({x}_{n-1}|{x}_{n},{t}_{n-1})$. From the positivity of the relative entropy, we can immediately see the non-negativity of entropy production
${W}^{\mathrm{diss}}\ge 0$, which allows stating the second law of thermodynamics in the form:
$$W\le \Delta F.\qquad (10)$$
3.1.1. Crooks’ Fluctuation Theorem
Equation (
9) can be given in a more general form without averages. It is possible to relate the reversibility of a process with its dissipation at the trajectory level. Given a protocol
$\mathrm{\Lambda}=({\lambda}_{0},{\lambda}_{1},\cdots {\lambda}_{N})$, i.e., a sequence of external parameters, the probability
$p(\mathbf{x})$ of observing a trajectory of the system in phase space compared with its time-reversal conjugate
${p}^{\dagger}(\mathbf{x})$ (when using the time-reversal protocol
${\mathrm{\Lambda}}^{\dagger}=({\lambda}_{N},{\lambda}_{N-1},\cdots ,{\lambda}_{0})$) depends on the dissipation of the trajectory in the forward direction according to the following expression:
$$\frac{p(\mathbf{x})}{{p}^{\dagger}(\mathbf{x})}={e}^{\beta {W}^{\mathrm{diss}}(\mathbf{x})},$$
where
${W}^{\mathrm{diss}}(\mathbf{x})=\Delta F-W(\mathbf{x})$ is the dissipated work of the trajectory. For this relation to be true, both backward and forward processes must start with the system in equilibrium. Intuitively, this means that the larger the entropy production (measured by the dissipated work), the more distinguishable the trajectories of the forward protocol are from those of the backward protocol.
3.1.2. Jarzynski Equality
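Additionally, another relation of interest in non-equilibrium thermodynamics has recently been found transforming the inequality of Equation (
10) into an equality, the so-called Jarzynski equality [
48]:
$${\langle {e}^{-\beta {W}^{\mathrm{diss}}(\mathbf{x})}\rangle }_{p(\mathbf{x})}={\langle {e}^{-\beta (\Delta F-W(\mathbf{x}))}\rangle }_{p(\mathbf{x})}=1,$$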
where the angle brackets denote an average over all possible trajectories
$\mathbf{x}$ of a process that drives the system from an equilibrium state at
$\lambda =0$ to another state at
$\lambda =1$. Specifically, the above equality says that, no matter how the driving process is implemented, we can determine equilibrium quantities from work fluctuations in the non-equilibrium process; or in other words, this equality connects non-equilibrium thermodynamics with equilibrium thermodynamics. In the following, we are interested in the question whether there exist similar relations such as the Jarzynski equality or Crooks’ fluctuation theorem and similar underlying concepts such as dissipation and time reversibility for the case of decision-making.
3.2. Non-Equilibrium Thermodynamics Applied to Bounded Rational Decision-Making
In direct analogy to the previous section, in the following, we consider decision-makers faced with the problem of optimizing a changing utility function. We assume that time is discretized into $N$ steps ${t}_{0},\dots ,{t}_{N}$. For each time step ${t}_{n}$, the utility is assumed to be constant, but it can change between time steps, such that we have a sequence of decision problems expressed by the changes in utility $\Delta U(x,{t}_{0}\to {t}_{1}),\cdots ,\Delta U(x,{t}_{N-1}\to {t}_{N})$. At each time point ${t}_{n}$, the decision-maker chooses action ${x}_{n}$, such that we can summarize the decision-maker’s choices by a vector $\mathbf{x}:=({x}_{0},\dots ,{x}_{N})$. The behaviour of the decision-maker is characterized by the probability $p(\mathbf{x}):=p({x}_{0}|{t}_{0}){\prod}_{n=1}^{N}p({x}_{n}|{x}_{n-1},{t}_{n})$ with $p({x}_{0}|{t}_{0})={p}_{0}({x}_{0})$, assuming that the initial strategy is a bounded rational equilibrium strategy. In this setup, we assume that the changes in the utility function are externally driven, i.e., the decision-maker’s actions cannot change the temporal evolution of the utility function. Furthermore, note that the decision-maker does not know how the utility changes over time. Accordingly, the best the decision-maker can do is to optimize the current utility as much as possible.
At time
${t}_{0}$, the decision-maker starts with selecting an action
${x}_{0}$ from the distribution
$p({x}_{0}|{t}_{0})$ and the utility changes instantly by
$\Delta U(x,{t}_{0}\to {t}_{1})$. The decision-maker can then adapt to this utility change with the distribution
$p({x}_{1}|{x}_{0},{t}_{1})$ and select the action
${x}_{1}$ at time
${t}_{1}$, but at this point, the utility is already changing again by
$\Delta U(x,{t}_{1}\to {t}_{2})$. The adaptation from
$p({x}_{0}|{t}_{0})$ to
$p({x}_{1}|{x}_{0},{t}_{1})$ is analogous to a physical relaxation process and implies a strategy change between
${x}_{0}$ and
${x}_{1}$. In general, at each time point
${t}_{n-1}$, the decision-maker chooses action
${x}_{n-1}$ while the current utility changes by:
$$\Delta U({x}_{n-1},{t}_{n-1}\to {t}_{n}):=U({x}_{n-1},{t}_{n})-U({x}_{n-1},{t}_{n-1}).$$
This way, the decision-maker is always lagging behind the changes in utility, just like a physical system would lag behind the changes in the energy function. The utility $\Delta U({x}_{n-1},{t}_{n-1}\to {t}_{n})$ gained by the decision-maker at time point ${t}_{n-1}$ parallels the concept of work in physics. For a whole trajectory, we define the total utility gain due to changes in the environment as $\mathcal{U}(\mathbf{x})={\sum}_{n=1}^{N}\Delta U({x}_{n-1},{t}_{n-1}\to {t}_{n})$. Note that the last decision ${x}_{N}$ can be ignored in this notation, as it does not contribute to the utility.
In
Figure 1 (left column), we illustrate the setup for a one-step decision problem
$\Delta U(x,{t}_{0}\to {t}_{1})$ with behaviour vector
$\mathbf{x}=({x}_{0},{x}_{1})$. An instantaneous change in the environment occurs at time
${t}_{0}$ represented by a vertical jump from
${\lambda}_{0}$ to
${\lambda}_{1}$ in the upper panels that translates directly into a change in free energy difference represented by
$\Delta F$ in the lower panels. The system’s previous state at
${t}_{0}$ is given by
${p}_{0}^{\mathrm{eq}}(x)$, i.e., the equilibrium distribution for
${U}_{0}$. The new equilibrium is given by
${p}_{1}^{\mathrm{eq}}(x)$, i.e., the equilibrium distribution for
${U}_{1}$. In this case, the behaviour vector is
$\mathbf{x}=({x}_{0},{x}_{1})$ with
${x}_{0}\sim {p}_{0}^{\mathrm{eq}}(x)$, and
${x}_{1}$ is ignored.
Similarly to Equation (
8), we can now formulate the first law for decision-making as:
$$\mathcal{U}=\Delta F-{\mathcal{U}}^{\mathrm{diss}},$$
stating that the total average utility
$\mathcal{U}:={\langle \mathcal{U}(\mathbf{x})\rangle }_{p(\mathbf{x})}$ is the difference between the bounded optimal utility (following the equilibrium strategy with precision
$\beta $) expressed by the equilibrium free energy difference
$\Delta F$ and the dissipated utility
${\mathcal{U}}^{\mathrm{diss}}$. The dissipation for a trajectory
${\mathcal{U}}^{\mathrm{diss}}(\mathbf{x}):=\Delta F-\mathcal{U}(\mathbf{x})$ measures the amount of utility loss due to the inability of the decision-maker to act according to the equilibrium distribution. This is because the decision-maker cannot anticipate the changes in the environment. At most, the decision-maker could act according to the equilibrium distributions of the previous environment. Thus, even with full adaptation, the decision-maker will always lag behind one time step and will therefore always dissipate.
By an equivalent version of Equation (
9), we can also state the second law for decision-making
${\mathcal{U}}^{\mathrm{diss}}\ge 0$, which implies that a purely adaptive decision-maker can gain an average utility that cannot be larger than the free energy difference:
$$\mathcal{U}\le \Delta F.$$
Similarly, we can obtain equivalent relationships to the Crooks fluctuation theorem:
$$\frac{p(\mathbf{x})}{{p}^{\dagger}(\mathbf{x})}={e}^{\beta {\mathcal{U}}^{\mathrm{diss}}(\mathbf{x})}$$
and the Jarzynski equality:
$${\langle {e}^{\beta \mathcal{U}(\mathbf{x})}\rangle }_{p(\mathbf{x})}={e}^{\beta \Delta F},$$
which both have the same implications as in the physical scenario and can be derived in the same way as in the physical counterpart [
44]. In summary, we can say that an adaptive decision-maker, which has to act without knowing that the utility function has changed, follows the same laws as a thermodynamic physical system that is lagging behind the equilibrium.
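The second law and the Jarzynski-type equality for decision-making can be checked exactly on a small discrete problem. The sketch below assumes that the equality takes the form ${\langle {e}^{\beta \mathcal{U}(\mathbf{x})}\rangle }_{p(\mathbf{x})}={e}^{\beta \Delta F}$, which follows from the definition ${\mathcal{U}}^{\mathrm{diss}}(\mathbf{x})=\Delta F-\mathcal{U}(\mathbf{x})$, and that the decision-maker adapts with a single Metropolis–Hastings step per time step. The prior, the utility increments and all function names are hypothetical.

```python
import math

def mh_kernel(prior, cum_u, beta):
    """Transition matrix K[x][x_new] of one Metropolis-Hastings step with prior proposals."""
    m = len(prior)
    kernel = [[0.0] * m for _ in range(m)]
    for x in range(m):
        for x_new in range(m):
            if x_new != x:
                accept = min(1.0, math.exp(beta * (cum_u[x_new] - cum_u[x])))
                kernel[x][x_new] = prior[x_new] * accept
        kernel[x][x] = 1.0 - sum(kernel[x])         # probability of staying put
    return kernel

def apply_kernel(dist, kernel):
    """Propagate a (possibly unnormalized) distribution through the kernel."""
    m = len(dist)
    return [sum(dist[x] * kernel[x][x_new] for x in range(m)) for x_new in range(m)]

def drive(prior, steps, beta):
    """Exact averages for a decision-maker that lags behind a changing utility.

    steps[n][x] is the utility change of action x at the n-th switch.
    Returns (average utility, <exp(beta * U)>, Delta F).
    """
    m = len(prior)
    marginal = list(prior)      # distribution of the action chosen before each switch
    weighted = list(prior)      # same, reweighted by exp(beta * utility gained so far)
    cum_u = [0.0] * m
    avg_utility = 0.0
    for step in steps:
        avg_utility += sum(p * du for p, du in zip(marginal, step))
        weighted = [w * math.exp(beta * du) for w, du in zip(weighted, step)]
        cum_u = [c + du for c, du in zip(cum_u, step)]
        kernel = mh_kernel(prior, cum_u, beta)      # relaxation towards the new equilibrium
        marginal = apply_kernel(marginal, kernel)
        weighted = apply_kernel(weighted, kernel)
    z = sum(p * math.exp(beta * c) for p, c in zip(prior, cum_u))
    return avg_utility, sum(weighted), math.log(z) / beta

# Hypothetical toy problem: total utility change switched on in five equal increments.
prior = [0.4, 0.3, 0.2, 0.1]
total_change = [0.0, 1.0, 2.0, 3.0]
n_switches = 5
steps = [[u / n_switches for u in total_change] for _ in range(n_switches)]
beta = 2.0

avg_u, jarzynski_avg, delta_f = drive(prior, steps, beta)
print(avg_u, delta_f, jarzynski_avg, math.exp(beta * delta_f))
```

Because the averages are computed by exact enumeration rather than by sampling, the exponential average should match ${e}^{\beta \Delta F}$ up to rounding error, while the plain average utility stays below $\Delta F$.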
3.3. Examples
In this section, we illustrate the applicability of thermodynamic non-equilibrium concepts in a series of simulations for different decision-making scenarios. In particular, we study two model classes: the first one contains simple one-step lag models of adaptation where equilibrium is always reached with one time step delay, and the second one contains more complex models of adaptation that do not necessarily equilibrate after one time step. In the first model class, we can easily study the relation between dissipation and the rate of information processing, whereas in the second class of models, we can study more complex non-equilibrium phenomena such as learning hysteresis.
3.3.1. One-Step Lag Models of Adaptation
Consider a learner that is adapted to their environment such that their behaviour can be described by the equilibrium distribution
${p}_{0}(x)$. For this idealized scenario, we assume that the learner can adapt their behaviour to any environment perfectly after a time lapse of
$\Delta t$. This also means that before the lapse of
$\Delta t$, the learner continues to follow their old strategy and is inefficient during this time span. We now consider two scenarios: first, where the environment changes suddenly by
$\Delta U(x)$, and second, where the environment changes slowly in
N small steps of
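$\Delta U(x)/N$. In the first case, the learner is going to dissipate the utility:
$${\mathcal{U}}^{\mathrm{diss}}=\Delta F-{\langle \Delta U(x)\rangle }_{{p}_{0}(x)}=\frac{1}{\beta }{D}_{\mathrm{KL}}({p}_{0}\Vert {p}_{1}^{\mathrm{eq}})$$
in the first time step. In all subsequent time steps, no more utility is wasted, assuming the environment does not change any more. In the second case, the utility function can be written as
${U}_{t}(x)={U}_{0}(x)+\frac{t}{N}\Delta U(x)$ for
$t\in \mathbb{N}:0\le t\le N$. To compute the dissipated utility, we need to compare the learner’s behaviour in time step
$t$ to the bounded optimal behaviour, which is:
$${p}_{t}(x)=\frac{1}{{Z}_{t}}{p}_{t-1}(x){e}^{\beta \Delta U(x)/N}$$
for
$t>0$, where ${Z}_{t}$ normalizes the distribution. The overall average dissipated utility for the whole process is then
$${\mathcal{U}}_{N}^{\mathrm{diss}}=\frac{1}{\beta }\sum _{t=1}^{N}{D}_{\mathrm{KL}}({p}_{t-1}\Vert {p}_{t}).$$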
The net utility gain for the $N$-step scenario is
${\mathcal{U}}_{N}^{\mathrm{net}}=\Delta F-{\mathcal{U}}_{N}^{\mathrm{diss}}$. Note that:
and consequently, in direct analogy to a quasi-static change in a thermodynamic system, we get vanishing dissipation (
${\mathcal{U}}_{N}^{\mathrm{diss}}\to 0$) if the utility changes infinitely slowly (
$N\to \infty $ and
$\Delta U(x)/N\to 0$), such that the net utility equals the free energy difference
${\mathcal{U}}_{N}^{\mathrm{net}}=\Delta F$.
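The vanishing of dissipation for slow changes can be illustrated numerically. The sketch below assumes a learner that always plays the equilibrium strategy of the previous time step while the utility is switched on in $N$ equal increments; the function names and the toy values are hypothetical.

```python
import math

def boltzmann(prior, utility, beta):
    """Return the distribution proportional to p0(x) * exp(beta * utility(x)) and its normalizer."""
    weights = [p * math.exp(beta * u) for p, u in zip(prior, utility)]
    z = sum(weights)
    return [w / z for w in weights], z

def n_step_dissipation(prior, delta_u, beta, n_steps):
    """Dissipated utility when delta_u is switched on in n_steps equal increments."""
    _, z_total = boltzmann(prior, delta_u, beta)
    delta_f = math.log(z_total) / beta
    gained = 0.0
    for t in range(1, n_steps + 1):
        # Strategy used during step t: equilibrium for the utility reached at step t - 1.
        lagged, _ = boltzmann(prior, [(t - 1) / n_steps * u for u in delta_u], beta)
        gained += sum(p * u / n_steps for p, u in zip(lagged, delta_u))
    return delta_f - gained

# Hypothetical toy problem: four actions with a uniform prior.
prior = [0.25, 0.25, 0.25, 0.25]
delta_u = [0.0, 1.0, 2.0, 3.0]

for n in (1, 2, 10, 100, 1000):
    print(n, round(n_step_dissipation(prior, delta_u, 1.0, n), 5))
```

In this example the dissipation shrinks roughly in proportion to $1/N$, so that the net utility approaches the free energy difference for slow changes.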
3.3.2. Bayesian Inference as a One-Step Lag Process
Bayesian inference mechanisms naturally have step-by-step dynamics that update beliefs with new incoming observations. Again, we can consider two scenarios: first, where the learner updates their belief abruptly by processing a huge chunk of data in one go, and second, where belief updates are incremental with small chunks of data at each time step. Here, we show how the size of the chunks of data affects the overall surprise of the decision-maker and how this relates to dissipation when applying the free energy principle to Bayesian inference.
Traditionally, Bayes’ rule is obtained directly from the product rule of probabilities
$p(\theta ,\mathcal{D})=p(\theta )p(\mathcal{D}|\theta )=p(\mathcal{D})p(\theta |\mathcal{D})$, where
$\theta $ corresponds to the different available hypotheses and
$\mathcal{D}$ corresponds to the dataset. However, Bayes’ rule can also be considered to be a consequence of the maximization of the free energy difference with the log-likelihood as a utility function [
49,
50,
51]. In this view, the posterior belief
$p(\theta |\mathcal{D})$ is a trade-off between maximizing the likelihood
$p(\mathcal{D}|\theta )$ and minimizing the distance from the prior
${p}_{0}(\theta )$ such that:
$$p(\theta |\mathcal{D})=\frac{{p}_{0}(\theta ){e}^{\beta \mathrm{log}p(\mathcal{D}|\theta )}}{\int {p}_{0}({\theta}^{\prime}){e}^{\beta \mathrm{log}p(\mathcal{D}|{\theta}^{\prime})}d{\theta}^{\prime}}$$
is identical to Bayes’ rule when
$\beta =1$. For
$\beta \to \infty $, we recover the maximum likelihood estimation method as the density update is
$p(\theta |\mathcal{D})=\delta (\theta -{\theta}_{\mathrm{MLE}})$ with
${\theta}_{\mathrm{MLE}}={\mathrm{argmax}}_{\theta}\mathrm{log}p(\mathcal{D}|\theta )$.
Such a Bayesian learner with prior
${p}_{0}(\theta )$ that incorporates all the data
$\mathcal{D}$ at once is going to experience the expected surprise
$\mathcal{S}=\int {p}_{0}(\theta )\mathrm{log}p(\mathcal{D}|\theta )d\theta $. In contrast, a Bayesian learner that incorporates the data slowly in
$N$ steps (thus, the dataset
$\mathcal{D}=({X}_{1},\cdots ,{X}_{N})$ is divided into
$N$ parts) experiences an expected surprise of
$\mathcal{S}={\sum}_{n=1}^{N}\int p(\theta |{X}_{1},\cdots ,{X}_{n-1})\mathrm{log}p({X}_{n}|\theta )d\theta $. Here, the surprise
$\mathcal{S}$ corresponds to the thermodynamic concept of work. The first law can then be written as:
$$\mathcal{S}=\Delta F-{\mathcal{U}}^{\mathrm{diss}},$$
where the equivalent of dissipation corresponds to:
$${\mathcal{U}}^{\mathrm{diss}}={D}_{\mathrm{KL}}({p}_{0}(\theta )\Vert p(\theta |\mathcal{D}))$$
when processing all the data at once and to:
$${\mathcal{U}}^{\mathrm{diss}}=\sum _{n=1}^{N}{D}_{\mathrm{KL}}(p(\theta |{X}_{<n})\Vert p(\theta |{X}_{\le n}))$$
when processing the data in
$N$ steps where
${X}_{<n}=({X}_{1},\cdots ,{X}_{n-1})$ and
${X}_{\le n}=({X}_{1},\cdots ,{X}_{n})$. Thus, given that the equilibrium free energy difference
$\Delta F$ is a state function independent of the path (that means independent of whether data are processed all in one go or in small chunks), a system acquiring data slowly will have a reduced surprise
$\mathcal{S}$ and therefore have less dissipation
${\mathcal{U}}^{\mathrm{diss}}$.
In
Figure 2, we show how the number of data chunks has an effect on the overall surprise and dissipation. In particular, we have a dataset
$\mathcal{D}=({x}_{1},\cdots ,{x}_{T})$ consisting of
$T=100$ Gaussian-distributed data points
$x\sim \mathcal{N}(x;{\mu}_{d}=5,{\sigma}_{d}^{2}=4)$ that we divide into batches of different sizes
$b\in \{100,50,25,20,10,5,2,1\}$. The decision-maker has prior belief
${p}_{0}(\theta )$ about the mean
$\theta ={\mu}_{d}$ and incorporates each batch according to Bayes’ rule until all the data are incorporated. In general, the Bayesian learner processes the data in
$T/b$ steps; for example in the case of
$b=100$, all data are processed at once (having thus high surprise), and in the case of
$b=1$, it incorporates the data in
T updates with an overall smaller surprise. In
Figure 2, we show for different batch sizes the free energy optimum
$\Delta F=\mathrm{log}\int {p}_{0}(\theta )p(\mathcal{D}|\theta )d\theta $, the surprise
$\mathcal{S}$ and the dissipation
${\mathcal{U}}^{\mathrm{diss}}=\Delta F-\mathcal{S}$. It can be seen that when acquiring the data in small chunks, the surprise of the decision-maker and the dissipation are lower.
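The batch experiment can be reproduced in closed form for a conjugate model. The sketch below is an illustration under stated assumptions rather than the code behind Figure 2: the prior over the mean is taken to be $\mathcal{N}(0,1)$, the observation variance ${\sigma}_{d}^{2}=4$ is assumed known to the learner, and the function name `batch_learning` and the random seed are hypothetical. For each batch it accumulates the expected log-likelihood under the current belief (the surprise $\mathcal{S}$), the log evidence of the batch (whose sum is $\Delta F$) and then performs the conjugate Bayes update.

```python
import numpy as np

def batch_learning(x, b, m0=0.0, s0_sq=1.0, sigma_sq=4.0):
    """Return (delta_F, surprise, dissipation) when x is absorbed in batches of size b."""
    m, s_sq = m0, s0_sq          # current Gaussian belief N(m, s_sq) about the mean
    delta_f, surprise = 0.0, 0.0
    for start in range(0, len(x), b):
        xb = x[start:start + b]
        n = len(xb)
        d = xb - m
        # Expected log-likelihood of the batch under the current belief.
        surprise += np.sum(-0.5 * np.log(2 * np.pi * sigma_sq)
                           - (d ** 2 + s_sq) / (2 * sigma_sq))
        # Log evidence of the batch: x_b ~ N(m * 1, sigma_sq * I + s_sq * 1 1^T).
        log_det = (n - 1) * np.log(sigma_sq) + np.log(sigma_sq + n * s_sq)
        quad = (np.sum(d ** 2) - s_sq * np.sum(d) ** 2 / (sigma_sq + n * s_sq)) / sigma_sq
        delta_f += -0.5 * (n * np.log(2 * np.pi) + log_det + quad)
        # Conjugate Bayes update of the belief about the mean.
        precision = 1.0 / s_sq + n / sigma_sq
        m = (m / s_sq + np.sum(xb) / sigma_sq) / precision
        s_sq = 1.0 / precision
    return delta_f, surprise, delta_f - surprise

rng = np.random.default_rng(0)               # hypothetical seed
x = rng.normal(5.0, 2.0, size=100)           # T = 100 points with mean 5 and variance 4
results = {b: batch_learning(x, b) for b in (100, 50, 25, 20, 10, 5, 2, 1)}
for b, (dF, S, diss) in results.items():
    print(f"b={b:3d}  dF={dF:9.3f}  S={S:9.3f}  diss={diss:8.3f}")
```

Because the log evidence factorizes over batches, $\Delta F$ comes out the same for every batch size, whereas the dissipation is non-negative for each batch by Jensen's inequality and is much smaller when the data are absorbed one point at a time than when they are absorbed all at once.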
3.4. Dissipation and Learning Hysteresis
A common paradigm to study how humans learn is through adaptation tasks where subjects are exposed to changes in an environmental variable that they can counteract by changing an internal variable. Sensorimotor adaptation in humans has been extensively studied in these error-based paradigms, for example where subjects have to adapt their hand position (internal variable) to change a virtual end effector position represented by a dot on a screen (external variable).
Consider a utility function
${U}_{v}(x)=-{(x-{\mu}_{v})}^{2}$. For
$v=0$, we determine the prior behaviour of a decision-maker with
${p}_{0}(x)=\frac{{e}^{\beta {U}_{0}(x)}}{Z}$. Initially, the decision-maker obtains an average utility of
${\langle {U}_{0}\rangle}_{{p}_{0}}$, which corresponds to zero mismatch between the decision-maker and the environmental variable. A change of the environmental variable to
$v=1$ effectively changes the utility function to
${U}_{1}(x)=-{(x-{\mu}_{1})}^{2}$, making
${p}_{0}$ non-optimal. This forces the decision-maker to reduce the error by adapting to the environmental variable, that is, by changing its probability distribution over actions. When fully adapted to the new environment, the decision-maker again makes no errors (other than the errors due to motor noise). We illustrate this adaptation paradigm with a decision-maker that adapts according to the Metropolis–Hastings algorithm, which follows Markovian dynamics [
52].
Crooks Theorem and Hysteresis Effects in Adaptation Tasks
Limited adaptation capabilities not only have an effect on the amount of obtained utility through the second law for decision-making ${\mathcal{U}}^{\mathrm{net}}\le \Delta F$, but also induce a time asymmetry in sequential decision-making processes. Hysteresis loops are a typical example of this asymmetry. Hysteresis is the phenomenon in which the path followed by a system due to an external perturbation, e.g., from state A to B, is not the same as the path followed in the reverse perturbation, e.g., from state B to A. When the system follows the same path for the forward perturbation and for the reverse perturbation, we say that the process is time symmetric (and therefore, it is not subject to hysteresis effects).
In the two left panels of
Figure 3, we show a simulated trajectory of actions composed of 80 trials for an adaptation task using the Metropolis–Hastings algorithm with
$\beta =22.5$, a Gaussian proposal
$g({x}^{\prime}|x)=\mathcal{N}({x}^{\prime};\mu =x,{\sigma}_{p}=0.1)$ and acceptance criterion
$\alpha ({x}^{\prime}|x)=\mathrm{min}\left(\frac{{e}^{\beta U({x}^{\prime})}g(x|{x}^{\prime})}{{e}^{\beta U(x)}g({x}^{\prime}|x)},1\right)$, when changing the environmental variable from
${\mu}_{0}=0.0$ to
${\mu}_{1}=1.0$. In blue, we show the trajectory for the forward-in-time perturbation, which converges after a few dozen trials to the new equilibrium. In brown, we show the trajectory for the reversed perturbation where the process starts with the last trial (80) and ends with the initial trial (0). In the left panel, the perturbation is made instantaneously in one step at Trial 40 and in the right panel in multiple steps (
$N=23$). The hysteresis effect is clearly seen in the instantaneous perturbation where the path of actions followed by the decision-maker in the forward perturbation is clearly different from a typical trajectory of actions taken when applying the reversed perturbation. When the perturbation is made in multiple steps, the typical backward and forward trajectories become more similar, indicating a smaller hysteresis effect. In this way, hysteresis effects are tightly connected to the concept of dissipation.
Dissipation and the ratio between forward and backward probabilities of trajectories of actions correspond exactly to the Crooks theorem for decision-making:
$$\frac{p(\mathbf{x})}{p({\mathbf{x}}^{\dagger})}={e}^{\beta {\mathcal{U}}^{\mathrm{diss}}}$$
The probability of observing a trajectory of accepted actions
$\mathbf{x}=({x}_{0},{x}_{1},\cdots ,{x}_{T})$ for the Metropolis–Hastings algorithm is easily computed with
$p(\mathbf{x})=p({x}_{0}){\prod}_{t=1}^{T}g({x}_{t}|{x}_{t-1})\alpha ({x}_{t}|{x}_{t-1})$. Similarly, the probability of observing the same trajectory in the backward protocol is
$p({\mathbf{x}}^{\dagger})={p}_{\mathrm{eq}}({x}_{T}){\prod}_{t=1}^{T}g({x}_{T-t}|{x}_{T-t+1})\alpha ({x}_{T-t}|{x}_{T-t+1})$. The dissipated utility is
${\mathcal{U}}^{\mathrm{diss}}=\Delta F-{U}_{\mathrm{tot}}$ where the free energy difference is computed between the final
${p}_{1}(x)=\frac{1}{Z}{e}^{\beta {U}_{1}(x)}$ and initial equilibrium distributions
${p}_{0}(x)=\frac{1}{Z}{e}^{\beta {U}_{0}(x)}$, and the total utility gained
${U}_{\mathrm{tot}}$ is the sum of the utilities
$\Delta U(x,{t}_{n}\to {t}_{n+1})$ at each environmental change at time
${t}_{n}$. In the third panel of
Figure 3, we show that the protocol with the instantaneous perturbation has higher dissipation (related to higher hysteresis) compared to the protocol with multiple small perturbations.
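The difference between the two protocols can be checked with a small simulation. The sketch below is not the code used for Figure 3: it assumes the utility is the negative squared error $U(x)=-{(x-\mu )}^{2}$, runs many independent Metropolis–Hastings chains in parallel, and lets the chains relax for a hypothetical number of steps (`steps_per_change`) after each environmental change. The function names, the number of chains and the seed are illustrative assumptions. Since shifting $\mu$ does not change the normalization constant, $\Delta F=0$ and the average dissipation equals minus the average total utility gained at the changes.

```python
import numpy as np

def utility(x, mu):
    # Assumed utility: negative squared distance to the environmental variable.
    return -(x - mu) ** 2

def mh_step(x, mu, beta, sigma_p, rng):
    """One Metropolis-Hastings step per chain; the Gaussian proposal is symmetric,
    so the proposal ratio in the acceptance criterion cancels."""
    proposal = x + sigma_p * rng.standard_normal(x.shape)
    log_alpha = beta * (utility(proposal, mu) - utility(x, mu))
    accept = np.log(rng.random(x.shape)) < log_alpha
    return np.where(accept, proposal, x)

def average_dissipation(mus, steps_per_change, beta=22.5, sigma_p=0.1,
                        n_chains=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Start in equilibrium: exp(-beta * (x - mu)^2) is Gaussian with variance 1 / (2 * beta).
    x = mus[0] + rng.standard_normal(n_chains) / np.sqrt(2 * beta)
    u_tot = np.zeros(n_chains)
    for mu_old, mu_new in zip(mus[:-1], mus[1:]):
        # Utility change experienced by the current action when the environment changes.
        u_tot += utility(x, mu_new) - utility(x, mu_old)
        for _ in range(steps_per_change):
            x = mh_step(x, mu_new, beta, sigma_p, rng)
    delta_f = 0.0  # shifting mu leaves the normalizer Z unchanged
    return delta_f - u_tot.mean()

diss_instant = average_dissipation(np.array([0.0, 1.0]), steps_per_change=40)
diss_gradual = average_dissipation(np.linspace(0.0, 1.0, 24), steps_per_change=3)
print(diss_instant, diss_gradual)
```

For the instantaneous change the average dissipation is close to 1, the squared size of the jump, whereas spreading the same change over 23 small steps with a few adaptation steps in between leaves the chains lagging only slightly behind and dissipates considerably less.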
4. Generalized Non-Equilibrium Thermodynamics for Decision-Making with Deliberation
So far, we have studied decision-makers that were forced to select an action with no opportunity to respond to a change in the utility function. This could correspond, for example, to a scenario of trial-and-error learning, where the best available strategy is the prior strategy adapted to the environment before the utility changed. However, this restriction may not always be suitable. Consider for example a chess player that is shown a particular board configuration (corresponding to a change in utility) and now has a certain amount of time to decide on the next move. Similarly, consider the two introductory examples in
Section 3, where we allow a sampling algorithm to run for a certain number of steps, and then, we stop and evaluate the action after the algorithm has adapted to the new utility. In general, such deliberation processes are expensive, and we assume in the following that the Kullback–Leibler divergence is an appropriate measure of this computational expense, as outlined in the Introduction.
In the following, we consider again decision-makers facing a sequence of decision problems expressed by the utility changes
$\Delta U(x,{t}_{0}\to {t}_{1}),\cdots ,\Delta U(x,{t}_{N-1}\to {t}_{N})$. In contrast to the previous section where decision-makers had to decide before they could adapt to the utility change, decision-makers that deliberate select their action
${x}_{n}$ after they have (partially) adapted to the utility change:
Using this notation, we are able to summarize the decision-maker’s choice by a vector $\mathbf{x}:=({x}_{0},\dots ,{x}_{N})$ and characterize its behaviour by the probability $p(\mathbf{x}):=p({x}_{0}|{t}_{0}){\prod}_{n=1}^{N}p({x}_{n}|{x}_{n-1},{t}_{n})$ with $p({x}_{0}|{t}_{0})={p}_{0}({x}_{0})$, assuming that the initial strategy is a bounded rational equilibrium strategy. Note that in the deliberation scenario, the initial state ${x}_{0}$ does not constitute a decision, but instead, we include the last decision ${x}_{N}$.
This setup is illustrated again in
Figure 1 (right column) for a one-step decision problem
$\Delta U(x,{t}_{0}\to {t}_{1})$ with behaviour vector
$\mathbf{x}=({x}_{0},{x}_{1})$ and with an instantaneous change in the environment occurring at time
${t}_{0}$. In the deliberation scenario, the utility is determined after the deliberation time. During deliberation, the decision-maker has changed the strategy distribution from
${p}_{0}^{\mathrm{eq}}(x)$ to a non-equilibrium distribution
$\tilde{p}(x)$ (for example, the distribution (
6) in the rejection sampling scheme) spending in the process a certain amount of resources and achieving an average net utility of
${\mathcal{U}}^{\mathrm{net}}=\Delta F[\tilde{p}(x)]$ according to Equation (
3). In this case, the behaviour vector is
$\mathbf{x}=({x}_{0},{x}_{1})$ with
${x}_{0}$ ignored and
${x}_{1}\sim \tilde{p}(x)$. In such a scenario with a single decision problem, we define, in analogy with the previous section, the average dissipated utility as [
24,
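53]:
$${\mathcal{U}}^{\mathrm{diss}}:=\Delta F-{\mathcal{U}}^{\mathrm{net}}=\frac{1}{\beta}{D}_{\mathrm{KL}}\left(\tilde{p}(x)\Vert {p}_{1}^{\mathrm{eq}}(x)\right)\tag{16}$$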
See Appendix for a derivation of (
16) from (
9). It readily follows from the positivity of the relative entropy
${D}_{\mathrm{KL}}\left(p\Vert q\right)\ge 0$ that:
$${\mathcal{U}}^{\mathrm{net}}\le \Delta F\tag{17}$$
with equality when
$\tilde{p}(x)={p}_{1}^{\mathrm{eq}}(x)$. In the case of the rejection sampling decision-maker of Equation (
6), this would correspond to an infinite number of samples
$k\to \infty $. The inequality (
17) shows that we cannot obtain more utility than the equilibrium free energy difference.
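The single-step relation between dissipation and relative entropy can be verified numerically. The following minimal sketch assumes a discrete action set, a net utility of the form $\langle \Delta U\rangle -\frac{1}{\beta}{D}_{\mathrm{KL}}(\tilde{p}\Vert {p}_{0})$ and an equilibrium strategy proportional to ${p}_{0}(x){e}^{\beta \Delta U(x)}$; the utility values, the prior and the partially adapted strategy `p_tilde` are hypothetical numbers chosen only for illustration.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D_KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

beta = 2.0
delta_u = np.array([0.1, 0.9, 0.4, 0.6])   # hypothetical utility change
p0 = np.full(4, 0.25)                       # prior strategy

# Equilibrium posterior and equilibrium free energy difference.
p1_eq = p0 * np.exp(beta * delta_u)
Z = p1_eq.sum()
p1_eq /= Z
delta_f = np.log(Z) / beta

def net_utility(p_tilde):
    # Expected utility gain minus the informational cost of moving away from the prior.
    return float(p_tilde @ delta_u) - kl(p_tilde, p0) / beta

p_tilde = 0.5 * p0 + 0.5 * p1_eq            # a partially adapted, non-equilibrium strategy
diss = delta_f - net_utility(p_tilde)

print(diss, kl(p_tilde, p1_eq) / beta)      # the two numbers agree
print(net_utility(p1_eq), delta_f)          # equilibrium attains the bound
```

The dissipated utility coincides with the scaled relative entropy between the strategy actually used and the equilibrium strategy, so it vanishes only when deliberation has fully converged.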
Let us now look at the general case. In contrast to an agent without deliberation capabilities, an agent that deliberates will be able to act according to a different distribution than the prior strategy. This means that when facing the utility change
$\Delta U(x,{t}_{n-1}\to {t}_{n})$ at time
${t}_{n}$, the agent chooses the action
${x}_{n}$ sampled from the posterior strategy, contrary to an agent without deliberation that chooses
${x}_{n-1}$ sampled from the prior strategy. The deliberation process incurs a computational cost that is measured (in a similar fashion to stochastic thermodynamics [
54] and previous formulations of bounded rationality given in the introduction) with the difference between the conditional stochastic entropies from prior to posterior:
Note that the prior distribution $p({x}_{n}|{x}_{n-1},{t}_{n-1})$ is the previous posterior distribution evaluated at ${x}_{n}$ instead of ${x}_{n-1}$. Basically, this measures the change in probability from prior behaviour to posterior behaviour of the newly chosen action ${x}_{n}$.
Taking into account the computational cost of deliberation, we define the net utility of action
${x}_{n}$ due to a change in the environment as
$$u({x}_{n},{t}_{n-1}\to {t}_{n}):=\Delta U({x}_{n},{t}_{n-1}\to {t}_{n})-\frac{1}{\beta}\mathrm{log}\frac{p({x}_{n}|{x}_{n-1},{t}_{n})}{p({x}_{n}|{x}_{n-1},{t}_{n-1})}$$
which generalizes the concept of work from the previous section. The expected change in net utility is the objective function that the decision-maker optimizes at each time step. The total net utility
${\mathcal{U}}^{\mathrm{net}}(\mathbf{x})={\sum}_{n=1}^{N}u({x}_{n},{t}_{n-1}\to {t}_{n})$ takes the form of a non-equilibrium free energy:
at the trajectory level. Similarly to Equation (
8), the first law for decision-making with deliberation costs is:
and states that the total net utility
${\mathcal{U}}^{\mathrm{net}}={\langle {\mathcal{U}}^{\mathrm{net}}(\mathbf{x})\rangle}_{p(\mathbf{x})}$ is the difference between the bounded optimal utility (following the equilibrium strategy with precision
$\beta $) expressed by the equilibrium free energy difference
$\Delta F$ and the dissipated utility
${\mathcal{U}}^{\mathrm{diss}}$. The dissipation:
measures the amount of utility loss if the decision-maker’s plan does not manage to produce an action from the equilibrium distribution, for example due to the lack of time for deliberation. However, a decision-maker with infinite deliberation time will not have this problem and therefore will not dissipate by wasting utility.
To investigate the counterpart of the second law, we need to determine whether ${\mathcal{U}}^{\mathrm{diss}}\ge 0$ holds. This can be achieved, for example, by first deriving the counterpart of the Crooks fluctuation theorem or the counterpart of the Jarzynski equation with subsequent application of Jensen’s inequality. In the following two theorems, we assume that the decision-makers satisfy the detailed balance condition. The detailed balance condition ensures two important characteristics. First, the stochastic process reaches equilibrium, and second, it ensures time-reversibility when in equilibrium. In a decision-making scenario, this translates into the following. First, when given enough computation time, the decision-makers manage to sample actions from the correct equilibrium distributions. Second, ideal decision-makers in equilibrium should not produce any entropy, which is exactly what happens if detailed balance is satisfied.
Theorem 1. Crooks’ fluctuation theorem for decision-making with deliberation costs states that:
$$\frac{p(\mathbf{x})}{{p}^{\dagger}(\mathbf{x})}={e}^{\beta {\mathcal{U}}^{\mathrm{diss}}(\mathbf{x})}\tag{20}$$
where the dissipated utility of a particular trajectory is ${\mathcal{U}}^{\mathrm{diss}}(\mathbf{x})=\Delta F-{\mathcal{U}}^{\mathrm{net}}(\mathbf{x})$ as defined in Equation (18) and the probability of the trajectory using the backward protocol is ${p}^{\dagger}(\mathbf{x})={p}^{\dagger}({x}_{0}|{x}_{1},{t}_{0}){p}^{\dagger}({x}_{1}|{x}_{2},{t}_{1})\cdots {p}^{\dagger}({x}_{N}|{t}_{N})$ for N decision problems starting at time ${t}_{N}$ and going backwards up to ${t}_{0}$. For the relation to be valid, we must assume that the starting distribution in the backward process is also in equilibrium, $p({x}_{N}|{t}_{N})\propto {e}^{\beta U({x}_{N},{t}_{N})}$.
Proof. Here, we derive the relationship between reversibility and dissipation.
where in the second line, we have substituted
${p}^{\dagger}({x}_{n-1}|{x}_{n},{t}_{n-1})$ using the identity:
from detailed balance, and we assumed the initial distribution to be in equilibrium
$p({x}_{0}|{t}_{0})=\frac{{e}^{\beta U({x}_{0},{t}_{0})}}{{Z}_{0}}$ and that in the backward process the decision-maker starts also using the equilibrium strategy
${p}^{\dagger}({x}_{N}|{t}_{N})=\frac{1}{{Z}_{N}}{e}^{\beta U({x}_{N},{t}_{N})}$. In the third line, we cancel out terms and apply the following two equalities
$\frac{p({x}_{n}|{x}_{n-1},{t}_{n})}{p({x}_{n}|{x}_{n-1},{t}_{n-1})}={e}^{\beta \frac{1}{\beta}\mathrm{log}\frac{p({x}_{n}|{x}_{n-1},{t}_{n})}{p({x}_{n}|{x}_{n-1},{t}_{n-1})}}$ and
$\Delta U({x}_{n},{t}_{n-1}\to {t}_{n})=U({x}_{n},{t}_{n})-U({x}_{n},{t}_{n-1})$. Finally, in the last line, we employ the definition of the net utility in Equation (
18) and
$\frac{{Z}_{N}}{{Z}_{0}}={e}^{\beta \Delta F}$. ☐
Although at first sight, Equation (
20) looks the same as the previous Crooks’ relation for the no-deliberation case (
12), it is not the same. Here, the net utility is defined by Equation (
18), which takes into account both the gain in utility and the computational costs of deliberating.
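Theorem 2. The Jarzynski equality for decision-making with deliberation costs states that:
$${\langle {e}^{\beta {\mathcal{U}}^{\mathrm{net}}(\mathbf{x})}\rangle}_{p(\mathbf{x})}={e}^{\beta \Delta F}\tag{21}$$
Proof.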
In (
$1.$), we unfold the expression and exploit the equality
${e}^{\mathrm{log}p+\mathrm{log}q}=pq$ for the summation inside the exponential. In (
$2.$), we cancel the trajectory probabilities
${\prod}_{n=1}^{N}p({x}_{n}|{t}_{n},{x}_{n-1})$ and then take one term out of the two remaining products. In (
$3.$), first, we use the equivalence
$\mathrm{exp}(\beta U({x}_{1},{t}_{0}))={Z}_{0}{p}_{\mathrm{eq}}({x}_{1}|{t}_{0})$ (because at time
${t}_{0}$, the decision-maker is acting according to the equilibrium distribution) that allows us to cancel with
$p({x}_{1}|{t}_{0},{x}_{0})={p}_{\mathrm{eq}}({x}_{1}|{t}_{0})$, and second, we sum over
${x}_{0}$ with the only term that depends on it being
$p({x}_{0}|{t}_{0})$. In (
$4.$), we take one term of the second product and perform the sum over
${x}_{1}$ to obtain by detailed balance
$\mathrm{exp}(\beta U({x}_{2},{t}_{1}))$ that will allow us to cancel with the term in the denominator of the first product. We perform Steps (
$3.$) and (
$4.$) repeatedly until obtaining the last equivalence that proves the theorem.
Again, we note that the previously proven Jarzynski relation from Equation (
21) is not the same equation as in the no-deliberation case (
13). In the deliberation case, the definition of the net utility is different and takes into account both the utility gain and the computational cost of deliberating.
We can now state the second law of decision-making with deliberation costs as:
from Equation (
20) by rearranging and taking expectations. The same inequality can be obtained from Equation (
21) by applying Jensen’s inequality
$\langle \mathrm{exp}x\rangle \ge \mathrm{exp}\langle x\rangle $ to recover
${\langle {\mathcal{U}}^{\mathrm{net}}(\mathbf{x})\rangle}_{p(\mathbf{x})}\le \Delta F$. Equation (
21) connects finite with infinite time decision-making. That is, there is a relation between the equilibrium free-energy difference, which is the maximum attainable net utility with unlimited computation time, and the net utility obtained by decision-makers with limited computation time. In the next section, we will provide examples of how to use these relations to extract useful information from decision-making processes.
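For a small discrete system, the Jarzynski equality with deliberation costs can be checked exactly by enumerating all trajectories. The sketch below is an illustration under stated assumptions: three actions, two utility changes, and a deliberation process modelled as k steps of a Metropolis kernel with a uniform proposal, which satisfies detailed balance with respect to the distribution proportional to ${e}^{\beta U(x,{t}_{n})}$. At ${t}_{0}$ the agent is taken to be in equilibrium, so the kernel at ${t}_{0}$ is the equilibrium distribution itself. The utility table, the function name `metropolis_kernel` and the parameter values are hypothetical.

```python
import itertools
import numpy as np

def metropolis_kernel(u, beta, k):
    """k-step Metropolis kernel K[i, j] = p(j | i); it satisfies detailed
    balance with respect to the distribution proportional to exp(beta * u)."""
    m = len(u)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j:
                K[i, j] = min(1.0, np.exp(beta * (u[j] - u[i]))) / m
        K[i, i] = 1.0 - K[i].sum()
    return np.linalg.matrix_power(K, k)

beta, k = 1.5, 2
U = np.array([[0.0, 0.2, 0.1],     # U(x, t_0), hypothetical values
              [1.0, 0.3, 0.0],     # U(x, t_1)
              [0.2, 0.9, 1.4]])    # U(x, t_2)
N, m = U.shape[0] - 1, U.shape[1]
Z = np.exp(beta * U).sum(axis=1)
p_eq0 = np.exp(beta * U[0]) / Z[0]
# At t_0 the agent is in equilibrium, so p(x_1 | x_0, t_0) = p_eq(x_1 | t_0).
kernels = [np.tile(p_eq0, (m, 1))]
kernels += [metropolis_kernel(U[n], beta, k) for n in range(1, N + 1)]

jarzynski_lhs, mean_net = 0.0, 0.0
for x in itertools.product(range(m), repeat=N + 1):
    prob, u_net = p_eq0[x[0]], 0.0
    for n in range(1, N + 1):
        post = kernels[n][x[n - 1], x[n]]        # p(x_n | x_{n-1}, t_n)
        prior = kernels[n - 1][x[n - 1], x[n]]   # p(x_n | x_{n-1}, t_{n-1})
        prob *= post
        u_net += U[n][x[n]] - U[n - 1][x[n]] - np.log(post / prior) / beta
    jarzynski_lhs += prob * np.exp(beta * u_net)
    mean_net += prob * u_net

delta_f = np.log(Z[N] / Z[0]) / beta
print(jarzynski_lhs, np.exp(beta * delta_f))   # equal up to rounding
print(mean_net, delta_f)                       # average net utility stays below delta_f
```

The exponential average of the net utility reproduces ${e}^{\beta \Delta F}$ even though only two deliberation steps are allowed per decision, while the plain average of the net utility stays below $\Delta F$, in line with the second law.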
4.1. Examples
For the deliberation scenario, we illustrate the novel Jarzynski equality and Crooks theorem for decision-making in two decision-making scenarios with clearly defined independent episodes: the first case is a discrete decision-making problem, and the second case is a continuous decision-making problem.
4.1.1. Jarzynski and Crooks Relations for Episodic Decision-Making with Deliberation
Choice-reaction-time experiments that aim to study information-processing in humans typically consider episodic tasks consisting of many trials; see [
55] for a recent example. Here, we take a variation of Hick’s episodic task with discrete action space, commonly used in the decision-making literature. In our variation of Hick’s task, the decision-maker is shown a set of eight light bulbs. Initially, all light bulbs are turned off. Upon stimulus presentation, all light bulbs are turned on with different light intensities (representing different utilities) for a limited amount of time in which the decision-maker must choose the brightest light associated with the highest utility. The choice task is repeated many times, each time with different light intensities. For simplicity, our example contains only two stimuli: compare Utility 1 and Utility 2 in
Figure 4A. When given enough time, a decision-maker with prior
${p}_{0}(x)$ chooses its actions according to the equilibrium distribution from Equation (
4), as illustrated in
Figure 4A for the uniform prior
${p}_{0}(x)=\frac{1}{8}$ that we assume in our example. In this case, the precision
$\beta $ specifies how well the light intensities can be told apart by a bounded optimal decision-maker.
In
Figure 4, we model a decision-maker using the rejection sampling algorithm with the most efficient aspiration level given by the maximum utility
${\mathrm{max}}_{x}\Delta U(x)$. In particular, we simulate the rejection sampling algorithm with a limited number of samples (parameterized by
k), where the choice strategy is given by the non-equilibrium probability distribution in Equation (
6) from the Introduction, because we assume that a response has to be produced within a fixed amount of time.
In this kind of episodic task, the decision-maker always starts with the same prior
${p}_{0}(x)$ over the possible choices
x. The probability of a trajectory of decisions
$\mathbf{x}$ is defined as
$p(\mathbf{x}):={\prod}_{n=1}^{N}p({x}_{n}|{t}_{n})$ for each episode
n, and the net utility for a trajectory is:
Consequently, the equilibrium free energy is defined as
$\Delta F:={\mathrm{max}}_{\tilde{p}(\mathbf{x})}{\langle {\mathcal{U}}_{0}^{\mathrm{net}}(\mathbf{x})\rangle}_{\tilde{p}(\mathbf{x})}$, which can also be decomposed into the sum of
N independent equilibrium free energies
$\Delta F={\sum}_{n=1}^{N}{\langle \Delta U({x}_{n},{t}_{n-1}\to {t}_{n})-\frac{1}{\beta}\mathrm{log}\frac{{p}^{\mathrm{eq}}({x}_{n}|{t}_{n})}{{p}_{0}({x}_{n})}\rangle}_{{p}^{\mathrm{eq}}({x}_{n}|{t}_{n})}$ where:
and the dissipated utility for a trajectory is
${\mathcal{U}}^{\mathrm{diss}}(\mathbf{x}):=\Delta F-{\mathcal{U}}_{0}^{\mathrm{net}}(\mathbf{x})$.
We simulate trajectories with
$N=2$ by sampling repeatedly from Equation (
6). In the first panel of
Figure 4B, we show that, as expected, the more samples
k a decision-maker can afford, the higher the average net utility
${\langle {\mathcal{U}}_{0}^{\mathrm{net}}\rangle}_{p(\mathbf{x})}$. In the second panel, it can be seen that the equilibrium free energy difference is invariant with respect to
k and increases with higher precision
$\beta $. Lastly, in the third panel, we plot the average dissipated utility
${\langle {\mathcal{U}}^{\mathrm{diss}}\rangle}_{p(\mathbf{x})}$ that measures how much utility is lost due to the limited number of available samples. The highest dissipation occurs for high
$\beta $ and few samples
k because such a high-precision decision-maker can potentially obtain high utility, but the limited number of samples restrains it. In the following, we consider both a Jarzynski-like relation and a fluctuation theorem valid for a fixed prior.
Jarzynski Equality for Decision-Making with Fixed Prior ${p}_{0}$
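For a fixed prior, it can readily be shown that the following relation is valid:
$${\langle {e}^{\beta {\mathcal{U}}_{0}^{\mathrm{net}}(\mathbf{x})}\rangle}_{p(\mathbf{x})}={e}^{\beta \Delta F}\tag{23}$$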
To illustrate the validity of Equation (
23), we simulated a decision-maker that faces
T times the same two decision problems from
Figure 4A. We can estimate the left-hand side of Equation (
23) with the empirical average
$\frac{1}{T}{\sum}_{i}\mathrm{exp}(\beta {\mathcal{U}}_{0}^{\mathrm{net}}({\mathbf{x}}_{i}))$ with the
T trajectories of decisions, where
${\mathbf{x}}_{i}\sim p(\mathbf{x})$. In the top row of
Figure 4C, we show the empirical average converging to
$\mathrm{exp}(\beta \Delta F)$ (as expected by the law of large numbers) depending on the number of simulated trajectories
T and precision
$\beta $, empirically validating Equation (
23). In the bottom row, we show how the second law for decision-making is fulfilled as the average net utility is less than the equilibrium free energy, thus satisfying the inequality (
17).
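The fixed-prior Jarzynski relation can be checked with a short simulation of the episodic task. The following is a sketch under explicit assumptions, not a reproduction of Figure 4: the utilities of the eight lights are drawn at random, and the finite-sample choice distribution is assumed to be a mixture in which an accepted rejection-sampling proposal is distributed according to the equilibrium strategy, while after k rejected proposals the agent falls back on a sample from the prior. This mixture form, the function name `rejection_strategy` and all parameter values are hypothetical and may differ from the distribution given in Equation (6).

```python
import numpy as np

def rejection_strategy(p0, delta_u, beta, k):
    """Assumed choice distribution after at most k rejection-sampling proposals:
    an accepted proposal follows p_eq; if all k proposals are rejected,
    the agent falls back on a fresh sample from the prior p0."""
    accept = np.exp(beta * (delta_u - delta_u.max()))   # aspiration level: max utility
    a = float(p0 @ accept)                               # acceptance probability per proposal
    p_eq = p0 * accept / a
    fail = (1.0 - a) ** k                                # probability that all k proposals fail
    return (1.0 - fail) * p_eq + fail * p0

rng = np.random.default_rng(1)                           # hypothetical seed
beta, k, T = 2.0, 3, 20000
p0 = np.full(8, 1.0 / 8.0)                               # uniform prior over eight lights
utilities = rng.random((2, 8))                           # two hypothetical stimuli (N = 2)

delta_f, exact_lhs, samples = 0.0, 1.0, np.zeros(T)
for du in utilities:
    p = rejection_strategy(p0, du, beta, k)
    net = du - np.log(p / p0) / beta                     # net utility of each action
    delta_f += np.log(p0 @ np.exp(beta * du)) / beta     # equilibrium free energy of the episode
    exact_lhs *= p @ np.exp(beta * net)                  # episodes are independent
    samples += net[rng.choice(8, size=T, p=p)]           # sampled trajectories
estimate = np.mean(np.exp(beta * samples))

print(exact_lhs, np.exp(beta * delta_f))                 # exact identity
print(estimate)                                          # Monte Carlo estimate from T trajectories
```

The exact expectation equals ${e}^{\beta \Delta F}$ for any choice distribution with full support, so the identity does not depend on the particular mixture assumed here; the Monte Carlo estimate approaches the same value as the number of simulated trajectories grows.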
Crooks’ Fluctuation Theorem for Decision-Making with Fixed Prior ${p}_{0}$
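For the fixed prior, it can readily be shown that the following fluctuation relation holds:
$$\frac{p(\mathbf{x})}{{p}^{\mathrm{eq}}(\mathbf{x})}={e}^{\beta {\mathcal{U}}^{\mathrm{diss}}(\mathbf{x})}\tag{24}$$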
where
${p}^{\mathrm{eq}}(\mathbf{x}):={\prod}_{n=1}^{N}{p}^{\mathrm{eq}}({x}_{n}|{t}_{n})$ is the optimal equilibrium distribution over trajectories
$\mathbf{x}$. Note in this case that the probability distribution of the backward process
${p}^{\dagger}(\mathbf{x})$ coincides with the optimal equilibrium distribution
${p}^{\dagger}(\mathbf{x})={p}^{\mathrm{eq}}(\mathbf{x})$ because of the independence of the decision problems. More specifically, the original Crooks theorem for decision-making from Equation (
20) is valid only when the backward process starts in equilibrium. In our episodic task, all decision problems are independent, which makes the starting equilibrium distributions for all the backward processes coincide with the posterior equilibrium distributions of the forward process.
The fluctuation relation (
24) for episodic tasks adopts a different meaning than the conventional relation. Specifically, the ratio between probabilities is now between the probability of observing a trajectory of actions when having finite time to make a decision (a sequence of non-equilibrium probabilities) and the probability of observing the same trajectory when having infinite time (a sequence of equilibrium probabilities). This ratio is governed by the exponential of the dissipated utility
${\mathcal{U}}^{\mathrm{diss}}(\mathbf{x})$ similarly to the original Crooks equation.
Equation (
24) can be rewritten by rearranging the terms and averaging over
$p(\mathbf{x})$ as
Consequently, we see that purely from the trajectories of actions, we can obtain the average dissipated utility. We can test this relation in human experiments by comparing the trajectories of actions in two different conditions, first when having finite time and second when having as much time as needed. Then, from the probabilities of action trajectories, we can extract the average dissipated utility.
4.1.2. Jarzynski and Crooks Relations for Deliberating Continuous Decisions
Since many decision tasks take place in the continuous domain (for example, sensorimotor tasks), we now consider continuous state space problems. In particular, we repeat the same analysis as in the previous section by validating our Jarzynski equation, but this time in the continuous domain. Moreover, in this example, we allow for adaptive changes in the prior, such that the prior in one trial is equal to the posterior of the previous trial. In the following, we model decision-making as a diffusion process with Langevin dynamics that stops after a certain time t and emits an action x. The diffusion process uses gradient information to find the optimum utility and will converge to an equilibrium distribution for $t\to \infty $. In our example, we will employ quadratic utility functions that allow for a closed form solution of the non-equilibrium probability density that changes over time.
Let
$x(t)\in \mathbb{R}$ be the dynamics of computation that a decision-maker carries out when deliberating. The differential equation that describes the dynamics is:
where
$\xi (t)$ is white Gaussian noise with mean
$\langle \xi (t)\rangle =0$ and correlation
$\langle \xi (t)\xi ({t}^{\prime})\rangle =2D\delta (t-{t}^{\prime})$. Note that Equation (
25) is closely related to learning algorithms that use gradient information such as Stochastic Gradient Descent (SGD). These algorithms find the minimum of a cost function by taking steps in the state space in the opposite direction of the gradient. Here, we see that the learning rate corresponds to the parameter
$\alpha $, which, in contrast with plain GD, not only multiplies the gradient, but also the noise term.
Equation (
25) gives the dynamics of the decision-making process in terms of a stochastic differential equation, which can equivalently be expressed by the evolution of the probability
$p(x,t)$ described by the Fokker–Planck equation [
56]:
In order to compute the net utility, we need the probability of the non-equilibrium distribution up to a desired time
t; thus, we need to solve the Fokker–Planck equation. For quadratic utility functions
${U}_{y}(x)=-({a}_{y}{x}^{2}+{b}_{y}x)$ with coefficients
${a}_{y}$ and
${b}_{y}$ for environment
y and initial Gaussian distribution with mean
${\mu}_{0}$ and variance
${\sigma}_{0}^{2}$, the solution is (see Appendix):
with:
where
$c=2\alpha {a}_{1}$, and we assumed that the prior strategy is Gaussian distributed with mean
${\mu}_{0}$ and variance
${\sigma}_{0}^{2}$. The precision parameter relates to the other parameters through
$\beta =\frac{2\alpha}{D}$: a higher
$\alpha $ gives more weight to the gradient and therefore leads to a higher
$\beta $, and a lower noise level
D likewise leads to a higher
$\beta $.
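The relaxation of such a diffusion towards equilibrium can be illustrated numerically. The following is a minimal sketch and not the closed-form solution referred to above: it assumes overdamped Langevin dynamics written directly in terms of the precision, $dx=\alpha {U}^{\prime}(x)dt+\sqrt{2\alpha /\beta}\,dW$, whose stationary density is proportional to ${e}^{\beta U(x)}$, together with a concave quadratic utility $U(x)=-(a{x}^{2}+bx)$. This noise convention, the Euler–Maruyama discretization, the function name `langevin` and all numerical values are assumptions made for illustration and need not match the parametrization in terms of D used in the text.

```python
import numpy as np

def langevin(a, b, beta, alpha=1.0, mu0=0.0, sigma0_sq=1.0,
             t_end=20.0, dt=0.01, n_particles=5000, seed=0):
    """Euler-Maruyama simulation of dx = alpha * U'(x) dt + sqrt(2 * alpha / beta) dW
    for U(x) = -(a * x**2 + b * x); returns the particle positions at time t_end."""
    rng = np.random.default_rng(seed)
    # Gaussian prior strategy with mean mu0 and variance sigma0_sq.
    x = mu0 + np.sqrt(sigma0_sq) * rng.standard_normal(n_particles)
    noise_scale = np.sqrt(2.0 * alpha / beta * dt)
    for _ in range(int(round(t_end / dt))):
        grad = -(2.0 * a * x + b)                       # U'(x)
        x = x + alpha * grad * dt + noise_scale * rng.standard_normal(n_particles)
    return x

a, b, beta = 0.2, 0.4, 5.0                              # hypothetical coefficients
x_final = langevin(a, b, beta)
mean_eq = -b / (2.0 * a)                                # maximizer of U
var_eq = 1.0 / (2.0 * beta * a)                         # variance of the density ∝ exp(beta * U)

print(x_final.mean(), mean_eq)
print(x_final.var(), var_eq)
```

After a deliberation time that is long compared with the relaxation time $1/(2\alpha a)$, the empirical mean and variance of the particles match those of the equilibrium distribution; stopping the loop earlier yields the non-equilibrium strategies of a decision-maker with limited deliberation time.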
Following a similar approach as in the previous section, we expose a decision-maker to two utility functions given by
${U}_{1}(x)=-0.2{x}^{2}-0.4x-0.8$ and
${U}_{2}(x)=-0.4{x}^{2}-1.8x+1.025$ shown in
Figure 5A. The prior for the first utility is given by
${\mu}_{0}=0$ and
${\sigma}_{0}^{2}=1$. In
Figure 5B, we show the net utility, equilibrium free-energy differences and dissipated utility (according to Equations (
18) and (
19)) for different values of
$\beta $ and number of steps
k; corresponding to time
$t=k\Delta t$ in Equation (
27) for a given reference
$\Delta t$. In
Figure 5C, we show the convergence of the Jarzynski term towards the true equilibrium free energy difference term depending on the number of trajectories used for the estimation. We can see on the bottom row that the second law for decision-making represented by the inequality (
17) is fulfilled.
5. Discussion
In this paper, we highlighted the similarities between non-equilibrium thermodynamics and bounded rational decision-making in the case of agents that can deliberate before selecting an action and agents that cannot. Additionally, we derived a novel Jarzynski equality and a Crooks fluctuation theorem for decision-making scenarios with deliberation. We have shown how to use Jarzynski’s and Crooks’ equations in different scenarios to extract relevant variables of the decision-making process such as the equilibrium free energy difference, the average dissipated utility and the action-path probabilities for both equilibrium posterior distributions and distributions of the backward-in-time protocol. We have provided a number of examples for the no-deliberation and deliberation scenarios, such as one-step lag dynamics, discrete choice tasks and continuous decision-making tasks that may be applicable both to cognitive and sensorimotor experiments [
57].
In
Section 3, we started out by directly translating physical non-equilibrium concepts to the decision-making domain in the case of decision-makers that cannot deliberate before acting and therefore lag behind changes in the utility landscape. In analogy to physical systems, we assumed that such decision-makers adapt to each utility change even though they are lagging behind, i.e., even after they have already chosen their action and there is no benefit of this adaptation at the current time step other than improving their prior for the next choice. In physical systems, this does not constitute an issue, because there is a continuous adaptation to the energy gradient at every instant independent of how time is discretized. However, in the decision-making scenario, we assumed a single distinguished moment where the action is issued and the utility is evaluated. Therefore: Why should such decision-makers adapt at all after the action has been selected? Following the argument of no-free-lunch theorems, there would be no benefit in adapting to arbitrary changes. Having a closer look at our examples in
Section 3.3, it becomes evident that we implicitly assumed that the utility changes in each step were small, so there is a benefit in adapting the prior for the next trial. Such assumptions are typically made in learning scenarios, for example the i.i.d. assumption for inference problems or assumptions that utility changes in each time step are limited to a finite interval in decision-making problems. However, none of the non-equilibrium relations we discussed necessarily assume small utility changes. It should therefore be noted that, while the discussed non-equilibrium relations hold for arbitrary utility changes, in the context of non-deliberative decision-making, we would have to make additional assumptions such that utility changes in each step are small and can accumulate so that adaptation is beneficial. Importantly, the appropriateness of adaptation is not an issue when we assume a deliberation process where adaptation occurs before emitting an action, as there is a direct benefit of adaptation in the current trial. This is the general decision problem discussed in
Section 4.
While we have considered mainly non-sequential decision-making problems here for simplicity, the same formalism could also be applied to sequential decision-making problems. In that case, one would replace the notion that an action corresponds to a discrete or continuous state
x with the notion that an action might consist of choosing an entire trajectory
${x}_{1:\tau}$. In this case also, the utility
$U({x}_{1:\tau},t)$ would be defined over trajectories, and these utilities would change over episodes
t. Again, one would have to assume that the utility function does not change while the trajectory
${x}_{1:\tau}$ is generated. This corresponds to the fact that we assume that the utility is constant for each single episode
t (cf.
Figure 1), while the deliberative decisionmaker can, as it were, sample the new utility function before emitting an action. An example would be finding a trajectory for a pendulum swingup or a sequence of actions to navigate a maze. A path integral controller [
58] would, for example, produce exactly such trajectories. A deliberative decisionmaker would sample many such trajectories until time is up and one trajectory has to be selected. The utility then changes again, and the path integral controller samples new trajectories whose shape reflects the new utility function. Our assumption that the temporal evolution of the utility function does not depend on the decisionmaker’s action implies that consecutive episodes are independent and can have different utility functions, but the decisionmaker can carry its prior from one episode over to the next.
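The episodic sampling scheme described in this paragraph can be illustrated with a minimal sketch. It is not the path integral controller of [58]: the random-walk prior over trajectories, the quadratic end-point utility, the sampling budget and all function names are hypothetical choices made only for illustration.

```python
import math
import random


def sample_trajectory(rng, horizon, step=1.0):
    """Draw one trajectory x_{1:tau} from a random-walk prior (an assumption)."""
    x, path = 0.0, []
    for _ in range(horizon):
        x += rng.gauss(0.0, step)
        path.append(x)
    return path


def utility(path, target):
    """Hypothetical utility: negative squared distance of the end point to a target."""
    return -(path[-1] - target) ** 2


def deliberate(rng, target, beta, budget, horizon=10):
    """Sample `budget` trajectories, then select one with probability
    proportional to exp(beta * U). The utility is held fixed while sampling."""
    candidates = [sample_trajectory(rng, horizon) for _ in range(budget)]
    utilities = [utility(c, target) for c in candidates]
    u_max = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (u - u_max)) for u in utilities]
    chosen = rng.choices(range(budget), weights=weights, k=1)[0]
    return candidates[chosen], utilities[chosen], utilities


rng = random.Random(0)
# The utility (here: the target) changes between episodes, not within one.
for episode, target in enumerate([2.0, -3.0]):
    path, u, _ = deliberate(rng, target, beta=5.0, budget=200)
    print(f"episode {episode}: end point {path[-1]:.2f}, utility {u:.3f}")
```

A larger `budget` corresponds to a longer deliberation time before a trajectory has to be selected. For brevity, the sketch does not carry the prior over from one episode to the next.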
Recently, there has been a renewed interest in modelling decisionmaking with computational constraints [
59,
60] both in the computer science and the neuroscience literature, where there is growing evidence that the human brain might exploit sampling [
22,
61,
62,
63,
64,
65] for approximate inference and decisionmaking [
66,
67]. Such sampling models have been used for example to explain anchoring biases in choice tasks, because MCMC has finite mixing times and therefore exhibits a dependence on the prior distribution [
68,
69]. In particular, the idea of using the (expected) relative entropy or the mutual information as a computational cost has been suggested several times in the literature [
2,
3,
23,
33,
70,
71,
72]. In [
33] and similarly in [
20], the authors derive the relative entropy as a control cost from an informationtheoretic point of view, under axioms of monotonicity and invariance under relabelling and decomposition. In other fields such as robotics, the relative entropy has also been used as a control cost [
18,
21,
25,
58,
73,
74] to regularize the behaviour of the controller by penalizing controls that are far from the uncontrolled dynamics of the system or to deal with model uncertainty [
75]. Naturally, questions regarding the generality of entropic costs as informationprocessing costs and their potential relation to algorithmic spacetime resource constraints carry over to the nonequilibrium scenario and remain a topic for future investigations.
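The role of the relative entropy as a computational cost can be illustrated numerically. The sketch below assumes a discrete action set, a prior `prior`, an inverse temperature `beta` and a free energy functional of the form F[p] = E_p[U] - (1/beta) KL(p || prior). The names, the numerical values and the sign convention are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """Relative entropy KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def free_energy(p, prior, utility, beta):
    """Expected utility minus the relative entropy cost scaled by 1/beta."""
    expected_utility = sum(pi * u for pi, u in zip(p, utility))
    return expected_utility - kl_divergence(p, prior) / beta


def equilibrium(prior, utility, beta):
    """Maximizer of the free energy: p(x) proportional to prior(x) * exp(beta * U(x))."""
    weights = [q * math.exp(beta * u) for q, u in zip(prior, utility)]
    z = sum(weights)
    return [w / z for w in weights]


prior = [0.25, 0.25, 0.25, 0.25]
utility = [1.0, 0.5, 0.0, -0.5]
beta = 2.0

p_star = equilibrium(prior, utility, beta)
f_star = free_energy(p_star, prior, utility, beta)

# At the optimum, the free energy equals (1/beta) * log of the partition sum.
log_partition = math.log(
    sum(q * math.exp(beta * u) for q, u in zip(prior, utility))
) / beta
print(f_star, log_partition)
```

Under these assumptions, any other distribution, including the prior itself, attains a lower free energy than `p_star`. A small `beta` keeps the optimum close to the prior, while a large `beta` concentrates it on the action with the highest utility.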
So far, only very few studies have established connections between nonequilibrium thermodynamics and decisionmaking in the literature, even though nonequilibrium analysis might provide a promising way to relate mechanistic dynamical models to conceptually simpler utilitybased models that are often employed as normative models. Jarzynskilike and Crookslike relations have been noted in the economics literature in gambling scenarios [
76] and when studying the arrow of time for decisionmaking [
77,
78]. We reported preliminary results for the onestep delayed decisionmaking in [
79,
80]. In the machine learning literature, generalized fluctuation theorems have recently been used in [
81] to train artificial neural networks with efficient exploration. In general, fluctuation theorems and Jarzynski equalities allow one to estimate free energy differences, which are very important in decisionmaking because the free energy directly relates to the value function, which is a central concept in control and reinforcement learning. Fluctuation theorems typically make the assumption that the temperature parameter is constant (isothermal transformations) and that initial states are in equilibrium. In our paper, we also made these assumptions, which may limit the generality of our results. Loosening these restrictions (cf. for example [
82,
83]) might be an important next step for future investigations of nonequilibrium relations in the decisionmaking context.
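How such a relation yields a free energy difference can be sketched for the simplest case, an instantaneous switch from a utility `u0` to a utility `u1`. In this limit a Jarzynski-type equality reduces to the free energy perturbation identity. The sketch assumes a uniform reference distribution, a constant `beta`, initial states drawn from the equilibrium of `u0`, and the utility-based sign convention F = (1/beta) log sum_x exp(beta U(x)). All names and numerical values are hypothetical.

```python
import math
import random


def boltzmann(utility, beta):
    """Equilibrium distribution p(x) proportional to exp(beta * U(x))."""
    weights = [math.exp(beta * u) for u in utility]
    z = sum(weights)
    return [w / z for w in weights]


def exact_delta_f(u0, u1, beta):
    """Free energy difference computed directly from the partition sums."""
    log_z0 = math.log(sum(math.exp(beta * u) for u in u0))
    log_z1 = math.log(sum(math.exp(beta * u) for u in u1))
    return (log_z1 - log_z0) / beta


def jarzynski_estimate(u0, u1, beta, num_samples, seed=0):
    """Estimate the free energy difference from equilibrium samples of u0,
    using <exp(beta * (U1(x) - U0(x)))> = exp(beta * delta_F)."""
    rng = random.Random(seed)
    p0 = boltzmann(u0, beta)
    states = rng.choices(range(len(u0)), weights=p0, k=num_samples)
    mean_exp = sum(math.exp(beta * (u1[x] - u0[x])) for x in states) / num_samples
    return math.log(mean_exp) / beta


def mean_utility_gain(u0, u1, beta):
    """Average utility change <U1(x) - U0(x)> under the equilibrium of u0."""
    p0 = boltzmann(u0, beta)
    return sum(p * (b - a) for p, a, b in zip(p0, u0, u1))


u0 = [0.0, 0.5, 1.0]
u1 = [2.0, 0.0, 0.5]
beta = 1.0

print("exact   :", exact_delta_f(u0, u1, beta))
print("estimate:", jarzynski_estimate(u0, u1, beta, num_samples=20000))
print("mean utility gain:", mean_utility_gain(u0, u1, beta))
```

By Jensen's inequality, the mean utility gain never exceeds the free energy difference in this convention. The gap between the two is the counterpart of dissipated work.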
Regarding the connection between predictive power and dissipation, [
24] found that nonpredictive systems are also highly dissipative. In [
24], the authors consider the effects of a stochastic driving signal
x mediated by an energy function
$E(x,s)$ on the state
s of a Markov system with fixed transition probability
$p({s}^{\prime}|s,x)$. They regard the Markov system as a computing device and study how much information the state
s carries about the driving signal
x. They find a fundamental relationship between dissipation (energy efficiency) and lack of predictive power. Their results concern nonequilibrium trajectories when
x changes at every time point. The intuition is that when a system naturally moves in the direction of a changing energy landscape, then this is not only more efficient energetically, but it can also be interpreted in the sense that the system predicts the changing energy landscape. Once the system equilibrates, the energy landscape (i.e., the external variable x) does not change any more, and the mutual information between state and external variable x vanishes, as does the dissipation. Therefore, the equilibrium state is of no particular interest in this analysis. If one were to apply this framework to a decisionmaker, the decisionmaker would be represented by the system with the state
s, and the driving signal
x would be the input provided to the decisionmaker. One important difference between [
24] and our formulation is that in [
24], the driving signal
x is stochastic and is sampled from a stationary probability distribution, whereas in our formulation, we assume a fixed deterministic driving signal (the sequence of utility functions) without an underlying probability distribution. Assuming such a fixed input does prohibit an analysis in terms of mutual information between
s and
x. Nevertheless, it would be straightforward to allow for stochastic changes in the utility function also in our formulation, and the results of [
24] would be applicable and complementary. While in [
24], the equilibrium is of no particular interest, in our analysis, we are interested in the approach to equilibrium and in the resources spent on the way, that is, the time spent deliberating. During deliberation, the environment is assumed to be roughly constant, i.e., it does not change much on the short time scale of deliberation. The environment then changes again, and the decisionmaker can adapt to this change by deliberating (in contrast, in [
24], the decisionmaker follows a fixed dynamics and does not adapt).
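The approach to equilibrium during deliberation, followed by a change of the environment, can be sketched as below. The sketch assumes Metropolis dynamics with uniform proposals on a small discrete state space and propagates the distribution exactly instead of sampling. These dynamics, the utilities and the number of steps are illustrative assumptions.

```python
import math


def boltzmann(utility, beta):
    """Equilibrium distribution p(x) proportional to exp(beta * U(x))."""
    weights = [math.exp(beta * u) for u in utility]
    z = sum(weights)
    return [w / z for w in weights]


def metropolis_matrix(utility, beta):
    """Transition matrix of a Metropolis chain with uniform proposals."""
    n = len(utility)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                accept = min(1.0, math.exp(beta * (utility[j] - utility[i])))
                matrix[i][j] = accept / n
        matrix[i][i] = 1.0 - sum(matrix[i])  # probability of staying in state i
    return matrix


def kl_divergence(p, q):
    """Relative entropy KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def adapt(p, utility, beta, num_steps):
    """Run `num_steps` deliberation steps under a fixed utility and record
    the remaining distance to equilibrium after each step."""
    matrix = metropolis_matrix(utility, beta)
    target = boltzmann(utility, beta)
    gaps = [kl_divergence(p, target)]
    n = len(p)
    for _ in range(num_steps):
        p = [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        gaps.append(kl_divergence(p, target))
    return p, gaps


beta = 1.0
p = [0.25, 0.25, 0.25, 0.25]
episodes = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
history = []
for episode_utility in episodes:
    # The utility stays constant during deliberation and changes afterwards.
    p, gaps = adapt(p, episode_utility, beta, num_steps=5)
    history.append(gaps)
    print([round(g, 4) for g in gaps])
```

Within each episode, the relative entropy to the current equilibrium cannot increase from one step to the next, because the equilibrium is stationary under the chain. When the utility changes, the distribution reached at the end of the previous episode serves as the starting point for the next adaptation.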
In conclusion, the results presented here bring the fields of stochastic thermodynamics and decisionmaking closer together by studying decisionmaking systems as statistical systems, just as in thermodynamics. In this analogy, the energy function in physics corresponds to the utility function in decisionmaking. Importantly, the statistical ensembles of both decisions and physical states can be conceptualized as nonequilibrium ensembles that reach equilibrium after a finite time adaptation process.