Information-Theoretic Bounded Rationality and ε-Optimality

Bounded rationality concerns the study of decision makers with limited information processing resources. Previously, the free energy difference functional has been suggested to model bounded rational decision making, as it provides a natural trade-off between an energy or utility function that is to be optimized and information processing costs that are measured by entropic search costs. The main question of this article is how the information-theoretic free energy model relates to simple ε-optimality models of bounded rational decision making, where the decision maker is satisfied with any action in an ε-neighborhood of the optimal utility. We find that the stochastic policies that optimize the free energy trade-off comply with the notion of ε-optimality. Moreover, this optimality criterion even holds when the environment is adversarial. We conclude that the study of bounded rationality based on ε-optimality criteria that abstract away from the particulars of the information processing constraints is compatible with the information-theoretic free energy model of bounded rationality.


Introduction
Decision making under uncertainty is studied by means of optimal actor models in a broad spectrum of sciences with remarkably different historical roots, like economics, artificial intelligence research, biology, sociology, and even fields such as legal studies, ethics and philosophy [1][2][3]. Usually, when we talk about decision making, we imagine a human mind (for example, a chess player) that ponders a variety of possible options for action, deliberates about their potential outcomes and finally picks one of these actions for execution; namely, the one that is expected to have the most beneficial consequences. Recently, the same paradigm has also been extended to model sensorimotor integration and control [4][5][6], where consequences of actions can be anticipated by implicit learning processes. Crucially, however, in either case, classic decision-theoretic models [3,7] ignore the details of the underlying cognitive or implicit processes preceding a decision by simply assuming that these processes optimize a performance criterion. This ignorance is both boon and bane: on the one hand, it allows the statement of many general results that do not depend on the details of the decision making process; on the other hand, the often unrealistic assumption of perfect optimization limits the applicability of classic decision theory.
Classic decision theory rests on two conceptual pillars: the notion of probability and the notion of utility. Their intertwined occurrence may be best understood on the basis of the concept of lotteries. A lottery is defined as a set of $N$ different outcomes $o_j \in O$, each of which can occur with a respective probability $P(o_j)$, where $j = 1, \ldots, N$. We can imagine a lottery as a roulette wheel or a gamble where we obtain a prize $o_j$ with probability $P(o_j)$ that has a subjective utility $U(o_j)$ for the decision maker. The compound value of the lottery can then be determined by the expected utility $E[U] = \sum_j P(o_j) U(o_j)$, which is commonly used as the standard performance criterion in decision making. The concept of expected utility was first axiomatized by von Neumann and Morgenstern [8]. In their axiomatic system, von Neumann and Morgenstern [8] define a binary preference relation $\succeq$ over the set of probability distributions $\mathcal{P}$ defined over the set of outcomes $O$. If (and only if) this binary relation satisfies the axioms of completeness, transitivity, continuity and independence, then there exists a function $U : O \to \mathbb{R}$, such that:
$$P \succeq P' \iff \sum_j P(o_j)\, U(o_j) \ge \sum_j P'(o_j)\, U(o_j),$$
where $P, P' \in \mathcal{P}$. This utility function $U$ is unique up to a positive affine transform.
When designing optimal actors, most designers use the von Neumann and Morgenstern [8] conception of probability and utility; see for example Russell and Norvig [2]. Such optimal actors are typically equipped with a probabilistic model of the world $P(o_j \mid a_i)$, where $a_i \in A$ is an action that leads to consequence $o_j$ with probability $P(o_j \mid a_i)$. The decision maker can assess the expected utility of each action as $E[U \mid a_i] = \sum_j P(o_j \mid a_i)\, U(o_j)$. Thus, the probabilistic model of the world defines a set of $M$ different lotteries indexed by $a_i$, where $i = 1, \ldots, M$. The decision maker can compare the expected utilities of all the lotteries and choose the one with the highest expected utility, such that:
$$a_{\max} = \arg\max_{a_i} E[U \mid a_i]. \tag{1}$$
However, this procedure rests on at least two important assumptions. First, the decision maker requires an accurate probability model. Second, the decision maker requires enough computational resources to find the best lottery. What happens if one of the two assumptions is violated? This question has spurred research on bounded rationality, where decision makers have limited knowledge and bounded computational resources.
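The lottery comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the original model description: the outcome probabilities, the utilities and the helper names `expected_utilities` and `best_action` are all hypothetical.

```python
def expected_utilities(p_outcome_given_action, utility):
    """Return E[U|a_i] = sum_j P(o_j|a_i) U(o_j) for every action a_i."""
    return [sum(p * u for p, u in zip(row, utility))
            for row in p_outcome_given_action]


def best_action(values):
    """Return the index of the action with the highest (expected) utility."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical world model: three actions (rows), three outcomes (columns).
P = [[0.8, 0.2, 0.0],
     [0.1, 0.6, 0.3],
     [0.5, 0.3, 0.2]]
U = [0.0, 1.0, 3.0]  # hypothetical utilities of the three outcomes

V = expected_utilities(P, U)
a_max = best_action(V)
```

Note that `best_action` inspects every lottery; this exhaustive comparison is exactly the step that a decision maker with bounded computational resources may not be able to afford.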
The modern study of bounded rationality began with Herbert Simon [9][10][11] and has since been continued in economics [12][13][14], game theory [15][16][17], industrial organization [18] and political science [19], but also in psychology [20,21], cognitive science [22][23][24], computer science and artificial intelligence research [25][26][27]. One of the fundamental questions faced by bounded rationality models is whether they should attend to the actual physical or cognitive processes underlying decision making or whether it is also possible to gain a more general understanding of bounded rational decision making by abstracting away from the details of the actual decision making process. While the first approach is taken, for example, by the new field of neuroeconomics relating decision making processes to anatomical structures [28,29], one of the simplest approaches in the second tradition is the concept of ε-optimality [30], where the decision maker does not search for a single best action $a_{\max}$, but for any action from a set of permissible actions $A_\varepsilon$ whose expected utility deviates at most by $\varepsilon > 0$ from the optimal expected utility of $a_{\max}$, such that:
$$E[U \mid a_{\max}] - E[U \mid a_i] \le \varepsilon \quad \text{for all } a_i \in A_\varepsilon.$$
The main question of this article is how to relate this simple model of bounded rationality to the information-theoretic bounded rationality model discussed in Ortega and Braun [31][32][33][34] that we recapitulate in the next section.
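As a small illustration of the ε-optimality criterion, the following Python sketch collects every action whose expected utility lies within ε of the best one. The function name `epsilon_optimal_set` and the example values are assumptions made for illustration only.

```python
def epsilon_optimal_set(values, eps):
    """Indices of all actions whose value is within eps of the maximum."""
    v_max = max(values)
    return [i for i, v in enumerate(values) if v_max - v <= eps]


# Hypothetical expected utilities of three actions.
values = [0.2, 1.5, 0.9]

loose = epsilon_optimal_set(values, eps=0.7)   # tolerant decision maker
strict = epsilon_optimal_set(values, eps=0.1)  # nearly perfect decision maker
```

Any action in the returned set is acceptable to an ε-optimal decision maker; the criterion itself says nothing about how such an action is found.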

Methods
Most models of decision making ignore information processing costs and assume that the decision maker can simply handpick the action that yields the highest (expected) utility. Presupposing that there is a unique maximum, this would correspond to a deterministic strategy as in Equation (1). In general, however, a decision maker with limited information processing capabilities might be unable to handpick the best option with certainty. Such a bounded rational strategy must therefore be described by a probability distribution $P(a_i)$ reflecting this uncertainty. Information-theoretic models of bounded rational decision making quantify the cost of information processing by entropic measures of information [15][16][17][31][32][33][34][35] and are closely related to softmax-choice rules that have been extensively studied in the psychological and econometric literature, but also in the literature on reinforcement learning and game theory [36][37][38][39][40][41][42]. In [31][32][33][34], Ortega and Braun discuss an information-theoretic model of bounded rational decision making where information processing costs are quantified by the relative entropy, with the idea that information processing costs can then be measured with respect to changes in the choice strategy $P(a_i)$.
Let us assume that the initial strategy of the decision maker can be described by a probability distribution $P_0(a_i)$. This could include the uniform distribution over $a_i$ as a special case, if the decision maker has no prior preferences between different actions. Next, this decision maker is exposed to a utility function $V(a_i)$, which includes the case of $V(a_i) = E[U \mid a_i]$, implying that the decision maker does not have to compute the expectation values, but the expectation values are simply given. Ideally, the decision maker will arrive at the new distribution $P(a_i) = \delta_{a_i, a_{\max}}$. The underlying computation can be imagined as a search process that reduces the uncertainty over the action by the relative entropy between the new and the initial strategy:
$$\sum_i P(a_i) \log \frac{P(a_i)}{P_0(a_i)} = -\log P_0(a_{\max}).$$
In general, such a search is costly, and the decision maker might not be able to afford such a stark reduction in uncertainty. Assuming a price $1/\alpha$ for 1 bit of information gain, we can then design a bounded optimal decision maker that trades off gains in utility resulting from changes in $P(a_i)$ against the search costs that these changes imply, such that, overall, the decision maker optimizes a free energy difference in utility gains and information costs:
$$\Delta F[\tilde{P}] = \sum_i \tilde{P}(a_i)\, V(a_i) - \frac{1}{\alpha} \sum_i \tilde{P}(a_i) \log \frac{\tilde{P}(a_i)}{P_0(a_i)}, \tag{3}$$
where the maximizing distribution $P = \arg\max_{\tilde{P}} \Delta F[\tilde{P}]$ is the equilibrium distribution:
$$P(a_i) = \frac{1}{Z} P_0(a_i)\, e^{\alpha V(a_i)}, \tag{4}$$
where $Z = \sum_i P_0(a_i)\, e^{\alpha V(a_i)}$, and represents the choice probabilities after deliberation. Note that the free energy difference $\Delta F[\tilde{P}]$ can be expressed as $\Delta F[\tilde{P}] = F_1[\tilde{P}] - F_0$, with the free energies:
$$F_1[\tilde{P}] = \sum_i \tilde{P}(a_i)\, \Phi_1(a_i) - \frac{1}{\alpha} \sum_i \tilde{P}(a_i) \log \tilde{P}(a_i), \qquad F_0 = \sum_i P_0(a_i)\, \Phi_0(a_i) - \frac{1}{\alpha} \sum_i P_0(a_i) \log P_0(a_i),$$
where the prior is the equilibrium distribution $P_0(a_i) \propto e^{\alpha \Phi_0(a_i)}$ and $V(a_i) = \Phi_1(a_i) - \Phi_0(a_i)$. Hence, the utility function $V(a_i)$ expresses changes in value $\Phi$, that is, gains or losses with respect to the status quo. In the case of inference, the utility function is given by a negative log-likelihood and measures informational surprise.
The temperature parameter then corresponds to a precision parameter in exponential family distributions.
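The trade-off between utility gains and information costs can be made concrete with a short Python sketch. It assumes that the free energy difference has the form "expected utility minus $1/\alpha$ times the relative entropy to the prior" and that its maximizer is the softmax-like distribution $P(a_i) \propto P_0(a_i) e^{\alpha V(a_i)}$ quoted in the proof of Theorem 1. The function names and the numbers are hypothetical, and natural logarithms are used throughout.

```python
import math


def equilibrium(prior, values, alpha):
    """Return P(a_i) proportional to P0(a_i) exp(alpha V(a_i)) and log Z.

    The largest value is subtracted inside the exponent so that large alpha
    does not overflow; the shift is added back to log Z afterwards.
    """
    v_ref = max(values)
    weights = [p * math.exp(alpha * (v - v_ref)) for p, v in zip(prior, values)]
    z_shifted = sum(weights)
    log_z = alpha * v_ref + math.log(z_shifted)
    return [w / z_shifted for w in weights], log_z


def free_energy_difference(policy, prior, values, alpha):
    """Expected utility minus (1/alpha) times the relative entropy to the prior."""
    gain = sum(q * v for q, v in zip(policy, values))
    cost = sum(q * math.log(q / p) for q, p in zip(policy, prior) if q > 0)
    return gain - cost / alpha


prior = [1 / 3, 1 / 3, 1 / 3]  # hypothetical uniform initial strategy
values = [0.2, 1.5, 0.9]       # hypothetical utilities V(a_i)
alpha = 2.0                    # hypothetical rationality parameter

p_star, log_z = equilibrium(prior, values, alpha)
dF_star = free_energy_difference(p_star, prior, values, alpha)
dF_prior = free_energy_difference(prior, prior, values, alpha)
dF_greedy = free_energy_difference([0.0, 1.0, 0.0], prior, values, alpha)
```

In this example, neither staying with the prior nor jumping to the deterministic best action attains the free energy of the equilibrium distribution, whose value equals $\frac{1}{\alpha} \log Z$.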
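Casting the problem of acting as an inference problem has been previously discussed in [43][44][45][46][47][48]. The certainty-equivalent value $V_{CE}$ under strategy $P$ can be determined from the same variational principle:
$$V_{CE} = \max_{\tilde{P}} \Delta F[\tilde{P}] = \frac{1}{\alpha} \log Z = \frac{1}{\alpha} \log \sum_i P_0(a_i)\, e^{\alpha V(a_i)}.$$
For the two different limits of $\alpha$, the value and the equilibrium distribution take the asymptotic forms:
$$\alpha \to \infty: \quad V_{CE} \to \max_i V(a_i), \quad P(a_i) \to \delta_{a_i, a_{\max}} \ \text{(unique maximum)},$$
$$\alpha \to 0: \quad V_{CE} \to \sum_i P_0(a_i)\, V(a_i), \quad P(a_i) \to P_0(a_i).$$
It can be seen that a perfectly rational agent with $\alpha \to \infty$ is able to handpick the optimal action, which is a deterministic policy in the case of a unique optimum, whereas finitely rational agents have stochastic policies with a non-zero probability of picking a sub-optimal action.
In the case that the $V(a_i)$ are not simply given, the decision maker has to compute the expectation values herself from the prior $P_0(o_j \mid a_i)$ and the utility $U(o_j)$, such that search costs have to be considered both for $a_i$ and $o_j$. The variational problem can then be formulated as a nested expression [32,34,49]:
$$\max_{\tilde{P}} \sum_i \tilde{P}(a_i) \left[ -\frac{1}{\alpha} \log \frac{\tilde{P}(a_i)}{P_0(a_i)} + \sum_j \tilde{P}(o_j \mid a_i) \left[ U(o_j) - \frac{1}{\beta} \log \frac{\tilde{P}(o_j \mid a_i)}{P_0(o_j \mid a_i)} \right] \right]. \tag{6}$$
If we assume that the estimation of the expected utilities $V(a_i)$ is much cheaper than the calculation of the optimal action, then the rationality parameter $\beta$ should be much smaller than $\alpha$, such that $\alpha \gg \beta$, implying that we can simply obtain samples from $P_0(o_j \mid a_i)$ for our computation of the expectation, but that it is much more difficult to compute $a_i$, because we cannot simply rely on our prior $P_0(a_i)$. The two-part solution to the nested variational problem is given by:
$$P(o_j \mid a_i) = \frac{1}{Z_\beta(a_i)} P_0(o_j \mid a_i)\, e^{\beta U(o_j)}, \tag{7}$$
with the normalization constant $Z_\beta(a_i) = \sum_j P_0(o_j \mid a_i) \exp(\beta U(o_j))$, and:
$$P(a_i) = \frac{1}{Z_{\alpha\beta}} P_0(a_i) \exp\left( \frac{\alpha}{\beta} \log Z_\beta(a_i) \right),$$
with the normalization constant $Z_{\alpha\beta} = \sum_i P_0(a_i) \exp\left( \frac{\alpha}{\beta} \log Z_\beta(a_i) \right)$. The perfectly rational decision maker is obtained in the limit $\alpha \to \infty$ and $\beta \to 0$, that is:
$$\lim_{\alpha \to \infty} \lim_{\beta \to 0} P(a_i) = \delta_{a_i, a_{\max}}, \qquad a_{\max} = \arg\max_{a_i} \sum_j P_0(o_j \mid a_i)\, U(o_j).$$
The computational complexity of the information-theoretic model of bounded rational decision making can also be interpreted in terms of a sampling complexity [50,51]. In particular, Equation (4) can be interpreted under a rejection sampling scheme where we want to obtain samples from $P(a_i)$, but we are only able to sample from the distribution $P_0(a_i)$. In this scheme, we generate a sample $a_i \sim P_0(a_i)$ and then accept the sample if:
$$u \le \frac{e^{\alpha V(a_i)}}{e^{\alpha T}},$$
where $u$ is drawn from the uniform distribution $\mathcal{U}[0; 1]$ and $T$ is the acceptance target value with $T \ge \max_i V(a_i)$. Otherwise, the sample is rejected. The efficiency of the sampling process depends on how many samples we will need on average from $P_0$ to obtain one sample from $P$. This average number of samples from $P_0$ needed for one sample of $P$ is given by the mean of a geometric distribution:
$$\overline{\#\mathrm{Samples}} = \frac{1}{\sum_i P_0(a_i)\, e^{\alpha (V(a_i) - T)}} = \frac{e^{\alpha T}}{Z}.$$
It is important to note that the average number of samples increases exponentially with the rationality parameter, such that for large $\alpha$:
$$\overline{\#\mathrm{Samples}} \approx \frac{1}{P_0(a_{\max})}\, e^{\alpha (T - V(a_{\max}))},$$
where $a_{\max} = \arg\max_i V(a_i)$ and $T > \max_i V(a_i)$.
This interpretation in terms of sampling complexity can also be extended to Equation (6), where the decision maker has to estimate the expected utilities from samples. In line with the acceptance rule above, we should accept a sample $a_i \sim P_0(a_i)$ if it fulfils the criterion:
$$u \le \frac{\exp\left( \frac{\alpha}{\beta} \log Z_\beta(a_i) \right)}{e^{\alpha T}} = \left( \frac{Z_\beta(a_i)}{e^{\beta T}} \right)^{\alpha/\beta},$$
where $u \sim \mathcal{U}[0; 1]$ and $T \ge \frac{1}{\beta} \log Z_\beta(a_i)$. By the same argument as for the geometric distribution above, the ratio $Z_\beta(a_i)/e^{\beta T}$ can be interpreted as an acceptance probability; in this case, the acceptance probability of $o_j \sim P_0(o_j \mid a_i)$. Thus, in order to accept one sample of $a_i$, we need to accept $\alpha/\beta$ consecutive samples of $o_j$, with acceptance criterion:
$$u \le \frac{e^{\beta U(o_j)}}{e^{\beta T}},$$
with $u \sim \mathcal{U}[0; 1]$ and $T$ as set above.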
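The rejection sampling interpretation for given utilities $V(a_i)$ can be sketched as follows. The sketch assumes the acceptance rule $u \le e^{\alpha (V(a_i) - T)}$ with $T \ge \max_i V(a_i)$, which is the standard way of turning samples from $P_0$ into samples from $P(a_i) \propto P_0(a_i) e^{\alpha V(a_i)}$; the function name `sample_bounded_rational`, the seed and all numbers are hypothetical.

```python
import math
import random


def sample_bounded_rational(prior, values, alpha, target, rng):
    """Draw one action by rejection sampling; also return the number of proposals."""
    n_draws = 0
    actions = range(len(prior))
    while True:
        n_draws += 1
        a = rng.choices(actions, weights=prior)[0]  # proposal a ~ P0
        if rng.random() <= math.exp(alpha * (values[a] - target)):
            return a, n_draws                        # accepted


prior = [1 / 3, 1 / 3, 1 / 3]
values = [0.2, 1.5, 0.9]
alpha = 2.0
target = max(values)  # acceptance target T >= max_i V(a_i)

rng = random.Random(1)
n_samples = 20000
counts = [0] * len(prior)
total_draws = 0
for _ in range(n_samples):
    a, n = sample_bounded_rational(prior, values, alpha, target, rng)
    counts[a] += 1
    total_draws += n

freqs = [c / n_samples for c in counts]
mean_draws = total_draws / n_samples

# Reference quantities: equilibrium distribution and mean of the geometric law.
accept = [p * math.exp(alpha * (v - target)) for p, v in zip(prior, values)]
p_accept = sum(accept)
p_star = [w / p_accept for w in accept]
expected_draws = 1.0 / p_accept
```

The empirical frequencies `freqs` should approach `p_star`, and `mean_draws` should approach `expected_draws`. Raising `alpha` or `target` lowers the acceptance probability and therefore raises the number of proposals needed per accepted action.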

Results
Here, we investigate the question of how close a bounded rational decision maker gets to the optimal (expected) utility achieved by the perfectly rational decision maker. Since we assume that the strategy of a bounded rational decision maker is inherently stochastic and can be described by a probability distribution according to Equation (4), we can only compare some statistical measure of the performance of the bounded rational decision maker to the performance of the perfectly rational decision maker. In the following, we will consider the expected performance.
Theorem 1 (ε-Optimality). Given a bounded rational decision maker with information cost $1/\alpha$ that optimizes (3), one can bound the expected performance of this decision maker from below within an ε-neighborhood of the optimal performance $V_{\max} = \max_i E[U \mid a_i]$ of the perfectly rational decision maker, such that:
$$\sum_i P(a_i)\, V(a_i) \ge V_{\max} - \frac{1}{\alpha} \log \frac{1}{P_0(a_{\max})}.$$
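Proof. The certainty-equivalent value $V_{CE}$ under the bounded rational strategy $P(a_i)$ is given by:
$$V_{CE} = \frac{1}{\alpha} \log Z = \sum_i P(a_i)\, V(a_i) - \frac{1}{\alpha} \sum_i P(a_i) \log \frac{P(a_i)}{P_0(a_i)},$$
where $P(a_i) = \frac{1}{Z} P_0(a_i)\, e^{\alpha V(a_i)}$. From the positiveness of the Kullback-Leibler divergence, it follows that:
$$\sum_i P(a_i)\, V(a_i) \ge \frac{1}{\alpha} \log \sum_i P_0(a_i)\, e^{\alpha V(a_i)} \ge \frac{1}{\alpha} \log \left( P_0(a_{\max})\, e^{\alpha V_{\max}} \right) = V_{\max} - \frac{1}{\alpha} \log \frac{1}{P_0(a_{\max})},$$
where the second inequality keeps only the summand belonging to $a_{\max}$.
As a corollary, we can conclude for the special case of a uniform prior $P_0(a_i) = 1/M$ that the ε-bound is given by $\varepsilon = \frac{1}{\alpha} \log M$. Conversely, given an $\varepsilon > 0$, there exists an $\bar{\alpha} = \frac{\log M}{\varepsilon}$, such that for $\alpha \ge \bar{\alpha}$, the expected utility of the decision maker lies within ε of the optimum.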
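Theorem 1 can be checked numerically. The following Python sketch assumes the bound in the form $\sum_i P(a_i) V(a_i) \ge V_{\max} - \frac{1}{\alpha} \log \frac{1}{P_0(a_{\max})}$, which reduces to $\varepsilon = \frac{1}{\alpha} \log M$ for a uniform prior; the randomly generated priors and utilities as well as the helper names are hypothetical.

```python
import math
import random


def equilibrium(prior, values, alpha):
    """P(a_i) proportional to P0(a_i) exp(alpha V(a_i)), computed stably."""
    v_ref = max(values)
    weights = [p * math.exp(alpha * (v - v_ref)) for p, v in zip(prior, values)]
    z = sum(weights)
    return [w / z for w in weights]


def theorem1_gap(prior, values, alpha):
    """Return (actual expected loss, bound log(1/P0(a_max)) / alpha)."""
    policy = equilibrium(prior, values, alpha)
    v_max = max(values)
    i_max = values.index(v_max)
    expected = sum(q * v for q, v in zip(policy, values))
    return v_max - expected, math.log(1.0 / prior[i_max]) / alpha


def alpha_for_epsilon(n_actions, eps):
    """Smallest alpha that guarantees an eps-bound under a uniform prior."""
    return math.log(n_actions) / eps


rng = random.Random(0)
checks = []
for _ in range(1000):
    m = rng.randint(2, 8)
    raw = [rng.random() + 1e-3 for _ in range(m)]
    total = sum(raw)
    prior = [r / total for r in raw]
    values = [rng.uniform(-5.0, 5.0) for _ in range(m)]
    alpha = rng.uniform(0.1, 10.0)
    loss, bound = theorem1_gap(prior, values, alpha)
    checks.append(loss <= bound + 1e-12)
all_hold = all(checks)

# Converse direction: pick alpha from a desired epsilon (uniform prior, M = 4).
eps = 0.1
alpha_bar = alpha_for_epsilon(4, eps)
loss_uniform, bound_uniform = theorem1_gap([0.25] * 4, [0.0, 0.3, 0.6, 1.0], alpha_bar)
```

The bound is typically loose: it only uses the summand of $a_{\max}$ in the partition function, so the actual loss `loss_uniform` is usually much smaller than `eps`.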
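In the case of (6), the bounded rational decision maker has to determine the expected utilities by sampling, and the above lower bound cannot be guaranteed anymore. Instead of the expected utilities $V(a_i) = E[U \mid a_i]$, such a decision maker optimizes the "distorted" certainty-equivalent value:
$$\frac{1}{\beta} \log Z_\beta(a_i) = \frac{1}{\beta} \log \sum_j P_0(o_j \mid a_i)\, e^{\beta U(o_j)},$$
with $Z_\beta(a_i)$ from Equation (7). Only for $\beta \to 0$ is the expectation value recovered, $\lim_{\beta \to 0} \frac{1}{\beta} \log Z_\beta(a_i) = E[U \mid a_i]$; a decision maker with positive $\beta$ will overestimate the certainty-equivalent value for sub-optimal actions $a_i$. For small $\beta \ll 1$, the certainty-equivalent value can be approximated by a Taylor expansion in $\beta$:
$$\frac{1}{\beta} \log Z_\beta(a_i) = E[U \mid a_i] + \frac{\beta}{2} \mathrm{Var}[U \mid a_i] + O(\beta^2),$$
where $O(\beta^2)$ are higher-order cumulants that can be neglected. Due to Theorem 1, we have:
$$\sum_i P(a_i)\, \frac{1}{\beta} \log Z_\beta(a_i) \ge \frac{1}{\beta} \log Z_\beta(a_{\max}) - \frac{1}{\alpha} \log \frac{1}{P_0(a_{\max})},$$
from which we can conclude for the limit $\beta \ll 1$ and $\alpha \gg \beta$ that:
$$\sum_i P(a_i)\, E[U \mid a_i] \ge V_{\max} - \frac{1}{\alpha} \log \frac{1}{P_0(a_{\max})} - \frac{\beta}{2} \sum_i P(a_i)\, \mathrm{Var}[U \mid a_i] + O(\beta^2).$$
For such a bounded rational decision maker, the error bound is increased by higher-order cumulants.
If all of the (expected) utilities $V(a_i)$ are very similar in magnitude, it requires a high rationality parameter $\alpha$ to differentiate between them. A tighter ε-bound in $\alpha$ can be given if we assume that there is an interval $V(a_i) \in [V_{\min}; V_{\max}]$ and that all the utilities are discriminable by at least one "utile", such that for any two different choices $a_i$ and $a_k$, we have $|V(a_i) - V(a_k)| \ge 1$, which is the case, for example, when utilities reflect rank.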
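Theorem 2 (ε-Optimality for rank utilities). Given a bounded rational decision maker with information cost $1/\alpha$ that optimizes Equation (3) and assuming a uniform prior $P_0(a_i) = 1/M$, bounded (expected) utilities $V(a_i) \in [V_{\min}; V_{\max}]$ for all $i$ and $|V(a_i) - V(a_k)| \ge 1$ for every pair $(i, k)$ with $i \ne k$, one can bound the expected performance of this decision maker from below within an ε-neighborhood of the optimal performance $V_{\max} = \max_i E[U \mid a_i]$ of the perfectly rational decision maker, such that:
$$\sum_i P(a_i)\, V(a_i) \ge V_{\max} - (V_{\max} - V_{\min})\, e^{-\alpha}.$$
Proof. We express the choice probability $P(a_i)$ derived from Equation (4) under the uniform prior $P_0(a_i) = 1/M$ as:
$$P(a_i) = \frac{e^{\alpha V(a_i)}}{\sum_k e^{\alpha V(a_k)}} = \frac{\delta^{V_{\max} - V(a_i)}}{\sum_k \delta^{V_{\max} - V(a_k)}},$$
where we have introduced the variable $\delta = \exp(-\alpha)$. Writing $S = \sum_{i \ne \max} \delta^{V_{\max} - V(a_i)}$ for the sum over all sub-optimal actions, we can then express the expected performance as:
$$\sum_i P(a_i)\, V(a_i) = \frac{V_{\max}}{1 + S} + \frac{1}{1 + S} \sum_{i \ne \max} \delta^{V_{\max} - V(a_i)}\, V(a_i) \ge \frac{V_{\max} + V_{\min}\, S}{1 + S} = V_{\max} - (V_{\max} - V_{\min}) \frac{S}{1 + S},$$
where the inequality is obtained by taking out the largest summand and then finding a lower bound for the remaining terms. The second summand in the last equality can be further delimited as:
$$\frac{S}{1 + S} \le \delta, \quad \text{since} \quad S \le \sum_{n=1}^{\infty} \delta^n = \frac{\delta}{1 - \delta},$$
which follows from $|V(a_i) - V(a_k)| \ge 1$ for all $i \ne k$ and the limit properties of the geometric series. Therefore, we have:
$$\sum_i P(a_i)\, V(a_i) \ge V_{\max} - (V_{\max} - V_{\min})\, e^{-\alpha}.$$
As a corollary, we can conclude in the case of minimal interval size $[V_{\min}; V_{\max}] = [V_{\min}; V_{\min} + M]$ that the performance bound is given by $\sum_i P(a_i)\, V(a_i) \ge V_{\max} - e^{-\alpha} M$. Conversely, given an $\varepsilon > 0$, there exists an $\bar{\alpha} = \log \frac{V_{\max} - V_{\min}}{\varepsilon}$, such that for $\alpha \ge \bar{\alpha}$, the expected utility of the decision maker lies within ε of the optimum.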
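The tighter bound for rank utilities can be compared numerically with the general bound of Theorem 1. The sketch below assumes the rank bound in the form $(V_{\max} - V_{\min}) e^{-\alpha}$, which is consistent with the corollary $e^{-\alpha} M$ and with $\bar{\alpha} = \log \frac{V_{\max} - V_{\min}}{\varepsilon}$ stated above; the helper name and the numbers are hypothetical.

```python
import math


def expected_value_uniform(values, alpha):
    """Expected utility under P(a_i) proportional to exp(alpha V(a_i))."""
    v_ref = max(values)
    weights = [math.exp(alpha * (v - v_ref)) for v in values]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / z


ranks = [0.0, 1.0, 2.0, 3.0, 4.0]  # utilities separated by at least one utile
alpha = 3.0

loss = max(ranks) - expected_value_uniform(ranks, alpha)
bound_rank = (max(ranks) - min(ranks)) * math.exp(-alpha)  # rank-utility bound
bound_general = math.log(len(ranks)) / alpha                # uniform-prior bound

# Converse direction: the alpha needed for a desired epsilon under the rank bound.
eps = 0.01
alpha_bar = math.log((max(ranks) - min(ranks)) / eps)
loss_at_alpha_bar = max(ranks) - expected_value_uniform(ranks, alpha_bar)
```

For this choice of `alpha`, the rank bound decays exponentially and is already tighter than the general bound, which only decays as $1/\alpha$; for small `alpha`, the ordering of the two bounds can be reversed.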

Adversarial Environments
So far, we have considered stochasticity in action selection to arise due to limited computational power, even in the absence of any uncertainty in the environment. Naturally, in this setting, stochastic choice yields less (expected) utility than deterministic choice of the best option, but the performance decrement can be bounded by ε. If, however, the environment is potentially adversarial, stochastic action selection can also be superior in terms of utility alone, since it does not allow the opponent to perfectly predict and thwart any deterministic action plan that the decision maker might have. In the following, we will discuss two different scenarios for decision making in adversarial environments, where the decision maker chooses between different actions $a_i \in A$ with (expected) utility $V(a_i)$.

Unknown Action Set
In the first scenario, we assume that the decision maker starts by choosing a probability distribution $P(a_i)$ over actions $a_i \in A$, and then, the environment chooses a subset $S \in \mathcal{P}(A) \setminus \{\emptyset\}$ of permissible actions, where $\mathcal{P}(A)$ denotes the powerset. All actions that are not part of the subset are eliminated. Finally, the action $a_i$ is randomly determined from the set of permissible actions with their renormalized probabilities. The problem is to find the betting probability $P(a_i)$ such that we maximize our expected return; however, the expectation has to be taken over the unknown subset $S$ capriciously chosen by the opponent. This models a decision maker who has to choose a generic hedging strategy by allocating resources to different alternatives, but where the rules of the game are only fully revealed after the choice is made. Formally, we want to choose the probability $P(a_i)$, such that the conditional expectation $E[V(a_i) \mid S]$ is as large as possible. Unsurprisingly, we cannot provide a deterministic optimal solution $P(a_i) = \delta(a_i - a^*)$, since the environment could always eliminate $a^*$. However, if we allow ourselves an arbitrarily small, non-zero performance loss $\varepsilon > 0$, then there is a way to assign probabilities $P(a_i)$, such that the conditional expectation is almost equal to the optimum, i.e., to the highest utility in the subset chosen by the opponent. This is precisely the result of the following theorem.
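Theorem 3 (ε-Optimality in adversarial environments). The expected utility achieved by a bounded rational decision maker that optimizes (3) lies within an ε-neighborhood of the optimal utility $V^S_{\max} = \max_{a_i \in S} V(a_i)$ in $S$ for any subset $S$ of possible actions selected by nature, such that:
$$E[V(a_i) \mid S] \ge V^S_{\max} - \frac{1}{\alpha} \log \frac{1}{P_0(a^S_{\max})}.$$
Proof. The conditional expectation over the renormalized probabilities is:
$$E[V(a_i) \mid S] = \sum_{a_i \in S} \frac{P(a_i)}{\sum_{a_k \in S} P(a_k)}\, V(a_i) = \sum_{a_i \in S} \frac{P_0(a_i \mid S)\, e^{\alpha V(a_i)}}{\sum_{a_k \in S} P_0(a_k \mid S)\, e^{\alpha V(a_k)}}\, V(a_i),$$
where $P(a_i) = \frac{1}{Z} P_0(a_i)\, e^{\alpha V(a_i)}$ and $P_0(a_i \mid S) = P_0(a_i) / \sum_{a_k \in S} P_0(a_k)$. We can then apply Theorem 1 to the expression in the last equality to find that:
$$E[V(a_i) \mid S] \ge V^S_{\max} - \frac{1}{\alpha} \log \frac{1}{P_0(a^S_{\max} \mid S)} \ge V^S_{\max} - \frac{1}{\alpha} \log \frac{1}{P_0(a^S_{\max})},$$
where $a^S_{\max} = \arg\max_{a_i \in S} V(a_i)$ and the second inequality follows from $P_0(a_i \mid S) \ge P_0(a_i)$.
As a corollary, we obtain in the case $P_0(a_i) = \frac{1}{M}$ an ε-bound of $\varepsilon = \frac{1}{\alpha} \log M$. Similarly, Theorem 2 holds for any chosen subset $S$, such that:
$$E[V(a_i) \mid S] \ge V^S_{\max} - (V^S_{\max} - V^S_{\min})\, e^{-\alpha},$$
where $V^S_{\min} = \min_{a_i \in S} V(a_i)$.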
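The subset scenario can be checked by brute force for a small action set. The following Python sketch enumerates every non-empty subset $S$, renormalizes the bounded rational strategy on $S$ and compares the conditional expectation with the bound $V^S_{\max} - \frac{1}{\alpha} \log \frac{1}{P_0(a^S_{\max})}$, which is the form of the bound assumed here; the prior, the utilities and the helper names are hypothetical.

```python
import math
from itertools import combinations


def equilibrium(prior, values, alpha):
    """P(a_i) proportional to P0(a_i) exp(alpha V(a_i)), computed stably."""
    v_ref = max(values)
    weights = [p * math.exp(alpha * (v - v_ref)) for p, v in zip(prior, values)]
    z = sum(weights)
    return [w / z for w in weights]


def conditional_expectation(policy, values, subset):
    """Expected utility after renormalizing the policy on the permitted subset."""
    mass = sum(policy[i] for i in subset)
    return sum(policy[i] * values[i] for i in subset) / mass


prior = [0.1, 0.2, 0.3, 0.4]    # hypothetical prior P0
values = [2.0, 0.5, 1.0, -1.0]  # hypothetical utilities V(a_i)
alpha = 1.5
policy = equilibrium(prior, values, alpha)  # fixed before the subset is revealed

worst_slack = float("inf")
n_subsets = 0
for size in range(1, len(values) + 1):
    for subset in combinations(range(len(values)), size):
        best = max(subset, key=lambda i: values[i])  # best permitted action
        bound = values[best] - math.log(1.0 / prior[best]) / alpha
        slack = conditional_expectation(policy, values, subset) - bound
        worst_slack = min(worst_slack, slack)
        n_subsets += 1
```

A non-negative `worst_slack` means that no choice of subset by the opponent pushes the conditional expectation below the bound, even though the strategy was committed to in advance.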

Unknown Utility
In the second scenario of an adversarial environment, the agent chooses a distribution $P_0(a_i)$ and the environment subsequently chooses $V(a_i)$ in an arbitrary fashion, such that, in general, the choice of $V(a_i)$ may depend on $P_0(a_i)$. Once the $V(a_i)$ are revealed, the decision maker updates the choice strategy according to Equation (4). Importantly, the new distribution $P(a_i)$ is not used as a choice strategy to choose between the different $V(a_i)$ as in the previous theorems, but is only used in a later choice with new, yet unknown utilities. If we denote the trial number or time step by $t$ and assume a trial-by-trial update:
$$P_{t+1}(a_i) = \frac{P_t(a_i)\, e^{\alpha V_t(a_i)}}{\sum_k P_t(a_k)\, e^{\alpha V_t(a_k)}},$$
with initial strategy $P_1 = P_0$, where the utilities $V_t(a_i)$ are bounded in each time step to lie within the unit interval, that is $V_t(a_i) \in [0; 1]$, then the expected performance of the decision maker over $T$ trials can be bounded from below for any action $a_k$ by:
$$\sum_{t=1}^{T} \sum_i P_t(a_i)\, V_t(a_i) \ge \frac{\alpha}{\varepsilon} \sum_{t=1}^{T} V_t(a_k) - \frac{1}{\varepsilon} \log \frac{1}{P_0(a_k)},$$
where $\varepsilon = \exp(\alpha) - 1$. This performance bound can be derived from a hedging analysis originally proposed by Freund and Schapire in a full information game where the decision maker learns about all possible utilities $V_t(a_i)$ in each time step [52,53]. In this case, the decision maker chooses between $M$ different options with probability $p_i(t) = w_i(t) / \sum_j w_j(t)$, where the weights $w_i(t)$ are updated according to:
$$w_i(t+1) = w_i(t)\, (1 + \varepsilon)^{V_i(t)},$$
and where $V_i(t)$ is the utility of option $i$ at time $t$. Since $1 + \varepsilon = e^{\alpha}$, this is the same update as above. It is straightforward to see that a bounded rational decision maker following Equation (4) is hedging when acting according to $P_t(a_i)$ before receiving feedback $V_i(t)$; that is, the bounded rational decision maker has a delay of one time step, as it is the distribution $P_{t+1}(a_i)$ that is bounded optimal for the utility $V_i(t)$ under the prior $P_t(a_i)$.
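The hedging scenario can be simulated with a simple opponent. The sketch below assumes the multiplicative-weights guarantee in the form $\sum_t \sum_i P_t(a_i) V_t(a_i) \ge \frac{\alpha}{\varepsilon} \sum_t V_t(a_k) - \frac{1}{\varepsilon} \log \frac{1}{P_0(a_k)}$ with $\varepsilon = e^{\alpha} - 1$, which is a standard result for this update rule with utilities in $[0; 1]$. The opponent, which rewards only the currently least likely action, is a hypothetical choice made for illustration, as are all names and numbers.

```python
import math


def update(policy, utilities, alpha):
    """One trial of the update: P_{t+1} proportional to P_t exp(alpha V_t)."""
    weights = [p * math.exp(alpha * v) for p, v in zip(policy, utilities)]
    z = sum(weights)
    return [w / z for w in weights]


def adversary(policy):
    """Hypothetical opponent: reward only the action that is currently least likely."""
    target = min(range(len(policy)), key=policy.__getitem__)
    return [1.0 if i == target else 0.0 for i in range(len(policy))]


alpha = 0.5
eps = math.exp(alpha) - 1.0
prior = [0.25] * 4
policy = list(prior)

gained = 0.0                       # expected utility collected by the agent
cumulative = [0.0] * len(prior)    # utility each fixed action would have earned
for t in range(200):
    utilities = adversary(policy)  # chosen after seeing the current strategy
    gained += sum(p * v for p, v in zip(policy, utilities))
    cumulative = [c + v for c, v in zip(cumulative, utilities)]
    policy = update(policy, utilities, alpha)  # only used in the next trial

bound = max(alpha / eps * c - math.log(1.0 / p0) / eps
            for c, p0 in zip(cumulative, prior))
```

Because $\alpha / \varepsilon < 1$ for $\alpha > 0$, the guarantee is multiplicative rather than purely additive, which is why this bound is not an ε-optimality statement in the sense of the previous theorems.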

Discussion and Conclusion
Information-theoretic bounded rationality can be viewed as a prescriptive model of optimal decision making when the decision maker can only afford a certain amount of information processing. Information processing is formalized as a change in probability distribution from a prior distribution representing an a priori choice strategy to a posterior distribution over actions after information processing has taken place. Such changes in distributions can be measured by the relative entropy between prior and posterior distribution and be related to actual physical state changes in thermodynamic systems [34], where the concept of energy is analogous to the concept of utility and computational costs are analogous to entropic costs that reduce the system's capability to do work. This interpretation builds on previous work that has related computational and physical processes; see for example [54] for an overview. As discussed in the Methods, the cost of changing distributions can also be expressed in terms of complexity of sampling processes [50,51].
In this paper, we show that we can abstract away even further both from physical and computational processes when modeling bounded rational decision making with entropic information processing constraints. We show that the performance of information-theoretic bounded rational decision makers can be ε-bounded compared to the perfectly rational decision maker and that, therefore, information-theoretic bounded rationality naturally implies ε-optimality. In this sense, bounded rational decision making is strictly inferior to perfect rationality, which deterministically selects the best action. This, however, changes in adversarial environments. We discuss two scenarios. In the first scenario, the opponent can eliminate any non-empty subset of actions from the choice set after the decision maker has specified her strategy. Here, bounded rationality allows defining an ε-optimal performance criterion under any subset. In the second scenario, the opponent can arbitrarily select utilities for each action, and the agent responds with the bounded rational strategy with respect to the previous utilities. This scenario is equivalent to hedging and also comes with performance bounds, but in contrast to the previous setting, these bounds do not correspond to ε-optimality, since the difference between optimal and actual utility also depends on a multiplicative factor.
The concept of ε-optimality has been previously discussed in the economic literature, in particular within the context of game theory and the solution concept of ε-equilibria [55,56]. In particular, Fudenberg and Levine [57] have investigated the concept of ε-universal consistency in games where players learn a smooth best response to another player from observations. They could show that learning with a softmax-decision rule performs within an ε-bound of the best response with known frequencies of the opponent's play. Importantly, the concept of ε-optimality extends the usual black box approach taken in perfect rationality models of economic decision making where the details of the reasoning process are ignored [30]. In ε-optimality models, the decision maker is assumed to make decisions that are (approximately) optimal; how these decisions are arrived at is largely ignored. The choice of the ε in such models is typically arbitrary. Here, we link the parameter ε quantitatively to the temperature parameter of information-theoretic bounded rationality, which is a Lagrange multiplier indicating the shadow price of changing the distribution representing the choice strategy.
Economic models of decision making are usually considered to be "as if" models. The fact that behavior is consistent with an optimality criterion does not imply that an actual optimization process causes this behavior. Similarly, we could consider the information-theoretic bounded rationality model as an "as if" model, where the decision maker behaves as if optimizing a trade-off between utility and information cost or as if optimizing utility under information processing constraints. In contrast, when engineering an optimal decision maker (for example, a planning algorithm in a robot), typically the utility function is provided by the engineer, and the action is selected by the system after an optimization process. Here, we can consider the information-theoretic bounded rationality model as an anytime search for the optimum that stops when resources run out. Most importantly, however, independent of whether one regards utility functions as causal for behavior or not, bounded rational decision making does not necessarily imply solving a constrained optimization problem that is more difficult than the original unconstrained problem; rather, the decision maker can be regarded as optimizing utility until running out of resources, thereby implicitly optimizing the constrained problem.