Constrained versus Unconstrained Rational Inattention

The rational inattention literature is split between two versions of the model: in one, the mutual information of states and signals is bounded by a hard constraint, while, in the other, it appears as an additive cost term in the decision maker's utility function. The resulting constrained and unconstrained maximization problems are closely related, but their solutions nevertheless differ in certain respects. In particular, movements in the decision maker's prior belief and utility function lead to opposite comparative statics conclusions.


Introduction
The Rational Inattention (RI) model was introduced to economics by Sims [1,2], and it has since been widely applied in a variety of fields. It is based on the premise that attention is a scarce resource for decision makers, and that decision makers optimally allocate their attention given the environment that they face. Sims suggested that a useful way to capture the scarcity of attention is to impose a constraint on the quantity of information that the agent can process. Specifically, the constraint is that the average reduction of entropy, from the agent's prior belief about the state of the world to her posterior, cannot exceed a given threshold. 1 The follow-up literature continued for the most part to use entropy reduction to measure informativeness, but two different versions of the model emerged. The first, which we call the 'constrained version', continues as in Sims to study problems of the form max_x f(x) subject to g(x) ≤ c. Here, x is the information choice of the agent, 2 the objective f maps each choice to the expected utility that it generates, g(x) is the expected reduction of entropy induced by x, and c is the bound on the agent's capacity to process information. The second, 'unconstrained version' instead analyzes maximization problems of the form max_x {f(x) − λ g(x)}, where x, f, g are as before and λ captures the marginal cost of attention.
The purpose of this note is to point out that, while the two versions of the problem are obviously closely related, their solutions differ in several important aspects. Thus, the conclusions reached when using one of these versions do not automatically transfer to the other and tests of the validity of the RI model may reach different conclusions, depending on which of the two versions is tested.
The connection between the two versions is as follows: the Lagrangian of the constrained version is given by f(x) − λ[g(x) − c], where λ is the multiplier of the RI 'budget constraint'. 3 Therefore, the first-order conditions with respect to x are the same as in the unconstrained version and, since these programs are convex, the conditions are also sufficient. Furthermore, as long as c is not too large, the budget constraint binds. Therefore, x solves the constrained version if and only if (i) x solves the unconstrained version with some λ > 0 and (ii) the constraint binds at x. In the other direction, if x solves the unconstrained version, then it also solves the constrained version with parameter c = g(x). Despite this apparent equivalence, note that, in the constrained version, the Lagrange multiplier λ is determined endogenously, while, in the unconstrained version, it is part of the description of the problem. This is the reason underlying the differences between the solutions. First, for a fixed decision problem, the mapping from the parameter c in the constrained version to the corresponding multiplier λ need not be one-to-one, i.e., there may be an interval of c values mapped to the same multiplier, say λ*. These critical values λ* are associated with 'regime changes' in the unconstrained version, where the set of actions considered by the agent shifts.

1 The use of entropy for measuring quantity of information has its origins in the classical work of [3] on the capacity of a communication channel. 2 More precisely, the agent chooses both the information structure and the action to take conditional on the realized signal. See the next section. 3 The Lagrangian contains additional terms due to constraints associated with x being a collection of probability distributions. These constraints are common to the two versions and are not relevant for the current discussion.
When analyzing the unconstrained problem, these cases appear to be knife-edge and negligible, but, for the constrained problem, this is exactly where much of the "action" takes place. We demonstrate this phenomenon with a simple example (Section 3), and then show that it always happens in two families of decision problems (Propositions 2 and 3).
A byproduct of this observation is that some properties of the solution of the unconstrained version that have been emphasized in the literature fail to hold in the constrained version. For example, Caplin and Dean [4] show that, for the unconstrained problem, there is always a solution in which the number of posteriors (and, hence, the number of actions) chosen by the agent is, at most, the number of states. This is no longer true for the constrained problem: there may be intervals of c values at which any solution uses more actions than there are states. 4 Another example is the dependence of the optimal set of posteriors on the parameter. It is easy to see that any two different values of λ lead to different sets of posteriors in the solution to the unconstrained version. In the constrained version, on the other hand, there may be intervals of c values where the optimal posteriors stay fixed and only the allocation of mass between them changes as c varies.
Second, as the decision problem changes the mapping from c to λ changes with it, which leads to reversal of known comparative statics results for the unconstrained version. Specifically, one important property of the unconstrained version is that changes in the prior do not affect the optimal set of posteriors, so long as the prior remains within the convex-hull of these posteriors. This property was termed "locally invariant posteriors" (LIP) by Caplin and Dean [4], and it has been experimentally tested by Dean and Neligh [6]. For the constrained version, quite the opposite is true: if the set of optimal posteriors is affinely independent, which is often the case, then changes in the prior almost always lead to changes in the optimal posteriors. See Proposition 5 for details.
On the other hand, scaling up or down the stakes of the decision problem works in the opposite way: while the solution to the unconstrained version is sensitive to such changes, for the constrained version, the solution stays the same. Indeed, scaling up the utility function has exactly the same effect as scaling down the marginal cost of attention λ in the unconstrained version. When λ changes, the solution to the unconstrained problem changes with it, as already mentioned above. However, in the constrained problem, a rescaling of utility is accompanied by a corresponding rescaling of the multiplier λ, and the two cancel each other.
These differences between the two versions have simple testable implications and can help to guide the modeling of rationally inattentive agents. 5 For instance, Propositions 2 and 3 describe classes of decision problems, in which the two versions significantly differ in their predictions, offering a direct way to distinguish between them. Similarly, Proposition 5 on the failure of LIP in the constrained version can be used to refute the validity of this model while using experimental or empirical data.

Related Literature
This paper makes a theoretical contribution to the growing body of literature on RI, see Maćkowiak et al. [7] for a recent survey.
We work in a finite environment and make extensive use of the characterization of the solution of the unconstrained version in Matějka and McKay [8] (MM, henceforth) and Caplin et al. [9] (CDL, henceforth). For most of the analysis, we view the agent as choosing a distribution over posterior beliefs, rather than state-dependent distributions over actions, see e.g., Caplin and Dean [4] and Caplin et al. [10] for previous works using this approach.
The constrained and unconstrained versions are both extensively used in the literature. Roughly speaking, static models tend to adopt the unconstrained version, while dynamic models the constrained one, although there are exceptions in both directions. 6 De Oliveira [13] axiomatizes the unconstrained version of the RI model and comments that, due to the Lagrangian connection, the constrained version behaves similarly for small variations in the menu of available acts. Matějka [11] points out that, in his model, the multiplier λ decreases as c increases, and Fulton ([14], Theorem 2) makes a similar observation in a continuous Gaussian framework; as we show below, in the discrete case, the relationship is monotonic, but sometimes only weakly so.
Le Treust and Tomala [5] analyze the interaction between a sender and a receiver, who communicate through a noisy channel. The receiver faces a sequence of n identical decision problems and the sender sends k messages through the channel. The main result is that, as k, n grow, the payoff of the sender converges to the value of the constrained version of the RI problem, with c being determined by the channel's capacity. They then show that the number of posteriors in the solution to the constrained problem can always be chosen to be, at most, one more than the number of states, and they give an example showing that this bound is tight. 7 Their example is similar to the one that is given below in Section 3. Relative to that paper, the contribution of our Propositions 2 and 3 below is to show that more actions than states in the solution is a 'robust' phenomenon that holds for intervals of c values and, in general, classes of decision problems.
Our results may be relevant for the experimental tests of the RI model and for its estimation. Dean and Neligh [6] use the testable implications that were identified by Caplin and Dean [17] and by Caplin et al. [10] in order to study whether subjects' choices are consistent with models of costly information acquisition in general, and with the RI unconstrained model in particular. One of their findings is that subjects pay more attention (more likely to make the right choice) when the stakes are higher, which is consistent with the unconstrained version of the RI model, but not with the constrained version (see Proposition 6). Dewan and Neligh [18] observe a similar kind of behavior by most subjects (60%) in their experiment; however, note that many subjects were non-responsive to scaling up of the incentives, which suggests that the constrained model may better fit a significant fraction of the population. 8 Dean and Neligh [6] also test the LIP property and find that it generally holds. In view of Proposition 5, this is another indication that the unconstrained model does a better job in explaining the data. It would be interesting to see whether LIP holds more generally in other kinds of decision problems and with other implementations of attention costs.
Finally, Cheremukhin et al. [19] use laboratory data in order to estimate a hybrid model that includes the two versions considered in this paper as special cases. The behavior of approximately 70% of their subjects is better described by an additive cost term than by a capacity constraint. This further suggests that the unconstrained model is a better fit for most decision makers, but, at the same time, that one should not dismiss the constrained version as irrelevant.

Two Versions of the RI Problem
For the most part, we follow the notation of CDL in order to facilitate an easy comparison. There is a finite set Ω of states, with ω ∈ Ω denoting a typical state. The prior belief of the decision maker (DM) is µ ∈ ∆(Ω), where ∆(X) is the set of probability distributions over any finite set X. We assume that µ assigns positive probability to every ω ∈ Ω. The finite set of available actions is A. For each pair (a, ω) ∈ A × Ω, the utility of the DM when she chooses action a and the realized state is ω is denoted by u(a, ω). A decision problem is described by the triplet (µ, A, u).
Throughout, we restrict attention to decision problems that satisfy two assumptions. First, actions are not duplicates, i.e., u(a, ω) ≠ u(a′, ω) for some ω whenever a ≠ a′. Second, different actions are optimal in different states, i.e., if ω ≠ ω′, then arg max_a u(a, ω) ∩ arg max_a u(a, ω′) = ∅. The first assumption is purely for expositional reasons. As for the second, all of our results still hold without this assumption, but the upper end of the range of the cost parameter c for which they hold may decrease.
The DM chooses an information structure, i.e., a mapping from states to distributions over some set of signals, as well as which action to play after observing each signal. However, it is without loss ([8], Lemma 1, e.g.) to restrict attention to information structures with at most one signal per action in A and to identify signals with the actions that they induce. Therefore, the choice variable is a mapping P : Ω → ∆(A), where P(a|ω) is the probability of action a conditional on state ω. With slight abuse of notation, P also denotes the unconditional probability of actions, which is P(a) = ∑ ω µ(ω)P(a|ω). Following CDL, the consideration set of P is B(P) = {a ∈ A : P(a) > 0}.
In order to state the problem, it is useful to introduce one more piece of notation. If p ∈ ∆(X) for some finite set X, then H(p) = − ∑ x∈X p(x) ln p(x) is the entropy of p. We will use H for distributions over A as well as over Ω.
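Since expected entropy reduction is central in what follows, a small numerical sketch may help; the probabilities below are illustrative inventions, not numbers from the paper. It verifies that the entropy reduction computed on the action side equals the expected reduction from prior to posteriors (both equal the mutual information of states and actions).

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) ln p(x), natural log, with 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    return float(-(p[p > 0] * np.log(p[p > 0])).sum())

# Illustrative numbers: two states, three actions.
mu = np.array([0.5, 0.5])                # prior over states
P = np.array([[0.7, 0.1],                # P[a, w] = P(a | w); columns sum to 1
              [0.2, 0.2],
              [0.1, 0.7]])
q = P @ mu                               # unconditional action probabilities P(a)
gammas = (mu[None, :] * P) / q[:, None]  # posteriors gamma_a by Bayes' rule

# Two equivalent ways of writing the expected reduction of entropy:
cost_actions = entropy(q) - sum(mu[w] * entropy(P[:, w]) for w in range(2))
cost_beliefs = entropy(mu) - sum(q[a] * entropy(gammas[a]) for a in range(3))
```

The equality of the two expressions is what allows the analysis to move freely between state-dependent choice and distributions over posteriors.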
In their papers, MM and CDL consider the following unconstrained maximization problem, where λ > 0 is an exogenous parameter:

max_P { ∑_ω µ(ω) ∑_a P(a|ω) u(a, ω) − λ [ H(P) − ∑_ω µ(ω) H(P(·|ω)) ] }.

The first term is the expected utility that the DM obtains by conditioning her choice on the observed signal, while the second term in brackets is the expected reduction of entropy from the marginal distribution of actions to the state-contingent distributions, which captures the cost of attention.
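For intuition, the fixed point described by MM can be computed by simple iteration (a Blahut-Arimoto-style scheme). The sketch below is our own illustration, not code from the paper; the payoff matrix is the binary-state example of the next section, with payoffs reconstructed from its stated indifference cutoffs (an assumption).

```python
import numpy as np

# Reconstructed example payoffs: rows are actions l, m, r; columns states w0, w1.
U = np.array([[ 1.0, -2.0],   # l: good at w0
              [ 0.0,  0.0],   # m: safe
              [-2.0,  1.0]])  # r: good at w1

def solve_unconstrained(mu, U, lam, iters=3000):
    """Iterate P(a|w) proportional to P(a) exp(u(a,w)/lam), updating the marginal P(a)."""
    Z = np.exp(U / lam)                        # Z(a, w, lam) = exp(u(a,w)/lam)
    q = np.full(U.shape[0], 1.0 / U.shape[0])  # initial action marginal
    for _ in range(iters):
        P = q[:, None] * Z
        P /= P.sum(axis=0, keepdims=True)      # normalize over actions, per state
        q = P @ mu
    return P, q

mu = np.array([0.5, 0.5])
P, q = solve_unconstrained(mu, U, lam=1.0)
gamma_l = mu[1] * P[0, 1] / q[0]   # posterior Pr(w1) after the signal inducing l
```

At λ = 1, below the example's cutoff λ*, the safe action m is dropped and the two chosen posteriors are symmetric around the prior.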
We compare the unconstrained program with the following constrained maximization problem with parameter 0 < c < H(µ): 9

max_P ∑_ω µ(ω) ∑_a P(a|ω) u(a, ω)   subject to   H(P) − ∑_ω µ(ω) H(P(·|ω)) ≤ c.   (1)

Here the objective only includes the benefit of receiving information, and the constraint (1) requires that the expected reduction in entropy does not exceed the 'budget of attention' c. As mentioned above, this second formulation corresponds to the original RI problem that was introduced by Sims [2].
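Numerically, the constrained problem can be solved through its Lagrangian connection: search for the multiplier λ at which the unconstrained solution spends exactly c. This is a hypothetical sketch, not the authors' code; it presumes the monotonicity of information in λ (Lemma 1 below) and uses the payoffs reconstructed for the example of Section 3.

```python
import numpy as np

U = np.array([[1.0, -2.0], [0.0, 0.0], [-2.0, 1.0]])  # example payoffs l, m, r

def solve_unconstrained(mu, U, lam, iters=2000):
    Z = np.exp(U / lam)
    q = np.full(U.shape[0], 1.0 / U.shape[0])
    for _ in range(iters):
        P = q[:, None] * Z
        P /= P.sum(axis=0, keepdims=True)
        q = P @ mu
    return P, q

def info_cost(mu, P, q):
    """Expected reduction of entropy (mutual information), in nats."""
    ratio = (P + 1e-300) / (q[:, None] + 1e-300)
    return float((mu[None, :] * P * np.log(ratio)).sum())

def solve_constrained(mu, U, c, lo=0.05, hi=50.0, steps=60):
    """Bisect on lam: the information spent is weakly decreasing in lam."""
    for _ in range(steps):
        lam = 0.5 * (lo + hi)
        P, q = solve_unconstrained(mu, U, lam)
        if info_cost(mu, P, q) > c:
            lo = lam            # too much information: raise its price
        else:
            hi = lam
    return lam, P, q

mu = np.array([0.5, 0.5])
lam, P, q = solve_constrained(mu, U, c=0.4)
```

For budgets c in the 'flat' region discussed below, the bisection converges to the critical multiplier λ* but the returned P need not spend exactly c; there the constrained optimum is a mixture of unconstrained solutions.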
The relationship between programs of these two forms is well-understood in general. While λ is part of the input in the unconstrained version, it is the Lagrange multiplier of the constraint in the constrained version and is therefore endogenously determined. The following proposition formalizes this connection.

Proposition 1. The mapping P solves the constrained problem with parameter c ∈ (0, H(µ)) if and only if
(i) P satisfies the budget constraint (1) with equality; and, (ii) P solves the unconstrained problem with some λ > 0.
The fact that the budget constraint necessarily binds for any c ∈ (0, H(µ)) is a consequence of our assumption that different actions are optimal at different states; a proof is provided in Appendix A. Once this is established, and given that these are convex programs, the rest of the proof easily follows from the KKT theorem ( [20], Corollary 28.3.1, e.g.), and, therefore, is omitted.
Let Z(a, ω, λ) = exp(u(a, ω)/λ). MM and CDL prove that P solves the unconstrained problem if and only if it satisfies

P(a|ω) = P(a) Z(a, ω, λ) / ∑_{b∈A} P(b) Z(b, ω, λ)   (2)

for every a and ω, and, in addition,

∑_ω µ(ω) Z(a, ω, λ) / ∑_{b∈A} P(b) Z(b, ω, λ) ≤ 1, with equality whenever P(a) > 0,   (3)

for every a.
It is often more convenient to work with the distribution over posteriors induced by P than with P itself. Namely, instead of choosing P : Ω → ∆(A), we can equivalently think of the DM as choosing the unconditional probabilities of actions {P(a)}_{a∈A} and the posteriors {γ_a ∈ ∆(Ω)}_{a∈B(P)} subject to ∑_a P(a)γ_a(ω) = µ(ω) for every ω ∈ Ω. 10 CDL show that conditions (2) and (3) can be rewritten as

γ_a(ω) Z(b, ω, λ) = γ_b(ω) Z(a, ω, λ)   (4)

for every a, b ∈ B(P), ω ∈ Ω, and

∑_ω γ_a(ω) Z(b, ω, λ) / Z(a, ω, λ) ≤ 1   (5)

for every a ∈ B(P) and b ∈ A. Furthermore, the budget constraint (1), which must bind at the optimum, can be rewritten as

H(µ) − ∑_{a∈B(P)} P(a) H(γ_a) = c.
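The equivalence between state-dependent choice and distributions over posteriors is easy to verify numerically; the numbers below are illustrative, not taken from the paper.

```python
import numpy as np

mu = np.array([0.4, 0.6])               # illustrative prior
P = np.array([[0.8, 0.3],               # P[a, w] = P(a | w); columns sum to 1
              [0.2, 0.7]])

# Forward: from P to unconditional action probabilities and posteriors.
q = P @ mu                              # P(a) = sum_w mu(w) P(a|w)
gammas = (mu[None, :] * P) / q[:, None] # gamma_a(w) = mu(w) P(a|w) / P(a)

# Bayes-plausibility: the chosen posteriors must average back to the prior.
recovered_mu = q @ gammas

# Backward: recover the state-dependent choice, P(a|w) = P(a) gamma_a(w) / mu(w).
P_back = q[:, None] * gammas / mu[None, :]
```

The round trip is exact, which is why the two formulations of the choice variable can be used interchangeably.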

An Example
We now illustrate the differences between the solutions of the two problems with an example; in the next section, we show that these differences hold more generally. Let Ω = {ω_0, ω_1}. Because Ω only has two elements, we identify ∆(Ω) with the [0, 1] interval and describe its elements by the probability that the state is ω_1. The set of actions is A = {l, m, r}, standing for left, middle, and right. The following table provides the utility function (payoffs normalized so that the indifference cutoffs below are 1/3 and 2/3):

        ω_0    ω_1
l        1     −2
m        0      0
r       −2      1

Thus, m is a safe action with a sure payoff of zero; l and r are risky actions, where l gives a high payoff at the 'left' state ω_0 and r gives a high payoff at the 'right' state ω_1. Note that l is optimal for beliefs γ ∈ [0, 1/3], m is optimal for γ ∈ [1/3, 2/3], and r is optimal for γ ∈ [2/3, 1]. See Figure 1. We use the distribution over posteriors (P(l), P(m), P(r)), {γ_a}_{a∈B(P)} for the analysis, as this makes it easier to visualize the solution. We break the analysis into three cases, depending on the location of the prior µ ∈ (0, 0.5], and, for each case, compare the solutions of the unconstrained and constrained problems. 11 The figures below illustrate the solutions of the three cases, while proofs of the claims can be found in Appendix A.

10 Indeed, starting from P(a|ω), define P(a) = ∑_ω µ(ω)P(a|ω) and let γ_a(ω) = µ(ω)P(a|ω)/P(a) for any a ∈ B(P). In the other direction, set P(a|ω) = P(a)γ_a(ω)/µ(ω).
In order to describe the solution, it is useful to introduce additional notation: let λ* be the cutoff cost level defined below, and let γ* = (1 + exp(3/λ*))^{−1} be the associated extreme posterior.

Case 1: µ = 0.5. For the unconstrained problem, the solution is as follows. If λ > λ*, then P(m) = 1 and γ_m = 0.5. If 0 < λ < λ*, then P(l) = P(r) = 0.5 and γ_l = 1 − γ_r = (1 + exp(3/λ))^{−1}. If λ = λ*, then any mixture (including degenerate) of the former two solutions is optimal. Thus, if the cost parameter λ is high, then the DM chooses m for sure without seeking any information; once λ falls below the threshold level λ*, the DM chooses sufficiently informative symmetric signals, so that she ends up choosing either l or r for sure; only at the cutoff λ* is there the possibility that all three actions are considered, and even in this case B(P) = {m} and B(P) = {l, r} are still optimal. The value of λ* is determined by the condition that the 'net utilities' (see [9]), i.e., the differences between expected utility and cost, of the three associated posteriors 0.5, γ*, 1 − γ* are equal.
Moving on to the constrained problem, first note that it can never (for no c > 0) be optimal to choose no information, since, by Proposition 1, the budget constraint must bind. When c is small, specifically 0 < c < H(0.5) − H(γ*), the solution is given by P(l) = P(r) = α/2 and P(m) = 1 − α, where α = c/(H(0.5) − H(γ*)), with posteriors γ_l = γ*, γ_m = 0.5, and γ_r = 1 − γ*. That is, for c in this interval, we have B(P) = {l, m, r} and, as c increases, the posteriors stay constant while more probability is shifted from the middle γ_m = 0.5 to the extremes γ_l = γ* and γ_r = 1 − γ*. In light of Proposition 1, this is possible because, for all c in this range, the corresponding value of the Lagrange multiplier is λ*. However, in contrast to the previous paragraph, it is strictly beneficial here for the DM to move as much mass as possible to the extreme posteriors, since the cost does not directly enter the objective. For larger values of c, specifically H(0.5) − H(γ*) ≤ c < H(0.5), the solution is given by P(l) = P(r) = 0.5, with the posteriors γ_l = 1 − γ_r determined by the equation H(0.5) − H(γ_l) = c. Figure 2 illustrates the consideration sets of the two solutions. The figure also shows the mapping from c to the corresponding value of λ at the optimum of the constrained problem. While this mapping is weakly decreasing (a higher c implies a weakly lower λ), it is neither one-to-one nor onto the entire range of λ's.

Case 2: γ* < µ < 0.5. Figure 3 illustrates the solution for this case. In the unconstrained problem, when λ < λ*, the solution is similar to the previous case µ = 0.5: the consideration set is {l, r}, the posteriors are γ_l = 1 − γ_r = (1 + exp(3/λ))^{−1}, and the probabilities are set to satisfy P(l)γ_l + P(r)γ_r = µ.
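The claims of Case 1 can be checked by direct computation. The sketch below uses the reconstructed payoffs (our assumption throughout): under them, the net-utility indifference condition reduces to exp(1/λ*) + exp(−2/λ*) = 2, and for a budget c inside the low-c interval the three-posterior solution strictly outperforms the best two-posterior {l, r} alternative.

```python
import numpy as np

def H(g):
    """Binary entropy (nats) of a posterior g = Pr(w1)."""
    return float(-g * np.log(g) - (1 - g) * np.log(1 - g)) if 0 < g < 1 else 0.0

# lam* solves exp(1/lam) + exp(-2/lam) = 2; equivalently exp(1/lam*) is the
# golden ratio, since t + 1/t**2 = 2 factors as (t - 1)(t**2 - t - 1) = 0.
lam_star = 1.0 / np.log((1 + np.sqrt(5)) / 2)
gamma_star = 1.0 / (1.0 + np.exp(3.0 / lam_star))  # extreme posterior at lam*
cap = np.log(2) - H(gamma_star)         # info used by the {l, r} solution at lam*

c = 0.1                                 # a budget in the interval (0, cap)
# Three-posterior solution: weight alpha on the {l, r} pair, 1 - alpha on 0.5.
alpha = c / cap                                    # exhausts the budget exactly
value_three = alpha * (1 - 3 * gamma_star)         # E[u] at posterior g, choosing l, is 1 - 3g

# Best two-posterior {l, r} alternative with the same budget: H(g) = ln 2 - c.
lo, hi = 1e-9, 1 / 3
for _ in range(200):
    g = 0.5 * (lo + hi)
    lo, hi = (g, hi) if H(g) < np.log(2) - c else (lo, g)
value_two = 1 - 3 * g
```

The comparison confirms that restricting the DM to two posteriors is strictly costly here, in contrast to the unconstrained problem.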

Model Comparison
In this section, we state several results regarding the relationship and differences between the two versions of the RI problem. These results generalize the insights gained from the above example. All of the proofs not in the main text can be found in Appendix A.
We start with the following lemma, which describes the correspondence between c and λ; this correspondence is key for the subsequent analysis.

Lemma 1. Fix a decision problem (µ, A, u). For every c ∈ (0, H(µ)), there is a unique λ > 0, denoted λ(c), such that every solution to the constrained problem with parameter c solves the unconstrained problem with λ(c) and with no other λ. The mapping c → λ(c) is continuous and (weakly) decreasing on (0, H(µ)), and lim_{c→H(µ)−} λ(c) = 0.
Lemma 1 implies that a choice consistent with optimization in the constrained problem with parameter c can be rationalized as optimal behavior in the unconstrained problem for only one value of λ, namely λ(c). The continuity and monotonicity of λ(c) imply that its image contains the entire interval (lim_{c→H(µ)−} λ(c), lim_{c→0+} λ(c)) = (0, lim_{c→0+} λ(c)). Therefore, for every λ in this interval, the optimal behavior in the unconstrained problem can be rationalized as optimal behavior in the constrained problem with some c.
However, as illustrated in the example, for a given λ, there may be multiple solutions that correspond to different values of c, so a solution to the unconstrained problem with λ(c) need not be optimal (or even feasible) for the constrained problem with c.
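The many-to-one region of the map c → λ(c) can be seen numerically. The following is a hypothetical sketch using the reconstructed example payoffs: budgets on both sides of the critical interval map to (approximately) the same multiplier λ*, while a larger budget maps to a strictly smaller one.

```python
import numpy as np

U = np.array([[1.0, -2.0], [0.0, 0.0], [-2.0, 1.0]])  # example payoffs l, m, r
mu = np.array([0.5, 0.5])

def solve_unconstrained(mu, U, lam, iters=4000):
    Z = np.exp(U / lam)
    q = np.full(U.shape[0], 1.0 / U.shape[0])
    for _ in range(iters):
        P = q[:, None] * Z
        P /= P.sum(axis=0, keepdims=True)
        q = P @ mu
    return P, q

def info_cost(mu, P, q):
    ratio = (P + 1e-300) / (q[:, None] + 1e-300)
    return float((mu[None, :] * P * np.log(ratio)).sum())

def lam_of_c(c, lo=0.5, hi=5.0, steps=40):
    """Bisect for the multiplier lam(c) of the budget constraint."""
    for _ in range(steps):
        lam = 0.5 * (lo + hi)
        P, q = solve_unconstrained(mu, U, lam)
        lo, hi = (lam, hi) if info_cost(mu, P, q) > c else (lo, lam)
    return lam

lam_star = 1.0 / np.log((1 + np.sqrt(5)) / 2)   # critical multiplier here
lams = {c: lam_of_c(c) for c in (0.05, 0.15, 0.4)}
```

Both c = 0.05 and c = 0.15 lie in the flat segment and return (numerically) the same multiplier, illustrating why λ(c) is weakly but not strictly decreasing.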

Consideration Sets and Optimal Posteriors
Perhaps the most apparent difference between the solutions of the two problems in the example is that there is an interval of c values for which the solution of the constrained problem has all three actions {l, m, r} considered, while, in the unconstrained problem, this can only happen at the single point λ*, and, even at λ*, there are other solutions that involve only subsets of the actions. It is well-known [4] that this feature of the solution of the unconstrained problem is true in general: in every decision problem, there is always a solution in which the size of the consideration set is, at most, |Ω|. We now show that, in several cases of interest, the solution of the constrained problem behaves quite differently. Thus, the example of Section 3 is in no way special.
The first result generalizes the example to any decision problem with a binary state-space, at least three actions, and a not-too-extreme prior.
In the next proposition, we consider a class of decision problems similar to the one analyzed in (Caplin et al. [9], Section 3.1), but with an additional action corresponding to the outside option of the DM. Let Ω = {ω_1, . . . , ω_m}. Consider the decision problem in which A = {a_1, . . . , a_m, o}, and the utility function is given by u(a_i, ω_j) = 1 if i = j and u(a_i, ω_j) = 0 if i ≠ j, while u(o, ω) = t for every ω ∈ Ω. Thus, if the DM correctly guesses the state, then her payoff is 1, while any wrong guess yields a payoff of 0. In addition, the safe choice o guarantees a payoff of t. 14 For this class of problems, we obtain a similar result to that of Proposition 2.
The assumption that 1 − t < µ(ω i ) for each i guarantees that the optimal action at the prior µ is the outside option o, and that µ is centrally located, in the sense that it is not in the convex-hull of any collection of m − 1 posteriors at which the a i actions are optimal.
The intuition for the last two propositions is similar to the one in the example: when λ is small, the DM obtains precise information guaranteeing that one of the 'extreme' actions (the a i 's) will be selected. When λ is relatively large, the DM may seek some information, but it will often end up at a posterior in which a safer action is optimal (o in Proposition 3). The transition between these two regimes happens at λ * . Moreover, at λ * , mixtures of these two types of solutions are also optimal, so there is a range of c values mapped to λ * and both types of actions are considered.

Remark 1.
Another known property of the unconstrained problem is that the same set of posteriors cannot be a solution for two different values of λ, provided that some information is obtained (this immediately follows from condition (4) above). As shown in the example, this is not true for the parameter c in the constrained problem: there is an interval of c values such that the chosen posteriors are fixed, and only the allocation of mass over these posteriors changes as c varies. The proofs of Propositions 2 and 3 make it clear that the same is also true in these families of decision problems.

An 'Anything Goes' Result
We end this section by arguing that the testable implications of the constrained model are limited if the analyst does not know the DM's utility function and prior. Namely, any finite set of posteriors that is not convex independent can arise as the solution to the constrained problem for an interval of c values in some decision problem. Note that the set of posteriors can be arbitrarily large.

Proposition 4.
Fix Ω, let n ≥ 3 be an arbitrary integer, and consider a collection Γ = {γ_i}_{i=1}^n of distinct elements in the relative interior of ∆(Ω). If Γ is not convex independent (i.e., if there is γ_i in the convex hull of Γ \ {γ_i}), then there is a decision problem (µ, A, u) and 0 < c1 < c2 < H(µ) such that, for every c ∈ (c1, c2), (i) there is a solution of the constrained problem in which the set of posteriors is Γ; and (ii) if γ ∉ Γ, then γ is not part of any solution of the constrained problem.
A couple of comments are in order. First, property (ii) of the proposition guarantees that optimal posteriors must be in Γ; in particular, the decision problem is non-trivial in the sense that not every P is optimal. Second, while we know that the set Γ is also a solution to the unconstrained problem for some λ, Remark 1 above implies that it is not robustly so, in the sense that arbitrarily small changes in λ would change the optimal set of posteriors. This is in contrast to the constrained version, in which Γ remains optimal for all c in the interval (c1, c2).

Locally Variant Posteriors
The "locally invariant posteriors" (LIP) property [4] states that changes in the prior do not affect the optimal set of posteriors for the unconstrained problem, as long as this set remains feasible. Dean and Neligh [6] experimentally test this property and find that it is generally satisfied.
In the constrained problem, on the other hand, arbitrarily small changes in the prior typically induce different sets of optimal posteriors. Consider the example of Section 3 with some given prior µ, and suppose that c is large enough, so that the optimal consideration set is {l, r}. The optimal posteriors satisfy γ_l = 1 − γ_r in this case. Therefore, by the symmetry of H, the cost of an optimal P is given by H(µ) − [P(l)H(γ_l) + P(r)H(γ_r)] = H(µ) − H(γ_l).
Because the budget constraint binds, we must have H(µ) − H(γ l ) = c. Therefore, if c is fixed and the prior µ changes, then the optimal posteriors γ l , γ r must also change.
The reason for the failure of LIP is clear: suppose that we fix a set of affinely independent posteriors {γ a }. For µ in the convex-hull of this set, say µ = ∑ a P(a)γ a , the cost associated with the choice of these posteriors is H(µ) − ∑ a P(a)H(γ a ). The first term is strictly concave in µ, while the second is linear in µ (since the vector {P(a)} changes linearly with µ). Therefore, with a fixed set of posteriors, the cost is a strictly concave function of the prior µ, which implies that changes in the prior typically lead to variation in the cost. Because the budget constraint binds, the set of posteriors associated with an optimal solution must adjust to keep the cost unchanged. Therefore, we have the following.

Proposition 5.
Consider a decision problem (µ, A, u) and parameter c, such that the constrained problem has a unique solution P with posteriors {γ_a}_{a∈B(P)} that are affinely independent. Then the set of priors µ′ for which the same set of posteriors {γ_a}_{a∈B(P)} is optimal for the decision problem (µ′, A, u) with parameter c is a nowhere dense subset of ∆(Ω). 15

Another way to think about the failure of LIP is that, under the conditions of the proposition, λ(c) for the decision problem (µ, A, u) is usually different from λ(c) for (µ′, A, u). When λ changes, the optimal set of posteriors in the unconstrained problem changes with it, as mentioned in the previous subsection.
We note that the assumption of affinely independent posteriors cannot be dispensed with. Indeed, going back to the example, consider the case µ = 0.5 and c not too large, so that the solution to the constrained problem has B(P) = {l, m, r}. For µ′ close to µ and the same c, the solution to the constrained problem still has B(P) = {l, m, r}. Additionally, when all three actions are considered, the posteriors must be γ_l = γ*, γ_m = 0.5, and γ_r = 1 − γ*. Finally, the differences between the unconstrained and constrained versions of the RI model that are exhibited in Proposition 5 extend well beyond the case where cost is measured by reduction of entropy. Indeed, both the LIP property in the unconstrained version and its failure in the constrained version continue to hold so long as cost is measured by the expected reduction in the value of a strictly concave function of posteriors.
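The failure of LIP can be illustrated numerically: holding c fixed and moving the prior, the constrained-optimal posteriors must shift so that the entropy reduction stays equal to c. A hypothetical sketch with the reconstructed example payoffs:

```python
import numpy as np

U = np.array([[1.0, -2.0], [0.0, 0.0], [-2.0, 1.0]])  # example payoffs l, m, r

def solve_unconstrained(mu, U, lam, iters=2000):
    Z = np.exp(U / lam)
    q = np.full(U.shape[0], 1.0 / U.shape[0])
    for _ in range(iters):
        P = q[:, None] * Z
        P /= P.sum(axis=0, keepdims=True)
        q = P @ mu
    return P, q

def info_cost(mu, P, q):
    ratio = (P + 1e-300) / (q[:, None] + 1e-300)
    return float((mu[None, :] * P * np.log(ratio)).sum())

def constrained_posterior_l(mu1, c):
    """Posterior Pr(w1) after the 'l' signal in the constrained optimum."""
    mu = np.array([1 - mu1, mu1])
    lo, hi = 0.05, 50.0
    for _ in range(50):                 # bisect on the multiplier
        lam = 0.5 * (lo + hi)
        P, q = solve_unconstrained(mu, U, lam)
        lo, hi = (lam, hi) if info_cost(mu, P, q) > c else (lo, lam)
    return mu[1] * P[0, 1] / q[0]

g_half = constrained_posterior_l(0.50, c=0.4)   # prior 0.5
g_tilt = constrained_posterior_l(0.35, c=0.4)   # nearby prior, same budget
```

With the budget held fixed, the optimal posterior after the l signal moves with the prior, exactly as the proposition predicts; in the unconstrained problem with a fixed λ it would stay put.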

Utility Scaling
While the solution of the constrained problem is sensitive to changes in the prior, it does not change when the stakes of the decision problem are scaled up or down.

Proposition 6. Consider a decision problem (µ, A, u) and suppose that P solves the constrained problem with parameter c. Then P is also a solution of the constrained problem with parameter c in the decision problem (µ, A, ρ · u) for every ρ > 0.
The argument is straightforward: since P is optimal in (µ, A, u), it follows from Proposition 1 that P solves the unconstrained problem for this decision problem with some λ. Therefore, P also solves the unconstrained problem for the scaled decision problem (µ, A, ρ · u) with λ′ = ρ · λ. Because the cost of P does not change with the problem, Proposition 1 implies that P also solves the constrained problem for (µ, A, ρ · u) with the same c.
Notice that stakes do matter in the unconstrained version: scaling up the utility has the exact same effect as scaling down the marginal cost of information λ, which, as already discussed, necessarily changes the solution.
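Both halves of this comparison can be checked numerically with the reconstructed example payoffs (a hypothetical sketch): the constrained solution is unchanged when utilities are multiplied by ρ = 5, with the multiplier scaling up instead, while the unconstrained solution at a fixed λ changes sharply.

```python
import numpy as np

U = np.array([[1.0, -2.0], [0.0, 0.0], [-2.0, 1.0]])  # example payoffs l, m, r
mu = np.array([0.5, 0.5])

def solve_unconstrained(mu, U, lam, iters=2000):
    Z = np.exp(U / lam)
    q = np.full(U.shape[0], 1.0 / U.shape[0])
    for _ in range(iters):
        P = q[:, None] * Z
        P /= P.sum(axis=0, keepdims=True)
        q = P @ mu
    return P, q

def info_cost(mu, P, q):
    ratio = (P + 1e-300) / (q[:, None] + 1e-300)
    return float((mu[None, :] * P * np.log(ratio)).sum())

def solve_constrained(mu, U, c, lo=0.5, hi=50.0, steps=50):
    for _ in range(steps):                  # bisect on the multiplier
        lam = 0.5 * (lo + hi)
        P, q = solve_unconstrained(mu, U, lam)
        lo, hi = (lam, hi) if info_cost(mu, P, q) > c else (lo, lam)
    return lam, P, q

lam1, P1, q1 = solve_constrained(mu, U, c=0.4)       # original stakes
lam5, P5, q5 = solve_constrained(mu, 5 * U, c=0.4)   # stakes scaled by rho = 5

# Unconstrained comparison at the *fixed* multiplier lam1, scaled stakes:
Pu, qu = solve_unconstrained(mu, 5 * U, lam1)
```

The constrained choice probabilities coincide while the implied multiplier is (approximately) five times larger; at the fixed λ, by contrast, the unconstrained posteriors become far more extreme.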

Optimality of 'No Information'
For large values of λ, the solution of the unconstrained problem often involves the DM choosing not to be informed at all, as demonstrated by the example. Indeed, that was the case for any prior µ, except for µ = 1/3. On the other hand, since the budget constraint must bind, choosing no information cannot be optimal in the constrained problem for any c > 0.
In light of Lemma 1, this gap between the two versions occurs if and only if the limit λ̄ := lim_{c→0+} λ(c) is finite. Indeed, there is a zero-cost (i.e., uninformative) solution to the unconstrained problem with λ if and only if λ ≥ λ̄. In the next proposition, we characterize those decision problems for which λ̄ is finite and show that this is the 'typical' case.
Formally, we say that P is uninformative if P(·|ω) is the same for all ω ∈ Ω, or, equivalently, if the posterior is equal to the prior with probability 1. Additionally, given (A, u), we say that the prior µ is an indifference point if there are two different actions a, a′ ∈ A such that both a and a′ are optimal at belief µ, i.e., {a, a′} ⊆ arg max_b ∑_ω µ(ω)u(b, ω). Note that, for a given (A, u), the set of indifference points is "small" in ∆(Ω); e.g., it is nowhere dense and has Lebesgue measure zero in ∆(Ω) viewed as a subset of R^(|Ω|−1).

Proposition 7.
Suppose that µ is not an indifference point of (A, u) and let a* be the (unique) optimal action given belief µ. Then the limit λ̄ = lim_{c→0+} λ(c) is finite and, for every λ > λ̄, the unique solution to the unconstrained problem is given by P(a*|ω) = 1 for all ω. If µ is an indifference point, then λ̄ = +∞ and uninformative P's are never optimal.
Intuitively, when λ is large, condition (4), which characterizes the solution to the unconstrained problem, requires that the posteriors be close to each other and, therefore, close to the prior µ. Thus, if µ is not an indifference point, then, for large enough λ, the action that is optimal at µ is also optimal at all posteriors. But then obtaining no information yields the same expected utility at a lower cost. Conversely, when µ is an indifference point, the marginal value of a little information is positive, since it allows the DM to learn which of the a priori optimal actions is better. The marginal cost of a little information is zero due to the smoothness of the entropy function. Therefore, it is not optimal to obtain no information.
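The regime switch described in Proposition 7 is easy to reproduce numerically (a hypothetical sketch using the reconstructed example payoffs): at a prior where m is uniquely optimal, a large λ makes the unconstrained solution fully uninformative, while a small λ does not.

```python
import numpy as np

U = np.array([[1.0, -2.0], [0.0, 0.0], [-2.0, 1.0]])  # example payoffs l, m, r
mu = np.array([0.6, 0.4])  # Pr(w1) = 0.4: not an indifference point; m is optimal

def solve_unconstrained(mu, U, lam, iters=4000):
    Z = np.exp(U / lam)
    q = np.full(U.shape[0], 1.0 / U.shape[0])
    for _ in range(iters):
        P = q[:, None] * Z
        P /= P.sum(axis=0, keepdims=True)
        q = P @ mu
    return P, q

def info_cost(mu, P, q):
    ratio = (P + 1e-300) / (q[:, None] + 1e-300)
    return float((mu[None, :] * P * np.log(ratio)).sum())

P_hi, q_hi = solve_unconstrained(mu, U, lam=10.0)  # lam above the threshold
P_lo, q_lo = solve_unconstrained(mu, U, lam=1.0)   # lam below the threshold
```

At the large λ the DM plays the prior-optimal safe action for sure and acquires no information, which is exactly the behavior that can never arise in the constrained problem.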
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Conflicts of Interest: The author declares no conflict of interest.

Appendix A. Proofs
Appendix A.1. Notation
The following notation and remarks will be used in several of the proofs below. We use P to denote the state-dependent stochastic choice of the DM, i.e., P : Ω → ∆(A). The distribution over posteriors induced by P is denoted by Γ(P). Recall that the support of Γ(P) is the collection {γ_a}_{a∈B(P)}, where

γ_a(ω) = µ(ω)P(a|ω) / P(a),

and P(a) = ∑_ω µ(ω)P(a|ω) is the probability assigned to posterior γ_a. Given P, P′ and α ∈ [0, 1], the mixture of the induced distributions over posteriors αΓ(P) + (1 − α)Γ(P′) is defined as usual: the support of the mixture is the union of the supports of Γ(P) and Γ(P′), and the probability of each γ in the support is the corresponding average of the probabilities of γ in the two distributions.
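As an illustration (our own sketch, with made-up numbers), the objects just defined are mechanical to compute: given a channel P(a|ω) and a prior µ, the snippet below recovers the action marginal P(a) and the posteriors γ_a, and checks Bayes-plausibility, ∑_a P(a)γ_a = µ.

```python
def induced_posteriors(P_cond, mu):
    """Given P(a|w) (rows indexed by state w) and a prior mu, return the
    action marginal P(a) and the posteriors gamma_a(w) = mu(w)P(a|w)/P(a)."""
    n_w, n_a = len(P_cond), len(P_cond[0])
    p = [sum(mu[w] * P_cond[w][a] for w in range(n_w)) for a in range(n_a)]
    gammas = [[mu[w] * P_cond[w][a] / p[a] for w in range(n_w)]
              if p[a] > 0 else None
              for a in range(n_a)]
    return p, gammas

mu = [0.5, 0.5]
P_cond = [[0.8, 0.2],   # behavior in state w0
          [0.3, 0.7]]   # behavior in state w1
p, gammas = induced_posteriors(P_cond, mu)
# Bayes-plausibility: the posteriors average back to the prior.
avg = [sum(p[a] * gammas[a][w] for a in range(2)) for w in range(2)]
print(p, gammas, avg)
```

Here P(a_0) = 0.55, and the posteriors average back to the uniform prior, as the decomposition requires.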
Remark A1. Suppose that P solves either the constrained or the unconstrained problem. If a ∈ B(P), then choosing a must be optimal given belief γ_a, that is, a ∈ arg max_b ∑_ω γ_a(ω)u(b, ω). Indeed, if that were not the case, then choosing the same distribution over posteriors but playing some action from the arg max would increase the utility of the DM without changing the cost of information. Furthermore, a must be the unique optimal action given belief γ_a. Indeed, if there are multiple optimal actions at some induced posterior, then this is inconsistent with optimality, as shown in the proof of Proposition 7.
Remark A2. Given P and P′, if a ∈ B(P) ∩ B(P′) and the posterior γ_a induced by P and the posterior γ′_a induced by P′ are not equal, then the mixture αΓ(P) + (1 − α)Γ(P′) has two different posteriors associated with the same action a. In particular, if P and P′ are both optimal in the unconstrained problem, and if a ∈ B(P) ∩ B(P′), then γ_a = γ′_a. Indeed, if P and P′ are optimal, then so is any mixture of Γ(P) and Γ(P′).
Given a decision problem (µ, A, u), we write V(P) = ∑_ω µ(ω) ∑_{a∈A} P(a|ω)u(a, ω) for the expected utility that the DM obtains with choice P. Notice that this can also be expressed using Γ(P) as V(P) = ∑_{a∈B(P)} P(a) ∑_ω γ_a(ω)u(a, ω).

Appendix A.2. Proofs of Propositions
Proof of Proposition 1. Fix a decision problem (µ, A, u) and c ∈ (0, H(µ)). We prove here that, if P is optimal for the constrained problem, then the budget constraint (1) binds. As explained in the text, the rest of the argument is a standard application of the KKT theorem.
For each ω ∈ Ω, let A*(ω) = arg max_a u(a, ω). Denote V̄ = ∑_ω µ(ω)u(a*(ω), ω), where a*(ω) ∈ A*(ω). Then a choice P gives the DM expected utility of V̄ if and only if ∑_{a∈A*(ω)} P(a|ω) = 1 for every ω. Our assumption that different actions are optimal at different states means that A*(ω) ∩ A*(ω′) = ∅ whenever ω ≠ ω′. It follows that, if P achieves V̄, then P reveals the realized state with probability one, and therefore C(P) = H(µ). Now, suppose that P is such that the constraint is slack, C(P) < c. Because c < H(µ), it follows from the previous paragraph that V(P) < V̄. For ε > 0, define P(ε) = εP̂ + (1 − ε)P, where P̂ satisfies V(P̂) = V̄. Then V(P(ε)) = εV̄ + (1 − ε)V(P) > V(P), and, since C is continuous in P, we have C(P(ε)) < c for ε small enough, so P(ε) is feasible. This shows that P is not a solution to the constrained problem.
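The mixing step at the end of this proof can be checked on a concrete instance (a hypothetical two-state example of ours, not from the paper): blending a small weight ε of a fully revealing strategy into a strategy with slack capacity raises expected utility linearly, while continuity of the mutual-information cost keeps the constraint slack.

```python
import math

def V(P_cond, mu, u):
    """Expected utility: sum_w mu(w) sum_a P(a|w) u(a,w)."""
    return sum(mu[w] * P_cond[w][a] * u[a][w]
               for w in range(len(mu)) for a in range(len(u)))

def C(P_cond, mu):
    """Mutual information between state and action (in nats)."""
    n_w, n_a = len(P_cond), len(P_cond[0])
    p = [sum(mu[w] * P_cond[w][a] for w in range(n_w)) for a in range(n_a)]
    return sum(mu[w] * P_cond[w][a] * math.log(P_cond[w][a] / p[a])
               for w in range(n_w) for a in range(n_a) if P_cond[w][a] > 0)

u = [[1.0, 0.0], [0.0, 1.0]]          # action a_i is correct in state w_i
mu = [0.5, 0.5]
P     = [[0.7, 0.3], [0.3, 0.7]]      # partially informative strategy
P_hat = [[1.0, 0.0], [0.0, 1.0]]      # fully revealing, attains V-bar
eps = 0.05
P_mix = [[eps * P_hat[w][a] + (1 - eps) * P[w][a] for a in range(2)]
         for w in range(2)]

c = C(P, mu) + 0.05                   # a capacity at which P has slack
print(V(P_mix, mu, u) > V(P, mu, u), C(P_mix, mu) < c)  # → True True
```

With these numbers, V rises from 0.7 to 0.715 while the cost of the mixture stays below the capacity, so the slack strategy cannot have been optimal.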
Proof of Lemma 1. Fix c and let P be a solution to the constrained problem. By Proposition 1, the budget constraint binds, implying that P must be informative. In particular, there must be two actions a ≠ b with a, b ∈ B(P). Moreover, Proposition 1 implies that there is λ > 0 such that conditions (4) and (5) are satisfied. From (4), we have that, for every ω, γ_a(ω)/γ_b(ω) = exp((u(a, ω) − u(b, ω))/λ) or, after rearranging, ln γ_a(ω) − ln γ_b(ω) = (u(a, ω) − u(b, ω))/λ. Because we assumed that actions are not duplicates, there exists some ω at which u(a, ω) − u(b, ω) ≠ 0, which implies that the right-hand side of the last equation is strictly monotone in λ. It follows that λ is pinned down uniquely by P. Denote this λ by λ(c). Now, suppose that P′ also solves the constrained problem with the same c. By Proposition 1, P′ solves the unconstrained problem with some λ′ > 0. We claim that it must be the case that λ′ = λ(c). Indeed, V(P′) = V(P), since they are both optimal in the constrained problem, and, since the budget constraint binds, C(P′) = c = C(P). This implies that P is also optimal for the unconstrained problem with λ′, so, by the previous paragraph, λ′ = λ(c).
Next, we prove that λ(c) is (weakly) decreasing. Let c′ > c and suppose, per absurdum, that λ(c′) > λ(c). Let P be optimal for the constrained problem with c and P′ optimal for the constrained problem with c′. Then

V(P′) − V(P) ≤ λ(c)C(P′) − λ(c)C(P) = λ(c)(c′ − c) < λ(c′)(c′ − c) = λ(c′)C(P′) − λ(c′)C(P),

where the first inequality follows from P being optimal for the unconstrained problem with λ(c), the next equality holds since the budget constraints bind, the strict inequality is by the assumptions that c′ > c and λ(c′) > λ(c), and the last equality is again by the binding budget constraints. Rearranging gives V(P′) − λ(c′)C(P′) < V(P) − λ(c′)C(P), contradicting the optimality of P′ for the unconstrained problem with λ(c′). Finally, we show that the image of λ(c) contains the entire open interval (0, lim_{c→0+} λ(c)). Combining this with the monotonicity proved above implies both continuity and lim_{c→H(µ)−} λ(c) = 0. Suppose that λ̂ is in this interval. Let P be optimal for the unconstrained problem with λ̂. Notice that C(P) = 0 is impossible, since that would imply that any λ > λ̂ also has an uninformative solution, contradicting the assumption that λ̂ < lim_{c→0+} λ(c). It also cannot be that C(P) = H(µ), since that would require the posteriors to be at the vertices of the simplex, contradicting (4). Thus, 0 < C(P) < H(µ), and it follows from Proposition 1 that P is also optimal for the constrained problem with c = C(P). Therefore, λ̂ = λ(C(P)), i.e., λ̂ is in the image. This completes the proof.
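The inverse relationship between capacity and the multiplier can also be traced numerically. The sketch below is our own, with hypothetical state-matching payoffs (note that the uniform prior here is an indifference point, so an informative solution exists for every λ): it solves the unconstrained problem on a grid of λ values via a Blahut–Arimoto-style fixed-point iteration and confirms that the attained capacity C(P_λ) falls as λ rises, mirroring the decreasing map c ↦ λ(c).

```python
import math

def attained_capacity(u, mu, lam, n_iter=4000):
    """Solve the unconstrained problem by fixed-point iteration
    (P(a|w) proportional to P(a) * exp(u[a][w] / lam)) and return the
    mutual information C(P) of the solution, in nats."""
    n_a, n_w = len(u), len(mu)
    p = [1.0 / n_a] * n_a
    for _ in range(n_iter):
        cond = []
        for w in range(n_w):
            wts = [p[a] * math.exp(u[a][w] / lam) for a in range(n_a)]
            z = sum(wts)
            cond.append([x / z for x in wts])
        p = [sum(mu[w] * cond[w][a] for w in range(n_w)) for a in range(n_a)]
    return sum(mu[w] * cond[w][a] * math.log(cond[w][a] / p[a])
               for w in range(n_w) for a in range(n_a) if cond[w][a] > 0)

u = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical state-matching payoffs
mu = [0.5, 0.5]
caps = [attained_capacity(u, mu, lam) for lam in (0.2, 0.5, 1.0, 2.0)]
# capacities fall as the marginal cost of attention rises
print(all(caps[i] > caps[i + 1] for i in range(len(caps) - 1)))  # → True
```

Reading the list in reverse recovers λ(c): larger capacities are attained only at smaller multipliers.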

Proof of Proposition 2.
Because there are only two states, we identify distributions over Ω with the probability that they assign to ω_1. Define Λ to be the set of all λ > 0 such that there exists a solution P to the unconstrained problem satisfying B(P) = {a_0, a_1}. We break the proof into several claims.
Claim 1. The set Λ is non-empty and bounded from above.
Proof. If P is optimal and a, b ∈ B(P), then, by (4), the likelihood ratio of the posteriors γ_a and γ_b at each state equals exp((u(a, ω) − u(b, ω))/λ). As λ → 0, we either get γ_b → 0 and γ_a → 1, or vice versa. 16 From Remark A1, a and b must be optimal given beliefs γ_a and γ_b, respectively. This is only possible if {a, b} = {a_0, a_1} when λ is sufficiently small. It is also clear that choosing no information is not optimal for λ small enough, since it is not optimal at λ = 0. Thus, every sufficiently small λ > 0 is in Λ.
On the other hand, as λ → +∞, the ratio γ_a/γ_b converges to 1. Because µ must be in the convex hull of the induced posteriors, all the induced posteriors necessarily converge to µ as λ → +∞ (see the proof of Proposition 7 for details). Additionally, since we assumed that neither a_0 nor a_1 is optimal at µ, these actions are not considered for large enough λ. Thus, Λ is bounded from above.
Proof. First, consider a sequence {λ_n} converging to λ* from below, such that, for each n, there is a solution P_n to the unconstrained problem with λ_n satisfying B(P_n) = {a_0, a_1}. Such a sequence exists by the definition of λ*. By taking a subsequence if needed, we may assume that P_n converges. By the theorem of the maximum, the limit P* is optimal at λ*. In addition, we must have B(P*) = {a_0, a_1}: for every a ≠ a_0, a_1, we have P_n(a) = 0 for all n, implying P*(a) = 0. Additionally, it is impossible that B(P*) = {a_0} or B(P*) = {a_1}, since these actions are not optimal at µ.
Second, let {λ_n} be a sequence converging to λ*, but this time from above. Let P_n be a corresponding solution sequence such that |B(P_n)| ≤ 2 for each n. 17 Then the limit P** = lim_n P_n is optimal at λ*, and, by the definition of λ*, we have B(P_n) ≠ {a_0, a_1} for every n, also implying that B(P**) ≠ {a_0, a_1}.
Claim 3. For P* and P** constructed in Claim 2, C(P*) > C(P**).
Proof. Because H is strictly concave over ∆(Ω), it is sufficient to show that Γ(P*) is a mean-preserving spread of Γ(P**). We consider two different cases. Suppose first that B(P**) ∩ B(P*) = ∅, i.e., neither of the two extreme actions is in the consideration set of P**. Then, by Remark A1, every posterior induced by P** is in between the two posteriors induced by P*. This implies that Γ(P*) is a mean-preserving spread of Γ(P**), as needed.
Proof. Assume, contrary to the claim, that there is a sequence c_n converging to c̄ from below and a corresponding sequence P_n, such that P_n solves the constrained problem with c_n and |B(P_n)| ≤ 2 for every n. Because the budget constraint binds, we have C(P_n) = c_n for all n, so, for n large enough, P_n also solves the unconstrained problem with λ*. Denote B(P_n) = {a(n), b(n)} and the corresponding posteriors by γ_{a(n)}, γ_{b(n)} (it cannot be that |B(P_n)| = 1, since then C(P_n) = 0). Additionally, recall that γ*_{a_0}, γ*_{a_1} are the posteriors induced by P*. There are three cases to consider: First, it cannot be that {a(n), b(n)} = {a_0, a_1}, since, by Remark A2, this would imply γ_{a(n)} = γ*_{a_0} and γ_{b(n)} = γ*_{a_1}, contradicting the assumption that c_n < c̄. Second, suppose that {a(n), b(n)} = {a_0, a} for some action a ≠ a_1. Then, again by Remark A2, this would imply γ_{a(n)} = γ*_{a_0}. Additionally, by Remark A1, the other posterior γ_a is smaller than and bounded away from γ*_{a_1}, contradicting the assumption that c_n → c̄. The argument for {a(n), b(n)} = {a_1, a} is analogous.

Proof of Proposition 3.
For the most part, the proof follows in the footsteps of the proof of Proposition 2. We only provide details where a different argument is needed.
Proof. Suppose, by contradiction, that P is optimal and B(P) ≠ {a_1, . . . , a_m}. From Remark A1, if a_j ∈ B(P), then the associated posterior γ_{a_j} must satisfy γ_{a_j}(ω_j) > t, implying that γ_{a_j}(ω_i) < 1 − t for any i ≠ j. Let i be such that a_i ∉ B(P). Then we obtain a contradiction to the assumption in the proposition.
Define Λ to be the set of all λ > 0 such that there exists a solution P to the unconstrained problem satisfying B(P) = {a_1, . . . , a_m}.
Claim 6. The set Λ is non-empty and bounded from above.
Proof. For o to be considered, condition (5) yields a necessary inequality for every i; summing these inequalities over all i gives a condition that, since t < 1, clearly cannot hold for λ > 0 small enough. To show that Λ is bounded, note that the assumption of the proposition implies that o is the unique optimal action at µ. By Proposition 7, this implies that obtaining no information is the unique optimal choice for all λ large enough.
Proof. The proof is identical to that in Claim 2, except that, to argue that B(P*) = {a_1, . . . , a_m}, we need to use Claim 5.
Proof. We denote by γ*_i the posterior corresponding to a_i induced by P* and by γ**_i the one induced by P** (if a_i ∈ B(P**)).
Because o ∉ B(P*), P* must coincide with the solution given in ([9], Theorem 1) when all of the a_i's are considered. In particular, the induced posteriors are symmetric, in the sense that γ*_i(ω_i) is the same for all i. It follows that V(P*) = γ*_1(ω_1). On the other hand, by Remark A2, we must have γ*_i = γ**_i whenever a_i ∈ B(P**), implying that V(P**) = P**(o)t + (1 − P**(o))γ*_1(ω_1) < γ*_1(ω_1), where the strict inequality follows from o ∈ B(P**) (recall Claim 5) and t < γ*_1(ω_1). Therefore, we get that V(P**) < V(P*), and, since both are optimal for the unconstrained problem with λ*, it must be that C(P*) > C(P**).
Claim 9. Denote c̄ = C(P*). There is c̲ < c̄ such that, if P is optimal for the constrained problem with c ∈ (c̲, c̄), then B(P) = A.
Proof. Suppose, by contradiction, that c_n ↑ c̄ and that, for each n, P_n solves the constrained problem with c_n but B(P_n) ≠ A. Then C(P_n) = c_n, so, for n large enough, P_n also solves the unconstrained problem with λ*. Since c_n < c̄, we cannot have B(P_n) = {a_1, . . . , a_m}, so, by Claim 5, o ∈ B(P_n). The exact same argument as in the previous claim gives V(P_n) = P_n(o)t + (1 − P_n(o))γ*_1(ω_1). Moreover, P_n(o) is bounded away from zero, since µ is not in the convex hull of any strict subset of {γ*_i}^m_{i=1}. This implies that there is δ > 0 such that V(P*) − V(P_n) > δ for every n. However, P* and P_n are both optimal for the unconstrained problem with λ*, so V(P*) − λ*C(P*) = V(P_n) − λ*C(P_n). This contradicts the convergence of c_n = C(P_n) to c̄ = C(P*).
To complete the description of the decision problem, we need to choose the prior µ. By assumption, one of the elements of Γ, say γ̄_1, is in the convex hull of the others. Define µ = (1 − α)γ̄_1 + (α/n) ∑_{i=1}^n γ̄_i for some 0 < α < 1, and consider the set C of values ∑_i q_i H(γ̄_i) over all representations µ = ∑_i q_i γ̄_i of µ as a convex combination of the elements of Γ. We claim that this set contains a non-degenerate interval of values. Indeed, one element of this set is ((1 − α) + α/n)H(γ̄_1) + (α/n) ∑_{i=2}^n H(γ̄_i). Additionally, we have γ̄_1 = ∑_{i=2}^n p̄_i γ̄_i for some probability vector p̄ = {p̄_i}_{i≥2}, so (α/n)H(γ̄_1) + ∑_{i=2}^n ((1 − α)p̄_i + α/n)H(γ̄_i) is also in this set. The strict concavity of H implies that the former is strictly larger than the latter. By taking convex combinations of these two representations of µ, we can get any value in between. Denote c̲ = H(µ) − sup C and c̄ = H(µ) − inf C. Note that 0 < c̲ < c̄ < H(µ).
Finally, consider the constrained problem with c ∈ (c̲, c̄). Then, by the definition of c̲ and c̄, there are strictly positive weights {P(i)} such that µ = ∑_i P(i)γ̄_i and H(µ) − ∑_i P(i)H(γ̄_i) = c. We claim that this distribution over posteriors is optimal. Indeed, it gives an expected utility given by a chain in which the first equality is by the definition of u and the affinity of the f_j's, the inequality is obvious, the next equality follows from f_i(γ̄_i) = f(γ̄_i), and the last equality is by the construction of {P(i)}. On the other hand, for any feasible distribution over posteriors, a similar chain applies: the first equality is again by the definition of u and the affinity of the f_j's, the inequality follows from f_j ≤ f for all j, and the last inequality follows from feasibility (the reduction of entropy must be at most c). Moreover, if γ_i ∉ Γ for some i, then f_j(γ_i) < f(γ_i) for every j, so the first inequality is strict. This completes the proof.
Proof of Proposition 7. From Proposition 1 and Lemma 1, it immediately follows that λ̄ is equal to the infimum of the set of λ's for which an uninformative P is optimal in the unconstrained problem. Thus, to prove the proposition, it is enough to show that µ is an indifference point if and only if this set is empty.
We start by showing that, if µ is not an indifference point, then, for λ large enough, the optimal solution to the unconstrained problem is uninformative. For this proof, we view ∆(Ω) as the unit simplex of R^Ω, endowed with the metric d(γ, γ′) = max_ω |γ(ω) − γ′(ω)|. Let K be the set of beliefs at which a* is the unique optimal action. Because K is relatively open in ∆(Ω) and µ ∈ K, there is δ > 0 such that γ ∈ K whenever d(γ, µ) < δ.
Let M = max_{a,b} max_ω {u(a, ω) − u(b, ω)} > 0. Suppose that λ is large enough that exp(M/λ) < 1 + δ. If P is optimal for the unconstrained problem with λ, then, for every a, b ∈ B(P) and every ω,

γ_a(ω)/γ_b(ω) = exp((u(a, ω) − u(b, ω))/λ) ≤ exp(M/λ), (A1)

where the first equality is by (4) and the next inequality is by the definition of M. We thus get

|γ_a(ω) − γ_b(ω)| = γ_b(ω)|γ_a(ω)/γ_b(ω) − 1| ≤ γ_b(ω)(exp(M/λ) − 1) < δγ_b(ω), (A2)

where, in the first equality, we used the fact that optimal posteriors are never on the boundary of ∆(Ω) (implied by condition (4)), and the following inequality is by (A1). Now, since µ = ∑_{b∈B(P)} P(b)γ_b, we also have, for every a ∈ B(P) and every ω,

|γ_a(ω) − µ(ω)| ≤ ∑_{b∈B(P)} P(b)|γ_a(ω) − γ_b(ω)|.

Taking the maximum over ω gives d(γ_a, µ) < δ, where the last inequality is from (A2). By the construction of δ, this implies that γ_a ∈ K for every a ∈ B(P). Thus, by Remark A1, P induces a unique posterior and is therefore uninformative.
In the other direction, we now claim that, if µ is an indifference point, then an uninformative P cannot be a solution of the unconstrained problem with any λ. Indeed, let a, a′ be two optimal actions given belief µ. Consider P̄, given by P̄(a|ω) = P̄(a′|ω) = 0.5 for every ω. Note that P̄ is optimal among the set of uninformative P's. However, we have B(P̄) = {a, a′} and γ_a = γ_{a′} = µ. Because a and a′ are not duplicates, condition (4) cannot hold for any λ > 0, implying that P̄ is not optimal. 18

Appendix A.3. Proofs of Claims in the Example
We start with the following lemma.

Proof. (i) Condition (4) with a = l and b = r requires one equation in ω_1 and one in ω_0. It is immediate to verify that these two equations are equivalent to γ_l = 1 − γ_r. This pins down λ at λ* and, consequently, the posteriors γ_l, γ_m, γ_r, as above.
Suppose now that λ > λ*. We need to show that obtaining no information is optimal. Condition (4) is trivially satisfied, so we only need to check (5) for a = m and b = l, r. For b = l, this yields an inequality that is equivalent to λ ≥ λ*. For b = r, we get the exact same condition. Finally, when λ = λ*, it follows from the last two paragraphs that both B(P) = {m} and B(P) = {l, r} are still optimal. Because the set of optimal distributions over posteriors is convex, it follows that any mixture of these solutions is optimal as well. Note that, at λ = λ*, the posteriors of the risky actions in the solution are given by γ_l = 1 − γ_r = γ*.

Next, condition (4) is satisfied by Lemma A1 (i). Furthermore, since H(0.5) − H(γ_l) = c ≥ H(0.5) − H(γ*), we know that γ* ≥ γ_l, which implies that λ ≤ λ*. Thus, (5) holds for a = l, r and b = m, in the same way as in the case λ < λ* of the previous claim. The probabilities are adjusted so that P(l)γ_l + P(m)γ_m = µ. Finally, for λ ≥ λ̄(µ), it is optimal to obtain no information, and the consideration set is either {l} or {m} when µ is below or above 1/3, respectively.
Proof. The proof for λ < λ* is identical to the µ = 0.5 case. Suppose that λ* < λ < λ̄(µ); we prove that B(P) = {l, m} is optimal. By Lemma A1 (ii), condition (4) holds for the stated posteriors. We need to check condition (5): for a = l, it boils down to λ* ≤ λ and, for a = m, the condition is, again, equivalent to λ* ≤ λ. Finally, we need µ to be in the interior of the convex hull of γ_l and γ_m. It is not hard to check that this is equivalent to λ < λ̄(µ). Note that, at λ = λ*, both the solution with B(P) = {l, r} and the one with B(P) = {l, m} are optimal, and therefore any mixture is optimal as well. This implies that B(P) = {l, m, r} is also optimal.
Proof. Fix λ < λ̄(µ) and set the posteriors as in the claim. Then, (4) holds by Lemma A1 (i). As in the previous cases, condition (5) with a = l, r and b = m is equivalent to λ ≤ λ*; for µ in the current range, it is easy to check that λ̄(µ) ≤ λ*, so the condition holds by assumption. Finally, we need µ ∈ (γ_l, γ_r), which is equivalent to λ < λ̄(µ).