Optimal Decision Rules in Repeated Games Where Players Infer an Opponent's Mind via Simplified Belief Calculation

Abstract: In strategic situations, humans infer the state of mind of others, e.g., emotions or intentions, adapting their behavior appropriately. Nonetheless, evolutionary studies of cooperation typically focus only on reaction norms, e.g., tit for tat, whereby individuals make their next decisions by only considering the observed outcome rather than focusing on their opponent's state of mind. In this paper, we analyze repeated two-player games in which players explicitly infer their opponent's unobservable state of mind. Using Markov decision processes, we investigate optimal decision rules and their performance in cooperation. The state-of-mind inference requires Bayesian belief calculations, which are computationally intensive. We therefore study two models in which players simplify these belief calculations. In Model 1, players adopt a heuristic to approximately infer their opponent's state of mind, whereas in Model 2, players use information regarding their opponent's previous state of mind, obtained from external evidence, e.g., emotional signals. We show that players in both models reach almost optimal behavior through commitment-like decision rules by which players are committed to selecting the same action regardless of their opponent's behavior. These commitment-like decision rules can enhance or reduce cooperation depending on the opponent's strategy.


Introduction
Although evolution and rationality apparently favor selfishness, animals, including humans, often form cooperative relationships, each participant paying a cost to help one another. It is therefore a universal concern in the biological and social sciences to understand what mechanisms promote cooperation. If individuals are kin, kin selection fosters their cooperation via inclusive fitness benefits [1,2]. If individuals are non-kin, establishing cooperation between them is a more difficult problem. Studies of the Prisoner's Dilemma (PD) game and its variants have revealed that repeated interactions between a fixed pair of individuals facilitate cooperation via direct reciprocity [3-5]. A well-known example of such a reciprocal strategy is Tit For Tat (TFT), whereby a player cooperates with the player's opponent only if the opponent has cooperated in the previous stage. If one's opponent obeys TFT, it is better to cooperate because in the next stage, the opponent will cooperate, and the cooperative interaction continues; otherwise, the opponent will not cooperate, and one's total future payoff will decrease. Numerous experimental studies have shown that humans cooperate in repeated PD games if the likelihood of future stages is sufficiently large [6].
In evolutionary dynamics, TFT is a catalyst for increasing the frequency of cooperative players, though it is not evolutionarily stable [7]. Some variants of TFT, however, are evolutionarily stable; Win Stay Lose Shift (WSLS) is one such example, in which a player cooperates with the player's opponent only if the outcome of the previous stage of the game has been mutual cooperation or mutual defection [8]. TFT and WSLS are instances of so-called reaction norms, in which a player selects an action as a reaction to the outcome of the previous stage, i.e., the previous pair of actions selected by the player and the opponent [9]. In two-player games, a reaction norm is specified by the conditional probability p(a|a′, r′) by which a player selects the next action a depending on the previous actions of the player and the opponent, i.e., a′ and r′, respectively.
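For illustration, the reaction norms above can be written as conditional-probability tables p(a|a′, r′); the following Python sketch (our own notation, not part of the original model) shows the deterministic, error-free versions of TFT and WSLS.

```python
# Reaction norms p(a | a', r'): probability that a player cooperates next,
# given the previous actions a' (own) and r' (opponent's).
# Error-free, deterministic versions of TFT and WSLS.

# TFT: cooperate iff the opponent cooperated in the previous stage.
TFT = {('C', 'C'): 1.0, ('C', 'D'): 0.0, ('D', 'C'): 1.0, ('D', 'D'): 0.0}

# WSLS: cooperate iff the previous outcome was (C, C) or (D, D).
WSLS = {('C', 'C'): 1.0, ('C', 'D'): 0.0, ('D', 'C'): 0.0, ('D', 'D'): 1.0}

def next_action(norm, prev_own, prev_opp, u):
    """Sample the next action; u is a uniform random draw in [0, 1)."""
    return 'C' if u < norm[(prev_own, prev_opp)] else 'D'
```

In this table form, a stochastic reaction norm is obtained simply by replacing the 0/1 entries with probabilities.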
Studies of cooperation in repeated games typically assume reaction norms as the subject of evolution. A problem with this assumption is that it describes behavior as a black box in which an action is merely a mechanical response to the previous outcome; however, humans and, more controversially, non-human primates have a theory of mind by which they infer the state of mind (i.e., emotions or intentions) of others and use these inferences as pivotal pieces of information in their own decision-making processes [10,11]. As an example, people tend to cooperate more when they are cognizant of another's good intentions [6,12]. Moreover, neurological bases of intention or emotion recognition have been found [13-17]. Despite the behavioral and neurological evidence, there is still a need for a theoretical understanding of the role of such state-of-mind recognition in cooperation; to the best of our knowledge, only a few studies have focused on examining the interplay between state-of-mind recognition and cooperation [18,19].
From the viewpoint of state-of-mind recognition, the above reaction norm can be decomposed as

p(a|a′, r′) = ∑_s p(a|s) p(s|a′, r′),   (1)

where s represents the opponent's state of mind. Equation (1) contains two modules. The first module, p(s|a′, r′), handles the state-of-mind recognition; given the observed previous actions a′ and r′, a player infers that the player's opponent is in state s with probability p(s|a′, r′), i.e., a belief, and thinks that the opponent will select some action depending on this state s. The second module, p(a|s), controls the player's decision-making; the player selects action a with probability p(a|s), which is a reaction to the inferred state of mind s of the opponent. In our present study, we are motivated to clarify what decision rule, i.e., the second module, is plausible and how it behaves in cooperation when a player infers an opponent's state of mind via the first module. To do so, we use Markov Decision Processes (MDPs), which provide a powerful framework for predicting optimal behavior in repeated games when players are forward-looking [20,21]. MDPs even predict (pure) evolutionarily stable states in evolutionary game theory [22]. The core of MDPs is the Bellman Optimality Equation (BOE); by solving the BOE, a player obtains the optimal decision rule, called the optimal policy, that maximizes the player's total future payoff. Solving a BOE with beliefs, however, requires complex calculations and is therefore computationally expensive. Rather than solving the BOE naively, we instead introduce approximations of the belief calculation that we believe to be more biologically realistic. We introduce two models to do so and examine the possibility of achieving cooperation as compared to a null model (introduced in Section 2.2.1) in which a player directly observes an opponent's state of mind. In the first model, we assume that a player believes that an opponent's behavior is deterministic, such that the opponent's actions are directly (i.e., one-to-one) related to the opponent's states. A rationale for this approximation is that in many complex problems, people use simple heuristics to make fast decisions, for example, a rough estimation of an uncertain quantity [23,24]. In the second model, we assume that a player correctly senses an opponent's previous state of mind, although the player does not know the present state of mind. This assumption could be based on some external clue provided by emotional signaling, such as facial expressions [25]. We provide the details of both models in Section 2.2.2.
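The decomposition in Equation (1) can be sketched directly in code. The numbers below are hypothetical and serve only to show how the inference module and the decision module combine into a reaction norm by marginalizing over the opponent's state of mind.

```python
# Decomposition of a reaction norm:
#   p(a | a', r') = sum_s p(a | s) p(s | a', r').
# Toy example (hypothetical numbers): the opponent is believed Happy (H)
# mostly after mutual cooperation, and the player mostly cooperates iff
# the opponent is believed Happy.

STATES = ('H', 'U')

def inference(prev_own, prev_opp):
    """First module p(s | a', r'): belief over the opponent's state of mind."""
    p_happy = 0.9 if (prev_own, prev_opp) == ('C', 'C') else 0.1
    return {'H': p_happy, 'U': 1.0 - p_happy}

def decision(state):
    """Second module p(C | s): probability of cooperating given state s."""
    return {'H': 0.95, 'U': 0.05}[state]

def reaction_norm(prev_own, prev_opp):
    """p(C | a', r') obtained by marginalizing over the state of mind."""
    belief = inference(prev_own, prev_opp)
    return sum(decision(s) * belief[s] for s in STATES)
```

The black-box reaction norm is thus recovered exactly, while the two modules remain individually interpretable.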

Analysis Methods
We analyze an MDP of an infinitely repeated two-player game in which an agent selects optimal actions in response to an opponent who behaves according to a reaction norm and has an unobservable state of mind. For the opponent's behavior, we focus on four major reaction norms, i.e., Contrite TFT (CTFT), TFT, WSLS and Grim Trigger (GRIM), all of which are stable and/or cooperative strategies [4,8,26-29]. These reaction norms can be modeled using Probabilistic Finite-State Machines (PFSMs).

Model
For each stage game, the agent and the opponent select actions, these actions being either cooperation (C) or defection (D). We denote the set of possible actions of both players by A = {C, D}. When the agent selects action a ∈ A and the opponent selects action r ∈ A, the agent gains stage-game payoff f(a, r). The payoff matrix of the stage game is given by

f(C, C) = 1, f(C, D) = S, f(D, C) = T, f(D, D) = 0,   (2)

where the four outcomes, i.e., mutual cooperation ((the agent's action, the opponent's action) = (C, C)), one-sided cooperation ((C, D)), one-sided defection ((D, C)) and mutual defection ((D, D)), yield payoffs 1, S, T and 0, respectively. Depending on S and T, the payoff matrix (2) yields different classes of the stage game. If 0 < S < 1 and 0 < T < 1, it yields the Harmony Game (HG), which has a unique Nash equilibrium of mutual cooperation. If S < 0 and T > 1, it yields the PD game, which has a unique Nash equilibrium of mutual defection.
If S < 0 and 0 < T < 1, it yields the Stag Hunt (SH) game, which has two pure Nash equilibria, one being mutual cooperation and the other mutual defection. If S > 0 and T > 1, it yields the Snowdrift Game (SG), which has a mixed-strategy Nash equilibrium, with both mutual cooperation and mutual defection being unstable. Given the payoff matrix, the agent's purpose at each stage t is to maximize the agent's expected discounted total payoff E[∑_{τ=0}^{∞} β^τ f(a_{t+τ}, r_{t+τ})], where a_{t+τ} and r_{t+τ} are the actions selected by the agent and the opponent at stage t + τ, respectively, and β ∈ [0, 1) is a discount rate. The opponent's behavior is represented by a PFSM, which is specified via probability distributions φ and w. At each stage t, the opponent is in some state s_t ∈ S and selects action r_t with probability φ(r_t|s_t). Next, the opponent's state changes to the next state s_{t+1} ∈ S with probability w(s_{t+1}|a_t, s_t). We study four types of two-state PFSMs as the opponent's model, these being Contrite Tit for Tat (CTFT), Tit for Tat (TFT), Win Stay Lose Shift (WSLS) and Grim Trigger (GRIM). We illustrate the four types of PFSMs in Figure 1 and list all probabilities φ and w in Table 1. Here, the opponent's state is either Happy (H) or Unhappy (U), i.e., S = {H, U}. Note that H and U are merely labels for these states. An opponent obeying the PFSMs selects action C in state H and action D in state U with a stochastic error, i.e., φ(C|H) = 1 − ε (hence, φ(D|H) = ε) and φ(C|U) = ε (hence, φ(D|U) = 1 − ε), where ε > 0 is a small probability with which the opponent fails to select an intended action. The state transitions w of the four PFSMs are given in Figure 1 and Table 1.
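The stage game and the opponent's noisy action selection can be sketched in Python as follows. This is a minimal sketch: it assumes the R = 1, P = 0 normalization implied by the classification conditions above, and reproduces only the error-free WSLS transition table as an example (the full tables are in Table 1).

```python
# Stage-game payoff with R = 1 and P = 0, classification of the game by
# the one-sided payoffs S and T, and the opponent's noisy state-dependent
# action selection phi.

def payoff(a, r, S, T):
    """f(a, r): agent's stage payoff for actions a (agent) and r (opponent)."""
    return {('C', 'C'): 1.0, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): 0.0}[(a, r)]

def game_class(S, T):
    """Classify the stage game from the one-sided payoffs S and T."""
    if 0 < S < 1 and 0 < T < 1:
        return 'HG'          # Harmony Game
    if S < 0 and T > 1:
        return 'PD'          # Prisoner's Dilemma
    if S < 0 and 0 < T < 1:
        return 'SH'          # Stag Hunt
    if S > 0 and T > 1:
        return 'SG'          # Snowdrift Game
    return 'boundary'

def phi_action(state, eps, u):
    """phi(r | s): intend C in H and D in U, failing with probability eps.
    u is a uniform random draw in [0, 1)."""
    intended = 'C' if state == 'H' else 'D'
    other = 'D' if intended == 'C' else 'C'
    return intended if u >= eps else other

# Error-free WSLS state transitions w(s_{t+1} | a_t, s_t) as a lookup.
WSLS_W = {('C', 'H'): 'H', ('C', 'U'): 'U', ('D', 'H'): 'U', ('D', 'U'): 'H'}
```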

Bellman Optimality Equations
Because the repeated game we consider is Markovian, the optimal decision rules, or policies, are obtained by solving the appropriate BOEs. Here, we introduce three different BOEs for the repeated two-player games assuming complete (see Section 2.2.1) and incomplete (see Section 2.2.2) information about the opponent's state. Table 2 summarizes the available information about the opponent in these three models.

Complete Information Case (Model 0)

In our first scenario, which we call Model 0, we assume that the agent knows the opponent's present state, as well as which PFSM the opponent obeys, i.e., φ and w. At stage t, the agent selects the sequence of actions {a_{t+τ}}_{τ=0}^{∞} to maximize the expected discounted total payoff

E_{s_t}[∑_{τ=0}^{∞} β^τ f(a_{t+τ}, r_{t+τ})],   (3)

where E_{s_t} is the expectation conditioned on the opponent's present state s_t. Let the value of state s, V(s), be the maximum expected discounted total payoff the agent expects to obtain when the opponent's present state is s, given that the agent obeys the optimal policy and thus selects optimal actions in the following stage games. Here, V(s_t) is represented by the maximum of Equation (3) over the action sequence, which has the recursive relationship

V(s_t) = max_{a_t} ∑_{r_t} φ(r_t|s_t) [ f(a_t, r_t) + β ∑_{s_{t+1}} w(s_{t+1}|a_t, s_t) V(s_{t+1}) ].   (4)

The BOE when the opponent's present state is known is therefore represented as

V^{(0)}(s_t) = max_{a_t} ∑_{r_t} φ(r_t|s_t) [ f(a_t, r_t) + β ∑_{s_{t+1}} w(s_{t+1}|a_t, s_t) V^{(0)}(s_{t+1}) ],   (5)

where we rewrite V as V^{(0)} for later convenience. Equation (5) reads that if the agent obeys the optimal policy, the value of having the opponent in state s_t (i.e., the left-hand side) is the sum of the expected immediate reward when the opponent is in state s_t and the expected value, discounted by β, of having the opponent in the next state s_{t+1} (i.e., the right-hand side). Note that the time subscripts in the BOEs derived hereafter (i.e., Equations (5), (10) and (12)) can be omitted because they hold true for any game stage, i.e., for any t.
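Equation (5) can be solved by standard value iteration. The sketch below is ours, with illustrative parameters (a PD with S = −0.2, T = 1.2, β = 0.8, action error ε = 0.1, WSLS opponent with deterministic transitions); it is not the paper's exact computation, but it shows how the optimal Model 0 policy falls out of the BOE.

```python
# Value iteration for the Model 0 BOE, Equation (5):
#   V(s) = max_a sum_r phi(r|s) [ f(a, r) + beta * V(w(a, s)) ],
# where the WSLS transition w is taken deterministic here for simplicity.

S_PAY, T_PAY, BETA, EPS = -0.2, 1.2, 0.8, 0.1
ACTIONS, STATES = ('C', 'D'), ('H', 'U')
W = {('C', 'H'): 'H', ('C', 'U'): 'U', ('D', 'H'): 'U', ('D', 'U'): 'H'}  # WSLS

def f(a, r):
    return {('C', 'C'): 1.0, ('C', 'D'): S_PAY,
            ('D', 'C'): T_PAY, ('D', 'D'): 0.0}[(a, r)]

def phi(r, s):
    """Opponent cooperates in H and defects in U, with error EPS."""
    p_c = 1.0 - EPS if s == 'H' else EPS
    return p_c if r == 'C' else 1.0 - p_c

def q(V, a, s):
    """Expected discounted return of action a when the opponent is in state s."""
    return sum(phi(r, s) * (f(a, r) + BETA * V[W[(a, s)]]) for r in ACTIONS)

def solve_model0(n_iter=1000):
    V = {s: 0.0 for s in STATES}
    for _ in range(n_iter):                     # contraction: converges for beta < 1
        V = {s: max(q(V, a, s) for a in ACTIONS) for s in STATES}
    policy = {s: max(ACTIONS, key=lambda a: q(V, a, s)) for s in STATES}
    return V, policy
```

For these parameters the iteration converges to the anticipation policy CD (cooperate in H, defect in U), consistent with the policy classes reported in Section 3.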

Incomplete Information Cases (Models 1 and 2)
For the next two models, we assume that the agent does not know the opponent's state of mind, even though the agent knows which PFSM the opponent obeys, i.e., the agent knows the opponent's φ and w. The agent believes that the opponent's state at stage t is s_t with probability b_t(s_t), which is called a belief. Mathematically, a belief is a probability distribution over the state space. At stage t + 1, b_t is updated to b_{t+1} in a Markovian manner depending on the information available at the previous stage. The agent maximizes the expected discounted total payoff

E_{b_t}[∑_{τ=0}^{∞} β^τ f(a_{t+τ}, r_{t+τ})],   (6)

where E_{b_t} is the expectation based on the present belief b_t. Using the same approach as that of Section 2.2.1 above, the value function V(b_t), when the agent has belief b_t regarding the opponent's state s_t, has the recursive relationship V(b_t) = max_{a_t} E_{b_t}[f(a_t, r_t) + β V(b_{t+1})]. The BOE when the opponent's present state is unknown is then

V(b_t) = max_{a_t} ∑_{s_t} b_t(s_t) ∑_{r_t} φ(r_t|s_t) [ f(a_t, r_t) + β V(b_{t+1}) ],   (7)

where b_{t+1} is the belief at the next stage. Equation (7) reads that if the agent obeys the optimal policy, the value of having belief b_t regarding the opponent's present state s_t (i.e., the left-hand side) is the expected (by belief b_t) sum of the immediate reward and the value, discounted by β, of having the next belief b_{t+1} regarding the opponent's next state s_{t+1} (i.e., the right-hand side).
We can consider various approaches to updating the belief in Equation (7). One approach is to use Bayes' rule, in which case the problem is called a belief MDP [30]. After observing actions a_t and r_t in the present stage, the belief is updated from b_t to b_{t+1} as

b_{t+1}(s_{t+1}) = (∑_{s_t} b_t(s_t) φ(r_t|s_t) w(s_{t+1}|a_t, s_t)) / (∑_{s_t} b_t(s_t) φ(r_t|s_t)).   (8)

Equation (8) is simply derived from Bayes' rule as follows: (i) the numerator (i.e., Prob(r_t, s_{t+1}|b_t, a_t)) is the joint probability that the opponent's present action r_t and next state s_{t+1} are observed, given the agent's present belief b_t and action a_t; and (ii) the denominator (i.e., Prob(r_t|b_t)) is the probability that r_t is observed, given the agent's present belief b_t. Finding an optimal policy via the belief MDP is unfortunately difficult because there are infinitely many beliefs, and the agent must simultaneously solve an infinite number of instances of Equation (7). To overcome this problem, a number of computational approximation methods have been proposed, including grid-based discretization and particle filtering [31,32]. When one views the belief MDP as a biological model of decision-making processes, these computational approximations are likely inapplicable, because animals, including humans, tend to employ simplified practices rather than complex statistical learning methods [24,33,34]. We explore such possibilities in the two models below.
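The exact Bayesian update of Equation (8) is straightforward to implement for a two-state opponent. The sketch below uses illustrative numbers (WSLS transitions, action error ε = 0.1); the table names are ours.

```python
# Bayesian belief update, Equation (8), for a two-state opponent:
# the numerator is Prob(r_t, s_{t+1} | b_t, a_t), the denominator Prob(r_t | b_t).

EPS = 0.1
PHI = {('C', 'H'): 1 - EPS, ('D', 'H'): EPS,
       ('C', 'U'): EPS, ('D', 'U'): 1 - EPS}
WSLS = {('C', 'H'): 'H', ('C', 'U'): 'U', ('D', 'H'): 'U', ('D', 'U'): 'H'}
W = {(s1, a, s): 1.0 if WSLS[(a, s)] == s1 else 0.0
     for s1 in 'HU' for a in 'CD' for s in 'HU'}

def belief_update(b, a, r):
    """Return b_{t+1} after the agent played a and observed the opponent play r."""
    joint = {s1: sum(b[s] * PHI[(r, s)] * W[(s1, a, s)] for s in b)
             for s1 in b}                     # Prob(r_t, s_{t+1} | b_t, a_t)
    norm = sum(joint.values())                # Prob(r_t | b_t)
    return {s1: p / norm for s1, p in joint.items()}
```

For example, starting from the uniform belief and observing mutual cooperation, the posterior mass on H rises from 0.5 to 0.9; iterating this update over a game history is exactly the belief MDP bookkeeping that Models 1 and 2 avoid.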
A Simplification Heuristic (Model 1)

In Model 1, we assume that the agent simplifies the opponent's behavioral model in the agent's mind by believing that the opponent's state-dependent action selection is deterministic; we replace φ(r|s) in Equation (8) with δ_{r,σ(s)}, where δ is Kronecker's delta (i.e., it is one if r = σ(s) and zero otherwise). Here, σ is a bijection that determines the opponent's action r depending on the opponent's present state s, which we define as σ(H) = C and σ(U) = D. Using this simplification heuristic, Equation (8) is greatly reduced to

b_{t+1}(s_{t+1}) = w(s_{t+1}|a_t, σ^{-1}(r_t)),   (9)

where σ^{-1} is the inverse map of σ from actions to states. In Equation (9), the agent infers that the opponent's state changes to s_{t+1} because the agent previously selected action a_t and the opponent was definitely in state σ^{-1}(r_t). Applying a time-shifted Equation (9) to Equation (7), we obtain the BOE that the value of the previous outcome (a_{t-1}, r_{t-1}) should satisfy, i.e.,

V^{(1)}(a_{t-1}, r_{t-1}) = max_{a_t} ∑_{r_t} w(σ^{-1}(r_t)|a_{t-1}, σ^{-1}(r_{t-1})) [ f(a_t, r_t) + β V^{(1)}(a_t, r_t) ],   (10)

where we rewrite V(w(·|a_{t-1}, σ^{-1}(r_{t-1}))) as V^{(1)}(a_{t-1}, r_{t-1}). Here, w represents the belief regarding the opponent's present action r_t, which is approximated by the agent given the previous outcome (a_{t-1}, r_{t-1}). Equation (10) reads that if the agent obeys the optimal policy, the value of having observed the previous outcome (a_{t-1}, r_{t-1}) (i.e., the left-hand side) is the expected (by approximate belief w) sum of the immediate reward and the value, discounted by β, of observing the present outcome (a_t, r_t) (i.e., the right-hand side).
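The heuristic of Equation (9) collapses the full Bayes update to a single table lookup. A minimal sketch (WSLS transitions used as the example, names ours):

```python
# Model 1 heuristic update, Equation (9): the agent believes the opponent's
# action is a deterministic readout of its state (sigma(H) = C, sigma(U) = D),
# so the Bayes update of Equation (8) collapses to one lookup of w.
# w is passed as a dict (next state, agent action, previous state) -> prob.

SIGMA_INV = {'C': 'H', 'D': 'U'}      # inverse map from actions to states

def belief_update_model1(a, r, w):
    """b_{t+1}(s') = w(s' | a_t, sigma^{-1}(r_t))."""
    return {s1: w[(s1, a, SIGMA_INV[r])] for s1 in ('H', 'U')}

# Deterministic WSLS transitions as an example.
WSLS = {('C', 'H'): 'H', ('C', 'U'): 'U', ('D', 'H'): 'U', ('D', 'U'): 'H'}
W = {(s1, a, s): 1.0 if WSLS[(a, s)] == s1 else 0.0
     for s1 in 'HU' for a in 'CD' for s in 'HU'}
```

Note that the heuristic belief depends only on the previous outcome (a, r), which is why the value function in Equation (10) can be indexed by that pair rather than by a continuous belief.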

Use of External Information (Model 2)
In Model 2, we assume that after the two players select actions a_t and r_t in a game stage (now at time t + 1), the agent comes to know or correctly infers the opponent's previous state ŝ_t by using external information. More specifically, b_t(s_t) in Equation (8) is replaced by δ_{ŝ_t,s_t}. In this case, Equation (8) is reduced to

b_{t+1}(s_{t+1}) = w(s_{t+1}|a_t, ŝ_t).   (11)

Applying a time-shifted Equation (11) to Equation (7), we obtain the BOE that the value of the previous pair comprising the agent's action a_{t-1} and the opponent's (inferred) state ŝ_{t-1} should satisfy, i.e.,

V^{(2)}(a_{t-1}, ŝ_{t-1}) = max_{a_t} ∑_{s_t} w(s_t|a_{t-1}, ŝ_{t-1}) ∑_{r_t} φ(r_t|s_t) [ f(a_t, r_t) + β V^{(2)}(a_t, s_t) ],   (12)

where we rewrite V(w(·|a_t, ŝ_t)) as V^{(2)}(a_t, ŝ_t). Because we assume that the previous-state inference is correct, ŝ_{t-1} and ŝ_t in Equation (12) can be replaced by s_{t-1} and s_t, respectively. Equation (12) then reads that if the agent obeys the optimal policy, the value of having observed the agent's previous action a_{t-1} and knowing the opponent's previous state s_{t-1} (i.e., the left-hand side) is the expected (by state transition distribution w) sum of the immediate reward and the value, discounted by β, of observing the agent's present action a_t and getting to know the opponent's present state s_t (i.e., the right-hand side).
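Equation (12) can again be solved by value iteration over the small index set (a′, s′). The sketch below is ours, with illustrative PD parameters (S = −0.2, T = 1.2, β = 0.8, ε = 0.1) and a GRIM opponent with deterministic transitions; for these numbers the iteration recovers the anticipation policy CDDD discussed in Section 3.

```python
# Value iteration for the Model 2 BOE, Equation (12), against a GRIM opponent,
# assuming the previous-state inference is correct and w is deterministic.

S_PAY, T_PAY, BETA, EPS = -0.2, 1.2, 0.8, 0.1
ACTIONS, STATES = ('C', 'D'), ('H', 'U')
GRIM = {('C', 'H'): 'H', ('C', 'U'): 'U', ('D', 'H'): 'U', ('D', 'U'): 'U'}

def f(a, r):
    return {('C', 'C'): 1.0, ('C', 'D'): S_PAY,
            ('D', 'C'): T_PAY, ('D', 'D'): 0.0}[(a, r)]

def phi(r, s):
    p_c = 1.0 - EPS if s == 'H' else EPS
    return p_c if r == 'C' else 1.0 - p_c

def q(V, a, s):
    """Expected return of action a when the opponent's present state is s."""
    return sum(phi(r, s) * (f(a, r) + BETA * V[(a, s)]) for r in ACTIONS)

def solve_model2(n_iter=1000):
    V = {(a, s): 0.0 for a in ACTIONS for s in STATES}
    for _ in range(n_iter):
        V = {(ap, sp): max(q(V, a, GRIM[(ap, sp)]) for a in ACTIONS)
             for ap in ACTIONS for sp in STATES}
    policy = {(ap, sp): max(ACTIONS, key=lambda a: q(V, a, GRIM[(ap, sp)]))
              for ap in ACTIONS for sp in STATES}
    return V, policy
```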

Conditions for Optimality and Cooperation Frequencies
Overall, we are interested in the optimal policy against a given opponent, but identifying such a policy depends on the payoff structure, i.e., Equation (2). We follow the procedure below to search for the payoff structures under which a given policy is optimal:

1. Assume a candidate policy and calculate the corresponding value function from the BOE.
2. Using the obtained value function, determine the payoff conditions under which the policy is consistent with the value function, i.e., the policy is actually optimal.
Figure 2 illustrates how each model's policy uses the available pieces of information: (a) policy π^{(0)} depends on the opponent's present state s; (b) policy π^{(1)} depends on the agent's previous action a′ and the opponent's previous action r′; and (c) policy π^{(2)} depends on the agent's previous action a′ and the opponent's previous state s′. In the figure, solid circles represent the opponent's known states, whereas dotted circles represent the opponent's unknown states; black arrows represent probabilistic dependencies of the opponent's decisions and state transitions. In Appendix A, we describe in detail how to calculate the value functions and payoff conditions for each model.
Next, using the obtained optimal policies, we study to what extent an agent obeying the selected optimal policy cooperates in the repeated game when we assume a model with incomplete information (i.e., Models 1 and 2) versus a model with complete information (i.e., Model 0). To do so, we consider an agent and an opponent playing an infinitely repeated game. In the game, the agent and the opponent fail to select the intended action with probabilities ν and ε, respectively. After a sufficiently large number of stages, the distribution of the states and actions of the two players converges to a stationary distribution. As described in Appendix B, we measure the frequency of the agent's cooperation in the stationary distribution.
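The stationary cooperation frequency can be computed exactly for small chains. The sketch below (our own, with illustrative errors ν = µ = 0.05) solves the chain over the opponent's state for a Model 0 agent obeying the anticipation policy CD against WSLS; to first order in the errors the result matches the 1 − µ − 2ν reported in Section 3.

```python
# Stationary cooperation frequency of a Model 0 agent obeying the
# anticipation policy CD against a WSLS opponent, with agent action error NU
# and opponent state-transition error MU.  The chain over the opponent's
# state is solved by power iteration.

NU, MU = 0.05, 0.05
WSLS = {('C', 'H'): 'H', ('C', 'U'): 'U', ('D', 'H'): 'U', ('D', 'U'): 'H'}
POLICY = {'H': 'C', 'U': 'D'}                  # anticipation in Model 0

def prob_action(s):
    """Agent's action distribution given the opponent's state (error NU)."""
    intended = POLICY[s]
    other = 'D' if intended == 'C' else 'C'
    return {intended: 1.0 - NU, other: NU}

def step_dist(p):
    """One step of the Markov chain over the opponent's state."""
    nxt = {'H': 0.0, 'U': 0.0}
    for s, ps in p.items():
        for a, pa in prob_action(s).items():
            target = WSLS[(a, s)]
            other = 'U' if target == 'H' else 'H'
            nxt[target] += ps * pa * (1.0 - MU)   # intended transition
            nxt[other] += ps * pa * MU            # transition error
    return nxt

def cooperation_frequency(n_iter=200):
    p = {'H': 0.5, 'U': 0.5}
    for _ in range(n_iter):
        p = step_dist(p)                          # converge to stationarity
    return sum(p[s] * prob_action(s).get('C', 0.0) for s in p)
```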
Following the above procedure, all combinations of models, opponent types and optimal policies can be solved straightforwardly, but such proofs are too lengthy to show here; thus, in Appendix C, we demonstrate just one case, in which policy CDDD (see Section 3) is optimal against a GRIM opponent in Model 2.

Results
Before presenting our results, we introduce a shorthand notation to represent policies for each model. In Model 0, a policy is represented by the character sequence a_H a_U, where a_s = π^{(0)}(s) is the optimal action if the opponent is in state s ∈ S. Model 0 has at most four possible policies, namely CC, CD, DC and DD. Policies CC and DD are unconditional cooperation and unconditional defection, respectively. With policy CD, an agent behaves in a reciprocal manner in response to an opponent's present state; more specifically, the agent cooperates with an opponent in state H, the opponent hence cooperating at the present stage, and defects against an opponent in state U, the opponent hence defecting at the present stage. Policy DC is an asocial variant of policy CD: an agent obeying policy DC defects against an opponent in state H and cooperates with an opponent in state U. We call policy CD anticipation and policy DC asocial-anticipation. In Model 1, a policy is represented by the four-letter sequence a_CC a_CD a_DC a_DD, where a_{a′r′} = π^{(1)}(a′, r′) is the optimal action given the agent's and the opponent's actions a′ and r′ at the previous stage. In Model 2, a policy is represented by the four-letter sequence a_CH a_CU a_DH a_DU, where a_{a′s′} = π^{(2)}(a′, s′) is the optimal action given the agent's action a′ and the opponent's state s′ at the previous stage. Models 1 and 2 each have at most sixteen possible policies, ranging from unconditional cooperation (CCCC) to unconditional defection (DDDD).
For each model, we identify four classes of optimal policies, i.e., unconditional cooperation, anticipation, asocial-anticipation and unconditional defection. Figure 3 shows under which payoff conditions each of these policies is optimal, with a comprehensive description of each panel given in Appendix D. An agent obeying unconditional cooperation (i.e., CC in Model 0 or CCCC in Models 1 and 2, colored blue in the figure) or unconditional defection (i.e., DD in Model 0 or DDDD in Models 1 and 2, colored red in the figure) always cooperates or defects, respectively, regardless of the opponent's state of mind. An agent obeying anticipation (i.e., CD in Model 0; CCDC against CTFT, CCDD against TFT, CDDC against WSLS or CDDD against GRIM in Models 1 and 2, colored green in the figure) conditionally cooperates with an opponent only if the agent knows or guesses that the opponent has a will to cooperate, i.e., that the opponent is in state H. As an example, in Model 0, an agent obeying policy CD knows an opponent's current state, cooperating when the opponent is in state H and defecting when in state U. In Models 1 and 2, an agent obeying policy CDDC guesses that an opponent is in state H only if the previous outcome is (C, C) or (D, D), because the opponent obeys WSLS. Since the agent cooperates only if the agent guesses that the opponent is in state H, it is clear that anticipation against WSLS is CDDC. Finally, an agent obeying asocial-anticipation (i.e., DC in Model 0; DDCD against CTFT, DDCC against TFT, DCCD against WSLS or DCCC against GRIM in Models 1 and 2, colored yellow in the figure) behaves in the opposite way to anticipation; more specifically, the agent conditionally cooperates with an opponent only if the agent guesses that the opponent is in state U. This behavior increases the number of (C, D) or (D, C) outcomes, which increases the agent's payoff in SG.
The boundaries that separate the four optimal policy classes are qualitatively the same for Models 0, 1 and 2, which is evident by comparing them column by column in Figure 3, although they are slightly affected by the opponent's errors, i.e., ε and µ, in different ways. These boundaries become identical for the three models in the error-free limit (see Table 5 and Appendix E). This similarity between models indicates that an agent using a heuristic or an external clue to guess an opponent's state (i.e., Models 1 and 2) succeeds in selecting appropriate policies as well as an agent that knows an opponent's exact state of mind (i.e., Model 0). To better understand the effects of the errors, we show the analytical expressions of the boundaries in a one-parameter PD in Appendix F. In Figure 3, the opponent obeys (a-d) CTFT, (e-h) TFT, (i-l) WSLS and (m-p) GRIM. Panels (a,e,i,m) show the error-free cases (ε = µ = ν = 0), common to Models 0, 1 and 2; (b,f,j,n) the error-prone cases (ε = µ = ν = 0.1) of complete information (i.e., Model 0); (c,g,k,o) the error-prone cases (ε = µ = ν = 0.1) of incomplete information in Model 1; and (d,h,l,p) the error-prone cases (ε = µ = ν = 0.1) of incomplete information in Model 2. Horizontal and vertical axes represent the payoffs for one-sided defection, T, and one-sided cooperation, S, respectively. In each panel, Harmony Game (HG), Snowdrift Game (SG), Stag Hunt (SH) and Prisoner's Dilemma (PD) indicate the regions of these specific games. We set parameter β = 0.8.
Although the payoff conditions for the optimal policies are rather similar across the three models, the frequency of cooperation varies. Figure 4 shows the frequencies of cooperation in infinitely repeated games, with analytical results summarized in Table 3 and a comprehensive description of each panel presented in Appendix G. Hereafter, we focus on the cases of anticipation, since it is the most interesting policy class we wish to understand. In Model 0, an agent obeying anticipation cooperates with probability 1 − µ − 2ν when playing against a CTFT or WSLS opponent, with probability 1/2 when playing against a TFT opponent and with probability (µ + ν²(1 − 2µ))/(2µ + ν(1 − 2µ)) when playing against a GRIM opponent, where µ and ν are the probabilities of error in the opponent's state transition and the agent's action selection, respectively. To better understand the effects of errors, these cooperation frequencies are expanded to first order in the errors, except in the GRIM case. In all Model 0 cases, the error in the opponent's action selection, ε, is irrelevant because in Model 0, the agent does not need to infer the opponent's present state through the opponent's action. Interestingly, in Models 1 and 2, an agent obeying anticipation cooperates with a CTFT opponent with probability 1 − 2ν, regardless of the opponent's error µ. This phenomenon occurs because of the agent's interesting policy CCDC, which prescribes selecting action C if the agent self-selected action C in the previous stage; once the agent selects action C, the agent continues to try to select C until the agent fails to do so with a small probability ν. This can be interpreted as a commitment strategy that binds oneself to cooperation. In this case, the commitment strategy leads to better cooperation than that of the agent knowing the opponent's true state of mind; the former yields a cooperation frequency of 1 − 2ν and the latter 1 − µ − 2ν. A similar commitment strategy (i.e., CCDD) appears when the opponent obeys TFT; here, an agent obeying CCDD continues to try to select C or D once action C or D, respectively, is self-selected. In this case, however, partial cooperation is achieved in all models; the frequency of cooperation by the agent is 1/2. When the opponent obeys WSLS, the frequency of cooperation by the anticipating agent in Model 2 is the same as in Model 0, i.e., 1 − µ − 2ν. In contrast, in Model 1, the frequency of cooperation is reduced by 2ε to 1 − 2ε − µ − 2ν. When the opponent obeys GRIM, the frequency of cooperation in Models 1 and 2 is lower than that in Model 0 (see Table 3). This phenomenon again occurs due to a commitment-like aspect of the agent's policy, i.e., CDDD; once the agent selects action D, the agent continues to try to defect for a long time.

Discussion and Conclusion
In this paper, we analyzed two models of repeated games in which an agent uses a heuristic or additional information to infer an opponent's state of mind, i.e., the opponent's emotions or intentions, and then adopts a decision rule that maximizes the agent's expected long-term payoff. In Model 1, the agent believes that the opponent's action selection is deterministic in terms of the opponent's present state of mind, whereas in Model 2, the agent knows or correctly recognizes the opponent's state of mind at the previous stage. For all models, we found four classes of optimal policies. Compared to the null model (i.e., Model 0), in which the agent knows the opponent's present state of mind, the two models establish cooperation almost equivalently, except when playing against a GRIM opponent (see Table 3). In contrast to the reciprocator in the classical framework of the reaction norm, which reciprocates an opponent's previous action, we found the anticipator, which infers an opponent's present state and selects an action appropriately. Some of these anticipators show commitment-like behaviors; more specifically, once an anticipator selects an action, the anticipator repeatedly selects that action regardless of the opponent's behavior. Compared to Model 0, these commitment-like behaviors enhance cooperation with a CTFT opponent in Model 2 and diminish cooperation with a GRIM opponent in Models 1 and 2.
Why can the commitment-like behaviors be optimal? For example, after selecting action C against a CTFT opponent, regardless of whether the opponent was in state H or U at the previous stage, the opponent will very likely move to state H and select action C. Therefore, it is worthwhile to believe that after selecting action C, the opponent is in state H, and thus, it is good to select action C again. Next, it is again worthwhile to believe that the opponent is in state H and good to select action C, and so forth. In this way, if selecting an action always yields a belief under which selecting the same action is optimal, the result is commitment-like behavior. In our present study, particular opponent types (i.e., CTFT, TFT and GRIM) allow such self-sustaining action-belief chains, and this is why commitment-like behaviors emerge as optimal decision rules.
In general, our models depict repeated games in which the state changes stochastically. Repeated games with an observable state have been studied for decades in economics (see, e.g., [35,36]); however, if the state is unobservable, the problem becomes a belief MDP. In this case, Yamamoto showed that with some constraints, some combinations of decision rules and beliefs can form a sequential equilibrium in the limit of a fully long-sighted future discount, i.e., a folk theorem [21]. In our present work, we have not investigated equilibria, instead studying which decision rules are optimal against some representative finite-state machines and to what extent they cooperate. Even so, we can speculate on which decision rules form equilibria as follows.
In the error-free limit, the opponent's states and actions have a one-to-one relationship, i.e., H to C and U to D. Thus, the state transitions of a PFSM can be denoted as s_CH s_CU s_DH s_DU, where s_{a′s′} is the opponent's next state when the agent's previous action was a′ and the opponent's previous state was s′. Using this notation, the state transitions of GRIM and WSLS can be denoted by HUUU and HUUH, respectively. Given this, in s_{a′s′}, we can rewrite the opponent's present state s with its present action r and its previous state s′ with its previous action r′ by using the one-to-one correspondence between states and actions in the error-free limit. Moreover, from the opponent's viewpoint, a′ in s_{a′s′} can be rewritten as s′, which is the agent's pseudo-state; because of the one-to-one relationship, the agent appears as if the agent had a state in the eyes of the opponent. In short, we can rewrite s_{a′s′} as r_{r′a′} in Model 1 and as r_{r′s′} in Model 2, where we flip the order of the subscripts. This rewriting leads HUUU and HUUH to CDDD and CDDC, respectively, which are part of the optimal decision rules when playing against GRIM and WSLS; thus, GRIM and WSLS can be optimal when playing against themselves, depending on the payoff structure. The above interpretation suggests that some finite-state machines, including GRIM and WSLS, would form equilibria in which a machine and a corresponding decision rule, which infers the machine's state of mind and maximizes the payoff when playing against the machine, behave in the same manner.
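The relabeling described above can be mechanized. The sketch below (our own encoding) maps a PFSM's transition table s_{a′s′} to a Model 1 decision rule r_{r′a′} by replacing states with actions (H → C, U → D) and flipping the subscript order.

```python
# Error-free relabeling: PFSM transition table -> Model 1 decision rule.

SIGMA = {'H': 'C', 'U': 'D'}          # one-to-one map from states to actions

def pfsm_to_rule(transitions):
    """transitions: dict (agent prev action a', opp prev state s') -> next state.
    Returns the four-letter rule a_CC a_CD a_DC a_DD (Model 1 indexing)."""
    rule = {}
    for (a_prev, s_prev), s_next in transitions.items():
        r_prev = SIGMA[s_prev]               # opponent's previous action
        rule[(r_prev, a_prev)] = SIGMA[s_next]   # flipped subscript order
    return ''.join(rule[(x, y)] for x in 'CD' for y in 'CD')

GRIM = {('C', 'H'): 'H', ('C', 'U'): 'U', ('D', 'H'): 'U', ('D', 'U'): 'U'}  # HUUU
WSLS = {('C', 'H'): 'H', ('C', 'U'): 'U', ('D', 'H'): 'U', ('D', 'U'): 'H'}  # HUUH
```

Running this mapping reproduces the correspondence stated in the text: GRIM (HUUU) maps to CDDD and WSLS (HUUH) to CDDC.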
Our models assume an approximate heuristic or the ability to use external information to analytically solve the belief MDP problem, which can also be numerically solved using the Partially Observable Markov Decision Process (POMDP) framework [32]. Kandori and Obara introduced a general framework for applying the POMDP to repeated games of private monitoring [20]. They assumed that the actions of players are not observable, but rather that players observe a stochastic signal that informs them about their actions. In contrast, we assumed that the actions of players are perfectly observable, but the states of players are not. Kandori and Obara showed that in an example of PD with a fixed payoff structure, grim trigger and unconditional defection are equilibrium decision rules depending on initial beliefs. We showed that in PD, the CDDD and DDDD decision rules in Models 1 and 2 are optimal against a GRIM opponent in a broad region of the payoff space, suggesting that their POMDP approach and our approach yield similar results if the opponent is sufficiently close to some representative finite-state machines.
Nowak, Sigmund and El-Sedy performed an exhaustive analysis of evolutionary dynamics in which two-state automata play 2 × 2 repeated games [37]. The two-state automata used in their study are the same as the PFSMs used in our present study if we set ε = 0 in the PFSMs, i.e., if we consider that actions selected by a PFSM completely correspond with its states. Thus, their automata do not have unobservable states. They comprehensively studied average payoffs for all combinations of plays between the two-state automata in the noise-free limit. Conversely, we studied optimal policies when playing against several major two-state PFSMs that have unobservable states, using simplified belief calculations.
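The noise-free pairwise analysis can be reproduced numerically: two deterministic automata must eventually revisit a joint state, so the long-run average payoff is the mean payoff over the resulting cycle. A sketch (our own, with illustrative payoff values R = 1, S = −0.5, T = 1.5, P = 0):

```python
# Two deterministic two-state automata playing each other in the noise-free limit.
# Each automaton transitions on the *other* player's action and its own state.

def average_payoff(trans1, trans2, s1="H", s2="H", R=1.0, S=-0.5, T=1.5, P=0.0):
    payoff = {("C","C"): R, ("C","D"): S, ("D","C"): T, ("D","D"): P}
    act = {"H": "C", "U": "D"}               # states map one-to-one to actions
    seen, history = {}, []
    while (s1, s2) not in seen:
        seen[(s1, s2)] = len(history)
        a1, a2 = act[s1], act[s2]
        history.append(payoff[(a1, a2)])
        s1, s2 = trans1[(a2, s1)], trans2[(a1, s2)]
    start = seen[(s1, s2)]                   # the joint-state cycle begins here
    cycle = history[start:]
    return sum(cycle) / len(cycle)

GRIM = {("C","H"):"H", ("C","U"):"U", ("D","H"):"U", ("D","U"):"U"}
WSLS = {("C","H"):"H", ("C","U"):"U", ("D","H"):"U", ("D","U"):"H"}
print(average_payoff(GRIM, GRIM))  # mutual cooperation: 1.0
print(average_payoff(WSLS, WSLS))  # mutual cooperation: 1.0
```

Since the joint state space is finite and the dynamics are deterministic, the cycle average equals the limit of the time-averaged payoff.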
In the context of the evolution of cooperation, a few studies have examined the role of state-of-mind recognition. Anh, Pereira and Santos studied a finite-population model of evolutionary game dynamics in which they added a strategy of Intention Recognition (IR) to the classical repeated PD framework [18]. In their model, the IR player cooperates only with an opponent that has an intention to cooperate, inferred by calculating its posterior probability using information from previous interactions. They showed that the IR strategy, as well as TFT and WSLS, can prevail in the finite population and promote cooperation. There are two major differences between their model and ours. First, their IR strategists assume that an individual has a fixed intention either to cooperate or to defect, meaning that their IR strategy only handles one-state machines that always intend to do the same thing (e.g., unconditional cooperators and unconditional defectors). In contrast, our model can potentially handle any multiple-state machine that intends to do different things depending on the context (e.g., TFT and WSLS). Second, they examined the evolutionary dynamics of their IR strategy, whereas we examined the state-of-mind recognizer's static cooperative performance when using the optimal decision rule against an opponent.
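To make the one-state contrast concrete, intention recognition over a fixed intention reduces to a Bernoulli Bayes update. The following sketch is our own simplification in the spirit of the IR strategy, not Anh et al.'s exact formulation; the action-error rate eps and the uniform prior are illustrative assumptions.

```python
# The opponent either always intends C or always intends D, but acts with
# error eps; the posterior over the fixed intention is updated from the
# observed action history.

def posterior_cooperator(actions, prior=0.5, eps=0.1):
    pC, pD = prior, 1 - prior
    for r in actions:
        pC *= (1 - eps) if r == "C" else eps    # likelihood under "intends C"
        pD *= eps if r == "C" else (1 - eps)    # likelihood under "intends D"
    return pC / (pC + pD)

print(round(posterior_cooperator("CCD"), 3))  # -> 0.9
```

Because the intention never changes, no state-transition model is needed; this is exactly what restricts the approach to one-state machines, whereas multi-state opponents such as TFT or WSLS require tracking state transitions as well.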
Our present work is just a first step toward understanding the role of state-of-mind recognition in game-theoretic situations, and further studies are needed. For example, as stated above, an equilibrium established between a machine that has a state and a decision rule that cares about the machine's state could be called a 'theory of mind' equilibrium; a thorough search for such equilibria is necessary. Moreover, although we assume it in our present work, it is unlikely that a player knows an opponent's parameters, i.e., φ and w. An analysis of models in which a player must infer an opponent's parameters as well as its state would be more realistic and practical. Further, our present study is restricted to a static analysis; the co-evolution of the decision rule and state-of-mind recognition in evolutionary game dynamics has yet to be investigated.

Notation:
- Agent's belief regarding the opponent's state at stage t, where s is the opponent's state at stage t
- V^(i): value function in Model i (= 0, 1 or 2)
- π^(i): agent's optimal policy in Model i (= 0, 1 or 2)
- p^(i)(a, s): stationary joint distribution of the agent's action a and the opponent's state s in Model i (= 0, 1 or 2)
- Frequency of cooperation by the agent in Model i (= 0, 1 or 2)

Table 5. Optimal policies and their conditions for optimality (error-free limit).

Table 5 lists, for each opponent, the optimal policy and its condition for optimality. Unconditional cooperation and anticipation (CD and its variants, depending on opponent type) are optimal in HG and SH, respectively. With long-sighted future discounts, the regions in which either of them is optimal (i.e., the blue or green regions) broaden, and each can be optimal in SG or PD.
If the opponent obeys CTFT (see the CTFT row of Table 5 and Figure 3a for the error-free limit), unconditional cooperation (CC or CCCC) can be optimal in HG and some SG; further, anticipation (CD or CCDC) can be optimal in SH and some PD. Numerically obtained optimal policies in the error-prone case are shown in Figure 3b-d (ε, µ, ν = 0.1). With a small error, the regions in which the policies are optimal change slightly from the error-free case. With a fully long-sighted future discount (i.e., β → 1), all four policies that can be optimal when β < 1 remain optimal (see the CTFT row, β → 1 column of Table 5).
If the opponent obeys TFT (see the TFT row of Table 5 and Figure 3e for the error-free limit), unconditional cooperation (CC or CCCC) can be optimal in all four games (i.e., HG, some SG, SH and some PD), while asocial-anticipation (DC or DDCC) can be optimal in some SH and some PD; the region in which anticipation is optimal falls outside of the drawing area (i.e., −1 < S < 1 and 0 < T < 2) in Figure 3e. Numerically obtained optimal policies in the error-prone case are shown in Figure 3f-h (ε, µ, ν = 0.1). With a fully long-sighted future discount (i.e., β → 1), among the four policies that can be optimal when β < 1, only unconditional cooperation (CC or CCCC) and asocial-anticipation (DC or DDCC) can be optimal (see the TFT row, β → 1 column of Table 5).
If the opponent obeys WSLS (see the WSLS row of Table 5 and Figure 3i for the error-free limit), unconditional cooperation (CC or CCCC) can be optimal in some HG and some SG, and anticipation (CD or CDDC) can be optimal in all four games (some HG, some SG, SH and some PD). Numerically obtained optimal policies in the error-prone case are shown in Figure 3j-l (ε, µ, ν = 0.1). With a fully long-sighted future discount (i.e., β → 1), among the four policies that are optimal when β < 1, only anticipation (CD or CDDC) and unconditional defection (DD or DDDD) can be optimal (see the WSLS row, β → 1 column of Table 5).
If the opponent obeys GRIM (see the GRIM row of Table 5 and Figure 3m for the error-free limit), unconditional cooperation (CC or CCCC) can be optimal in some HG and some SG, and anticipation (CD or CDDD) can be optimal in SH and PD. Numerically obtained optimal policies in the error-prone case are shown in Figure 3n-p (ε, µ, ν = 0.1). With a fully long-sighted future discount (i.e., β → 1), among the four policies that are optimal when β < 1, only unconditional cooperation (CC or CCCC) and anticipation (CD or CDDD) can be optimal (see the GRIM row, β → 1 column of Table 5).
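Entries like these in Table 5 can be checked by value iteration on the error-free Model 0 Bellman equation, V(s) = max_a [f(a, σ(s)) + βV(ψ(a, s))]. The sketch below is our own, with illustrative Prisoner's Dilemma payoffs (R = 1, S = −0.5, T = 1.5, P = 0) and β = 0.9; it recovers anticipation (CD) as optimal against GRIM.

```python
# Value iteration for the error-free Model 0 problem against a deterministic
# two-state PFSM: the opponent's state is observed, H plays C and U plays D.

def optimal_policy(trans, R=1.0, S=-0.5, T=1.5, P=0.0, beta=0.9, iters=2000):
    f = {("C","C"): R, ("C","D"): S, ("D","C"): T, ("D","D"): P}
    sigma = {"H": "C", "U": "D"}            # error-free action per state
    V = {"H": 0.0, "U": 0.0}
    for _ in range(iters):                  # Bellman backup until convergence
        V = {s: max(f[(a, sigma[s])] + beta * V[trans[(a, s)]]
                    for a in ("C", "D")) for s in ("H", "U")}
    # Greedy action in each opponent state, written as a two-letter policy
    return "".join(max(("C", "D"),
                       key=lambda a: f[(a, sigma[s])] + beta * V[trans[(a, s)]])
                   for s in ("H", "U"))

GRIM = {("C","H"):"H", ("C","U"):"U", ("D","H"):"U", ("D","U"):"U"}
print(optimal_policy(GRIM))  # PD payoffs: anticipation "CD" is optimal vs. GRIM
```

With these payoffs, cooperating while GRIM is in H preserves the payoff stream R/(1 − β), while defecting triggers permanent punishment; once GRIM is in U, defection dominates, giving the policy CD.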

Appendix E. Isomorphism of the BOEs in the Error-Free Limit
In the error-free limit, an opponent's action selection and state transition are deterministic; i.e., using maps σ and ψ, we can write r_t = σ(s_t) and s_{t+1} = ψ(a_t, s_t) for any stage t. Thus, Equation (5) becomes:

V^(0)(s) = max_a [f(a, σ(s)) + βV^(0)(ψ(a, s))]  (35)

Similarly, Equations (10) and (12) become:

V^(1)(a′, σ(s′)) = max_a [f(a, σ(s)) + βV^(1)(a, σ(s))]  (36)

and:

V^(2)(a′, s′) = max_a [f(a, σ(s)) + βV^(2)(a, s)]  (37)

respectively, where s = ψ(a′, s′). Each right-hand side of Equations (36) and (37) depends only on s; thus, using some function v, we can rewrite V^(1)(a′, σ(s′)) and V^(2)(a′, s′) as v(s) = v(ψ(a′, s′)). This means that the optimal policies obtained from Equations (36) and (37) are isomorphic to those obtained from Equation (35), in the sense that corresponding optimal policies have an identical condition for optimality. As an example, policy CC in Model 0 and policy CCCC in Models 1 and 2 are optimal against a CTFT opponent under identical conditions, S > 0 ∧ T < 1 + β(1 − S) (see Table 5). Because Equation (35) yields at most four optimal policies (i.e., CC, CD, DC or DD), Equations (36) and (37) also yield at most four optimal policies.

as shown in Table 3. Here, g_0's asymptotic behavior in ν and µ depends on the assumed order of the errors ν and µ. If the error in the agent's action is far less than the error in the opponent's state transition (i.e., ν → 0), then we obtain g_0 → 1/2. If the error in the opponent's state transition is far less than the error in the agent's action (i.e., µ → 0), then we obtain g_0 → ν. If the two errors have the same order (i.e., µ = cν for some constant c and ν → 0), then we obtain g_0 → c/(2c + 1). For anticipation in Models 1 and 2, once the CDDD agent selects action D, the agent continues to try to select D, which is why the CDDD agent's defection is incurable; g_1 and g_2 take the forms given in Equations (39) and (40). Whatever the fraction terms in g_1 and g_2 are, g_1 and g_2 are O(ν) if ε, µ and ν are finite. For asocial-anticipation (DC or DCCC), a mechanism opposite to the above works, and the agent is mostly cooperative.
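The isomorphism can also be verified numerically: iterating Equations (35) and (37) for a deterministic opponent, the Model 2 value on (a′, s′) coincides with the Model 0 value at s = ψ(a′, s′). A sketch under illustrative payoffs and β = 0.9 (our parameter choices, not the paper's):

```python
# Value-iterate Equation (35) over states s and Equation (37) over pairs (a', s'),
# then check V2(a', s') == V0(psi(a', s')) for every pair.

def values(trans, R=1.0, S=-0.5, T=1.5, P=0.0, beta=0.9, iters=2000):
    f = {("C","C"): R, ("C","D"): S, ("D","C"): T, ("D","D"): P}
    sigma = {"H": "C", "U": "D"}
    V0 = {"H": 0.0, "U": 0.0}
    for _ in range(iters):                 # Equation (35)
        V0 = {s: max(f[(a, sigma[s])] + beta * V0[trans[(a, s)]]
                     for a in ("C", "D")) for s in ("H", "U")}
    keys = [(a, s) for a in ("C", "D") for s in ("H", "U")]
    V2 = {k: 0.0 for k in keys}
    for _ in range(iters):                 # Equation (37), with s = psi(a', s')
        V2 = {(ap, sp): max(f[(a, sigma[trans[(ap, sp)]])]
                            + beta * V2[(a, trans[(ap, sp)])]
                            for a in ("C", "D")) for (ap, sp) in keys}
    return V0, V2

WSLS = {("C","H"):"H", ("C","U"):"U", ("D","H"):"U", ("D","U"):"H"}
V0, V2 = values(WSLS)
assert all(abs(V2[(ap, sp)] - V0[WSLS[(ap, sp)]]) < 1e-6 for (ap, sp) in V2)
print("V2(a', s') == V0(psi(a', s')) for all (a', s')")
```

Because the right-hand side of Equation (37) depends only on s = ψ(a′, s′), the two fixed points must agree, which is exactly what the assertion confirms.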

Figure 2. Different policies of the models. Depending on the given model, the agent's decision depends on different information, as indicated by orange arrows. At the present stage, in which the two players are deciding actions a and r, (a) policy π^(0) depends on the opponent's present state s, (b) policy π^(1) depends on the agent's previous action a′ and the opponent's previous action r′ and (c) policy π^(2) depends on the agent's previous action a′ and the opponent's previous state s′. Solid circles represent the opponent's known states, whereas dotted circles represent the opponent's unknown states. Black arrows represent probabilistic dependencies of the opponent's decisions and state transitions.

Table 2. Available information about the opponent. Columns: Model; Opponent's Model (φ, w); Opponent's State (Previous, Present).
In Model 1, because the opponent can mistakenly select an action opposite to what its state dictates, the agent's guess regarding the opponent's previous state can fail. This misunderstanding reduces the agent's cooperation if the opponent obeys WSLS;

Table 3. Frequencies of cooperation in infinitely repeated games.