Modeling Theory of Mind in Dyadic Games Using Adaptive Feedback Control

Freire, Ismael T.; Arsiwalla, Xerxes D.; Puigbò, Jordi-Ysard; Verschure, Paul

doi:10.3390/info14080441

¹ Donders Institute for Brain, Cognition and Behaviour, Radboud University, 6525 AJ Nijmegen, The Netherlands

² Department of Information and Communication Technologies, Universitat Pompeu Fabra, 08018 Barcelona, Spain

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Information 2023, 14(8), 441; https://doi.org/10.3390/info14080441

Submission received: 30 June 2023 / Revised: 31 July 2023 / Accepted: 1 August 2023 / Published: 4 August 2023

(This article belongs to the Special Issue Intelligent Agent and Multi-Agent System)

Abstract

A major challenge in cognitive science and AI has been to understand how intelligent autonomous agents might acquire and predict the behavioral and mental states of other agents in the course of complex social interactions. How does such an agent model the goals, beliefs, and actions of other agents it interacts with? What are the computational principles to model a Theory of Mind (ToM)? Deep learning approaches to address these questions fall short of a better understanding of the problem. In part, this is due to the black-box nature of deep networks, wherein computational mechanisms of ToM are not readily revealed. Here, we consider alternative hypotheses seeking to model how the brain might realize a ToM. In particular, we propose embodied and situated agent models based on distributed adaptive control theory to predict the actions of other agents in five different game-theoretic tasks (Harmony Game, Hawk-Dove, Stag Hunt, Prisoner’s Dilemma, and Battle of the Exes). Our multi-layer control models implement top-down predictions from adaptive to reactive layers of control and bottom-up error feedback from reactive to adaptive layers. We test cooperative and competitive strategies among seven different agent models (cooperative, greedy, tit-for-tat, reinforcement-based, rational, predictive, and internal agents). We show that, compared to pure reinforcement-based strategies, probabilistic learning agents modeled on rational, predictive, and internal phenotypes perform better in game-theoretic metrics across tasks. The outlined autonomous multi-agent models might capture systems-level processes underlying a ToM and suggest architectural principles of ToM from a control-theoretic perspective.

Keywords: theory of mind; multi-agent systems; game theory; cognitive architectures; reinforcement learning

1. Introduction

How do autonomous social agents model the goals, beliefs, and actions of other agents they interact with in complex social environments? This has long been a central question in cognitive science and philosophy of mind. It forms the basis of what is known as Theory of Mind (ToM), a long-standing problem in cognitive science concerned with how cognitive agents form predictions of mental and behavioral states of other agents in the course of social interactions [1,2]. The precise mechanisms by which the brain achieves this capability are not entirely understood. Furthermore, how a ToM can be embodied in artificial agents is very relevant for the future course of artificial intelligence (AI), especially with respect to human-machine interactions. What can be said concerning architectural and computational principles necessary for a Theory of Mind? As a small step in that direction, we propose and validate control-based cognitive architectures to predict actions and models of other agents in five different game theoretic tasks.

It is known that humans also use their ToM to attribute mental states, beliefs, and intentions to inanimate objects [3], suggesting that the same mechanisms governing ToM may be integral to other aspects of human cognition. We argue that in order to understand both how ToM is implemented in biological brains and how it may be modeled in artificial agents, one must first identify its architectural principles and constraints from biology.

Several notable approaches addressing various aspects of this problem already exist, particularly those based on artificial neural networks [4,5,6]. Recent work in [7] on Machine Theory of Mind is one such example. This work proposes that some degree of theory of mind can be autonomously obtained from Deep Reinforcement Learning (DRL). This deep learning approach can be made symmetric by allowing the agents to observe each other [8] and training them to develop a mutual Theory of Mind. However, the use of Black-Box Optimization (BBO) algorithms in DRL hinders the understanding one can obtain at the mechanistic level. Because BBO algorithms are able to approximate any complex function (as suggested for Long Short-Term Memory (LSTM)-based neural networks [9]), one cannot use them to decipher specific mechanisms that may underlie a ToM in these systems.

Another interesting approach to modeling ToM follows the work of [10,11,12,13,14] on Bayesian Theory of Mind. These methods are cognitively-inspired and suggest the existence of a “psychology engine” in cognitive agents to process ToM computations. Nevertheless, the challenge for this approach remains in explaining how the computational cost of running these models might be implemented in biological substrates of embodied and situated agents.

ToM can also be modeled as inverse reinforcement learning (IRL), a popular machine learning method designed to operate within the Markov Decision Process (MDP) framework [15]. This approach, known as ‘ToM as Inverse Reinforcement Learning’, aims at predicting other agents’ actions by simulating an RL model with the hypothesized agent’s beliefs and desires, while mental-state inference is achieved by inverting this model [16]. This approach to ToM has seen a resurgence of interest in the machine learning community [17,18]. However, one of the key challenges for IRL-based ToM models is to improve their ability to handle vast and complex state spaces, a requirement for real-world success [18]. Moreover, while most IRL approaches assume a simplified form of human rationality, it is important to note that humans often deviate from these assumptions [19,20].

A third approach to this problem follows from work on autonomous multi-agent models (see [21] for a recent survey). This approach has its roots in statistical machine learning theory and robotics. It includes agent models capable of policy reconstruction, type-classification, planning, recursive reasoning, and group modeling. It focuses on multi-agent communication and cooperation tasks where the ToM agent not only observes but also acts in the environment [22,23]. In a sense, this is closest to the approach we take in this work, even though our models are grounded in cognitive control systems. Open challenges in the field of autonomous multi-agent systems include modeling fully embodied agents that operate with only partial observability of their environment and are flexibly able to learn across tasks (including meta-learning) when interacting with multiple types of other agents.

The cognitive agent models we propose here advance earlier work on Control-based Reinforcement Learning (CRL) [24], where we studied the formation of social conventions in the Battle of the Exes game. The CRL model implements a feedback control loop handling the agent’s reactive behaviors (pre-wired reflexes), along with an adaptive layer that uses reinforcement learning to maximize long-term reward. We showed that this model was able to simulate human behavioral data in the above-mentioned coordination game.

The main contribution of the current paper is to advance real-time learning and control strategies to model agents with specific phenotypes that can learn to either model or predict the opposing agents’ actions. Using top-down predictions from the adaptive to reactive control layers and bottom-up error feedback from reactive to adaptive layers, we test cooperative and competitive strategies for seven different multi-agent behavioral models. The purpose of this study is to understand how the architectural assumptions behind each of these models impact agent performance across standard game-theoretic tasks and what this implies for the development of autonomous intelligent systems.

2. Methods

2.1. Game Theoretic Tasks

Benchmarks inspired by game theory have become standard in the multi-agent reinforcement learning literature [4,5,7,25]. However, most of the work developed in this direction presents models that are tested in one single task or environment [24,26,27,28,29,30,31], at best two [32,33]. This raises a fundamental question about how these models generalize to deal with a more general or diverse set of problems. So far, this approach does not readily enable one to extract principles and mechanisms or unravel the dynamics underlying human cooperation and social decision-making.

In this work, we want to go a step further and propose a five-task benchmark for predictive models based on classic normal-form games extracted from game theory: the Prisoner’s Dilemma, the Harmony Game, the Hawk-Dove, the Stag Hunt, and the Battle of the Exes.

In their normal form, games are depicted in a matrix that contains all the possible combinations of actions that players can choose, along with their respective rewards (Table 1). The most common among these games are dyadic games, known as social dilemmas, in which players have to choose between two actions: cooperate or defect. One action (cooperate) is more generous and renders a good amount of reward to each player if both choose it, but it gives very poor results if the other player decides to defect. On the other hand, the second action (defect) provides a significant individual reward if it is taken alone, but a very small one if both choose it.

Although simple in nature, these dyadic games are able to model key elements of social interaction, such as the tension between the benefit of cooperative action and the risk (or temptation) of free-riding. This feature can be described in a general form, such as:

Table 1. A social dilemma in matrix-form.

	Cooperate	Defect
Cooperate	R, R	S, T
Defect	T, S	P, P

This matrix represents the outcomes of all possible combinations of actions between the row player and the column player. R stands for “reward for mutual cooperation”; T for “temptation of defecting”; S for the “sucker’s payoff for non-reciprocated cooperation”; and P for “punishment for mutual defection”. By manipulating the relationship between the values of this matrix, many different situations can be obtained that vary in terms of what would be an optimal solution (or Nash equilibria [34]).

The Prisoner’s Dilemma (Table 2) represents the grim situation in which the temptation of defecting (T) is more rewarding than mutual cooperation (R), and the punishment for mutual defection (P) is still more beneficial than a failed attempt of cooperation (S). This relationship among the possible outcomes can be stated as $T > R > P > S$. Mutual defection is the only pure Nash equilibrium in this game since there is no possibility for any player to be better off by individually changing its own strategy.

In the Stag Hunt (Table 3), we face a context in which mutual cooperation (R) gives better results than individual defection (T), but at the same time, a failed cooperation (S) is worse than the punishment for mutual defection (P). In this case ($R > T > P > S$), there are two pure Nash equilibria: mutual cooperation and mutual defection.

The Hawk-Dove game (Table 4) presents a scenario in which temptation (T) is more rewarding than cooperation (R), but a mutual defection (P) is less desirable than non-reciprocated cooperation (S). For a relationship such as $T > R > S > P$, there are three Nash equilibria: two pure anti-coordination equilibria, in which each player always chooses the opposite action of their opponent, and one mixed equilibrium, in which each player probabilistically chooses between the two pure strategies.

The fourth type of social dilemma is the Harmony Game (Table 5). In this case ($R > T > S > P$), the game has only one pure equilibrium, pure cooperation, since mutual cooperation (R) renders better outcomes than the temptation to defect (T), and also the penalty for failing to cooperate (S) is less than the punishment for mutual defection (P).

Finally, as the last game, we will introduce a coordination game, the Battle of the Exes [35,36]. In this type of game, the main goal is to achieve coordination between two players (either congruent coordination on the same action or incongruent coordination upon choosing different actions) since the failure to do so is heavily penalized (see Table 6). Following the previous nomenclature applied in the social dilemmas, we could define this game as $T > S$; $R = P = 0$. This game has two pure-dominance equilibria, in which one player chooses the more rewarding action and the other the low rewarding action, and a turn-taking equilibrium, in which players alternate over time in choosing the more rewarding action.
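The pure equilibria listed for the five games can be checked mechanically from the payoff orderings. The following is a minimal sketch, assuming the matrix layout of Table 1; the numeric payoffs are hypothetical values chosen only to satisfy each ordering (they are not the values of Tables 2 to 6), and the helper names `payoff_table` and `pure_nash_equilibria` are introduced here for illustration.

```python
from itertools import product

ACTIONS = ("cooperate", "defect")


def payoff_table(R, S, T, P):
    """Map (row action, column action) to (row reward, column reward), as in Table 1."""
    return {
        ("cooperate", "cooperate"): (R, R),
        ("cooperate", "defect"): (S, T),
        ("defect", "cooperate"): (T, S),
        ("defect", "defect"): (P, P),
    }


def pure_nash_equilibria(R, S, T, P):
    """Return the action pairs from which neither player gains by deviating alone."""
    table = payoff_table(R, S, T, P)
    equilibria = []
    for row, col in product(ACTIONS, repeat=2):
        row_reward, col_reward = table[(row, col)]
        # A deviation is checked for one player while the other's action is held fixed.
        row_ok = all(table[(alt, col)][0] <= row_reward for alt in ACTIONS)
        col_ok = all(table[(row, alt)][1] <= col_reward for alt in ACTIONS)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria


# Hypothetical payoff values, chosen only to satisfy each ordering.
GAMES = {
    "Prisoner's Dilemma": dict(T=5, R=3, P=1, S=0),  # T > R > P > S
    "Stag Hunt": dict(R=5, T=3, P=1, S=0),           # R > T > P > S
    "Hawk-Dove": dict(T=5, R=3, S=1, P=0),           # T > R > S > P
    "Harmony Game": dict(R=5, T=3, S=1, P=0),        # R > T > S > P
    "Battle of the Exes": dict(T=4, S=1, R=0, P=0),  # T > S; R = P = 0
}

for name, values in GAMES.items():
    print(name, pure_nash_equilibria(**values))
```

The sketch only enumerates pure equilibria; the mixed equilibrium of the Hawk-Dove game and the turn-taking equilibrium of the Battle of the Exes are outside its scope.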

This selection of games provides enough variability to test how learning agents can perform across different contexts, so we avoid problems derived from over-fitting on a specific payoff distribution or related to the possibility of a model to exploit/capitalize on certain features of a game that could not have been predicted beforehand. Moreover, a similar selection of games has been tested in human experiments, proving to be sufficient to classify a reduced set of behavioral phenotypes across games in human players [37,38]. Using the above games as benchmarks, below we describe the control-theoretic framework and seven agent models that capture phenotypes relevant for testing ToM capabilities across games.

2.2. Control-Based Reinforcement Learning

Our starting point is the Control-based Reinforcement Learning (CRL) model presented in previous work [24]. The CRL is a biologically grounded cognitive model composed of two control layers (Reactive and Adaptive, see Figure 1), and based on the principles of the Distributed Adaptive Control (DAC) theory [39]. The Reactive layer represents the agent’s pre-wired reflexes and serves for real-time control of sensorimotor contingencies. The Adaptive layer endows the agent with learning abilities that maximize long-term reward by choosing which action to perform in each round of the game. This layered structure allows for top-down and bottom-up interactions between the two layers [40], resulting in an optimal control at different time-scales: within each round of play and across rounds [24].

The Reactive layer is inspired by Valentino Braitenberg’s Vehicles [41] and implements the intrinsic mechanical behaviors that are caused by a direct mapping between the sensors and the motors of the agent. The two reactive behaviors are based on a crossed excitatory connection and a direct inhibitory connection between the reward sensors and the two motors. This results in an approaching behavior in which the agents turn towards a reward location increasingly faster the closer they detect it. Since the agents are equipped with two sets of sensors specifically tuned for detecting each reward location, they also have two instances of this reactive behavior, one for the ‘cooperative’ and one for the ‘individual’ reward (see Figure 1, green box).

The Adaptive layer is in charge of prediction, learning, and decision-making. Each of the agent models described in the next section instantiates a different Adaptive layer. Its main role is to learn from previous experience and to decide at the beginning of each round which action the agent will take. When playing normal-form games, the Adaptive layer receives as inputs the state of the environment, which corresponds to the outcome of the last round of the game, and the reward obtained because of that final state. This information is used to produce an action as the output. For all experiments conducted in this paper, the possible states s can be either ‘R’, ‘S’, ‘T’ or ‘P’ (see Table 1), and they determine what reward the agent will obtain based on the specific payoff matrix of the game being played, as specified in Tables 1 to 6. The possible actions a are two: “cooperate” or “defect”.

Depending on the action that the Adaptive layer selects, there is a top-down selective inhibition that affects the reactive behaviors of the Reactive layer. If the action selected is “cooperate”, the “approach defect location” reactive behavior is inhibited, thus focusing the agent’s attention only on the cooperative reward and on the other agent. Conversely, if the action selected is “defect”, the reactive behavior inhibited will be “approach cooperate location”. This mechanism aims to mimic how biological systems execute top-down control over a hierarchy of different control structures [42,43,44].
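The interplay between the Braitenberg-style approach behaviors and the top-down inhibition can be illustrated with a small sketch. It assumes a differential-drive agent with one left and one right sensor per reward location; the function names, the weights `w_excite` and `w_inhibit`, and the base speed are hypothetical, and the sensor pair that tracks the other agent is left out.

```python
def approach_behavior(left_sensor, right_sensor,
                      base_speed=1.0, w_excite=1.0, w_inhibit=0.5):
    """Braitenberg-style approach: crossed excitation plus direct inhibition.

    Each sensor excites the motor on the opposite side and inhibits the motor
    on its own side, so the agent turns towards the stronger reading.
    All weights are illustrative assumptions.
    """
    left_motor = base_speed + w_excite * right_sensor - w_inhibit * left_sensor
    right_motor = base_speed + w_excite * left_sensor - w_inhibit * right_sensor
    return left_motor, right_motor


def reactive_layer(sensors, selected_action):
    """Run only the approach behavior that the Adaptive layer has not inhibited.

    `sensors` maps each reward location ("cooperate" or "defect") to a
    (left, right) pair of sensor readings.
    """
    # Top-down selective inhibition: the behavior for the other location is
    # suppressed, so its sensor readings have no effect on the motors.
    left_sensor, right_sensor = sensors[selected_action]
    return approach_behavior(left_sensor, right_sensor)


# Cooperative reward detected mostly on the left, individual reward on the right.
readings = {"cooperate": (0.8, 0.2), "defect": (0.1, 0.9)}
print(reactive_layer(readings, "cooperate"))  # right wheel faster: turn left
print(reactive_layer(readings, "defect"))     # left wheel faster: turn right
```

Because the sensor readings grow as the agent gets closer, the difference between the two motor commands grows as well, which matches the description of an approach that turns faster near the reward.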

The error monitoring function works in real-time as a bottom-up mechanism that signals an error function from the Reactive layer sensors to the Adaptive layer decision-making module. The error signal is only triggered when the agent detects an inconsistency between its initial prediction of the opponent’s action and the real-time data obtained by its sensors. If a prediction error occurs, the error monitoring function will update the current state of the agent, and this will make the decision-making module output a new action. Along with the top-down inhibitory control, this module is inspired by evidence from cognitive science about the role of bottom-up sensory stimuli in generating prediction errors [42,45,46].
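As a rough illustration of this bottom-up pathway, the sketch below re-runs a decision rule only when the observed opponent action contradicts the initial prediction. The function names and the `decide` callable are hypothetical, and replacing the predicted action with the observed one is an assumed reading of "updating the current state".

```python
def prediction_error(predicted_action, observed_action):
    """True when a real-time observation contradicts the initial prediction."""
    return observed_action is not None and observed_action != predicted_action


def revise_action(decide, predicted_action, observed_action):
    """Re-run the decision module only if a prediction error is detected.

    `decide` is a hypothetical callable mapping the opponent's (predicted or
    observed) action to the agent's own action. Returns (action, error_flag).
    """
    if prediction_error(predicted_action, observed_action):
        # Bottom-up error: the observed action replaces the prediction.
        return decide(observed_action), True
    return decide(predicted_action), False


def anti_coordinate(opponent_action):
    """Toy decision rule: head for the location the opponent is not choosing."""
    return "cooperate" if opponent_action == "defect" else "defect"


# The agent predicted "cooperate" but its sensors see the opponent defecting.
print(revise_action(anti_coordinate, "cooperate", "defect"))
```

When no observation is available yet (`observed_action` is `None`), the sketch keeps the action derived from the initial prediction.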

2.3. Agent Models

In this section, we describe the four learning models that we will compare in this study. All of them use the classic TD-learning algorithm as the fundamental component of their cognitive architectures. The different combinations of this element with other functions allow these models to learn to predict their opponent’s actions and adapt their behavior accordingly.

2.3.1. TD-Learning Model

The TD-learning model (see Figure 2) uses an Actor-Critic version of the Temporal-Difference (TD) learning algorithm [47] for maximizing long-term rewards (see [24] for a detailed description of the implementation).

In brief, for each observed state $s \in S$, the TD-learning algorithm selects an action $a \in A$ according to a given policy $P(a = a_{t} \mid s = s_{t-1})$. Once a round of play is finished, the reward $r(s_{t})$ obtained by the agent will update the TD-error signal $e$, following:

$$e(s_{t}) := r(s_{t}) + \gamma V_{\Pi}(s_{t}) - V_{\Pi}(s_{t-1})$$

where $\gamma$ is a discount factor set to $0.99$, and $V_{\Pi}(s_{t}) = \gamma r(s_{t})$ is the Critic. Finally, the policy (or Actor) will be updated following $\Pi(a_{t}, s_{t-1}) := \Pi(a_{t}, s_{t-1}) + \delta e(s_{t-1})$, where $\delta$ is a learning rate that is set to $0.15$. The experiments conducted in this paper use the same parameter values as [24], where this parameter configuration showed the best fit to human behavioral data when playing the Battle of the Exes under similar experimental conditions.
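A tabular version of this update can be sketched as follows. The discount factor and learning rate are the values stated above; the softmax action selection, the incremental Critic update, and the class and method names are assumptions made for illustration, since the detailed implementation is given in [24] rather than here.

```python
import math
import random

GAMMA = 0.99  # discount factor, as stated in the text
DELTA = 0.15  # learning rate, as stated in the text
STATES = ("R", "S", "T", "P")
ACTIONS = ("cooperate", "defect")


class ActorCritic:
    """Tabular actor-critic over the four round outcomes (illustrative sketch)."""

    def __init__(self, states=STATES, actions=ACTIONS, seed=None):
        self.actions = actions
        self.value = {s: 0.0 for s in states}                       # Critic V(s)
        self.pref = {(a, s): 0.0 for a in actions for s in states}  # Actor Pi(a, s)
        self.rng = random.Random(seed)

    def policy(self, prev_state):
        """Softmax over action preferences (an assumed selection rule)."""
        weights = [math.exp(self.pref[(a, prev_state)]) for a in self.actions]
        total = sum(weights)
        return [w / total for w in weights]

    def select_action(self, prev_state):
        """Sample an action given the outcome of the previous round."""
        return self.rng.choices(self.actions, weights=self.policy(prev_state))[0]

    def update(self, prev_state, action, state, reward):
        """Compute the TD-error and apply it to the Actor and the Critic."""
        td_error = reward + GAMMA * self.value[state] - self.value[prev_state]
        self.pref[(action, prev_state)] += DELTA * td_error
        # Assumption: a standard incremental Critic update with the same rate.
        self.value[prev_state] += DELTA * td_error
        return td_error


agent = ActorCritic(seed=0)
action = agent.select_action("P")
print(action, agent.update("P", action, "T", reward=1.0))
```

With all values initialized to zero, the first TD-error equals the reward received, so the preference for the chosen action rises by the learning rate times that reward.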

2.3.2. Rational Model

The Rational model is a predictive model that represents an ideal, perfectly rational, and self-interested player, provided that its predictions are correct. Its function is to serve as a benchmark for the other predictive models, since once it learns to predict its opponent’s actions accurately, it will always respond automatically with the optimal response to that predicted action. It is composed of two main functions: a prediction module and a deterministic utility maximization function (see Figure 3). The first module learns to predict the next action of the opponent using the opponent’s previous state as input. Once the prediction is made, the second module calculates the action that will render the highest reward assuming that the opponent has chosen the predicted action.

The predictive module also uses a TD-learning algorithm but substitutes the explicit reward for an implicit reward that comes from the error in predicting the other agent’s action. We call this reward ‘implicit’ because it is calculated internally and is based on whether the action of the opponent was accurately predicted or not. That way, an accurate prediction will render a positive reward, while an incorrect one will receive a negative one.

The implicit reward signal is calculated by the function:

$$R_{i} = \begin{cases} 1, & \text{if } a_{o}(t) = \hat{a}_{o}(t - 1) \\ -1, & \text{otherwise} \end{cases} \tag{1}$$
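The two functions of the Rational model can be sketched in a few lines: the implicit reward of Equation (1) and a deterministic best response to the predicted action. The payoff dictionary layout, the function names, and the numeric payoffs are hypothetical and serve only as an illustration.

```python
ACTIONS = ("cooperate", "defect")


def implicit_reward(opponent_action, predicted_action):
    """Equation (1): +1 for a correct prediction of the opponent, -1 otherwise."""
    return 1 if opponent_action == predicted_action else -1


def own_payoffs(R, S, T, P):
    """Own reward indexed by (own action, opponent action), following Table 1."""
    return {
        ("cooperate", "cooperate"): R,
        ("cooperate", "defect"): S,
        ("defect", "cooperate"): T,
        ("defect", "defect"): P,
    }


def best_response(predicted_action, payoffs):
    """Pick the action with the highest reward if the prediction is correct.

    Ties are resolved in favor of the first action in ACTIONS (an assumption).
    """
    return max(ACTIONS, key=lambda own: payoffs[(own, predicted_action)])


# Hypothetical payoffs satisfying the orderings given in the text.
prisoners_dilemma = own_payoffs(T=5, R=3, P=1, S=0)
stag_hunt = own_payoffs(R=5, T=3, P=1, S=0)
print(best_response("cooperate", prisoners_dilemma))  # defect
print(best_response("cooperate", stag_hunt))          # cooperate
```

The example shows why the same prediction leads to different actions across games: against a predicted cooperator, defection pays more in the Prisoner’s Dilemma, while cooperation pays more in the Stag Hunt.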

2.3.3. Predictive Model

The Predictive model has two distinct RL modules: one that learns to predict and one that chooses an action based on that prediction (see Figure 4). The predictive RL algorithm is identical to the one used by the Rational agent described above, so it also learns from the implicit reward described in Equation (1).

The prediction generated at the beginning of each round by the predictive-RL module is sent as an input to the second RL algorithm in charge of decision-making. The decision-making RL module uses the combined information of the opponent’s predicted action and the state of the previous round to learn the optimal action. For its update function, it uses the explicit reward obtained in that round of the game.

2.3.4. Internal Model

The Internal Model is also composed of two RL algorithms, one predictive and one for learning the optimal policy (see Figure 5). In terms of function, the decision-making RL module is identical to the one used by the Predictive model: it integrates both the previous state of the game and the opponent’s predicted action in order to learn the optimal policy.

However, it differs from the Predictive model in the way its first module is designed. In this case, the predictive-RL algorithm is updated by the explicit reward that the opponent has obtained, in an attempt to create an internal model of the other agent’s policy. At the functional level, it is very similar to the Predictive model, but with an important difference: while the Predictive model tries to predict the opponent’s action by focusing on its overt behavior, the Internal Model does so by trying to learn the internal policy of its opponent.
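The difference between the Predictive and Internal models comes down to which signal trains the first module, while both share the same enlarged input for the decision-making module. The sketch below makes these two points explicit; the function names and the string labels for the models are hypothetical.

```python
from itertools import product

OUTCOMES = ("R", "S", "T", "P")
ACTIONS = ("cooperate", "defect")


def decision_states():
    """Inputs of the decision-making module: (previous outcome, predicted action)."""
    return list(product(OUTCOMES, ACTIONS))


def predictor_reward(model, predicted_action, opponent_action, opponent_reward):
    """Learning signal for the predictive module of each model."""
    if model == "predictive":
        # Implicit reward of Equation (1): based on the opponent's overt behavior.
        return 1 if predicted_action == opponent_action else -1
    if model == "internal":
        # Explicit reward obtained by the opponent in that round.
        return opponent_reward
    raise ValueError(f"unknown model: {model}")


print(len(decision_states()))  # 8 combined states, versus 4 for plain TD-learning
print(predictor_reward("predictive", "cooperate", "defect", opponent_reward=5))
print(predictor_reward("internal", "cooperate", "defect", opponent_reward=5))
```

The doubling of the state space is a direct consequence of conditioning on the predicted action in addition to the previous outcome.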

2.3.5. Deterministic Agent Models

In order to test the correct functioning of the predictive capacities of the ToM agents described above, we have implemented a number of agents with a fixed behavior or policy. The first two represent two behavioral phenotypes observed in humans [37] while the third one is a classic benchmark in game theory [48].

Greedy

This model implements a simple behavioral strategy of pure self-utility maximization: it always chooses the action that renders the highest reward to itself without taking into account the opponent’s action. So for the games described in this work, it will always choose to ‘defect’ with the exception of the Stag Hunt and the Harmony Game, where the cooperative reward can give a higher reward than the temptation to defect ($R > T$).

$$\pi_{greedy} = \begin{cases} \text{cooperate}, & \text{if } R > T \\ \text{defect}, & \text{otherwise} \end{cases} \tag{2}$$

Cooperative/Nice

As a deterministic counterpart of the Greedy model, the Nice model executes an even simpler deterministic strategy: it will always choose cooperation.

$$\pi_{nice} = P(a = \text{cooperate} \mid s_{t}) = 1 \tag{3}$$

Tit-for-Tat

This simple yet powerful strategy became popular in Axelrod’s famous tournament [49], where it won against all competing algorithms and strategies in a contest based on the Iterated Prisoner’s Dilemma. We have introduced this opponent because we expect it to pose problems for the predictive models. Tit-for-tat always starts by cooperating, and from then on, it always chooses the last action made by its opponent. Since its capacity to switch actions does not depend on a specific policy, it cannot be learned by just taking into account variables such as its previous state or its reward. It can be described by the following equation, where $a_{opponent}$ is the action made by the opponent at $t - 1$:

$$\pi_{tft} = \begin{cases} \text{cooperate}, & \text{if } t = 0 \\ a_{opponent}(t - 1), & \text{otherwise} \end{cases} \tag{4}$$
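Equations (2) to (4) translate almost directly into code. The following minimal sketch uses hypothetical function names and represents the opponent's past actions as a list.

```python
def greedy_policy(R, T):
    """Equation (2): cooperate only when mutual cooperation beats temptation."""
    return "cooperate" if R > T else "defect"


def nice_policy():
    """Equation (3): cooperate with probability 1, whatever the state."""
    return "cooperate"


def tit_for_tat_policy(opponent_history):
    """Equation (4): cooperate first, then repeat the opponent's last action."""
    if not opponent_history:  # t = 0
        return "cooperate"
    return opponent_history[-1]


# Tit-for-tat facing an opponent that plays defect, defect, cooperate.
opponent_actions = ["defect", "defect", "cooperate"]
replies = [tit_for_tat_policy(opponent_actions[:t]) for t in range(4)]
print(replies)  # ['cooperate', 'defect', 'defect', 'cooperate']
```

Note that the Tit-for-tat reply depends only on the other player's previous action, which is why a predictor that ignores its own past action has no direct handle on it.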

2.4. Experimental Setup

First, we perform four different experiments in which we test the four learning models described above (TD-learning, Rational, Predictive, Internal) in each of the five games described in Section 2.1 (Prisoner’s Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes) and against opponents of different levels of complexity (Greedy, Nice, Tit-for-tat, and TD-learning), resulting in a 4 × 5 × 4 experimental setup. This way, in experiment one, all models are tested against the Greedy agent; in experiment two, against the Nice agent; in experiment three, against the Tit-for-tat agent; and in experiment four, against the TD-learning agent.

In all experiments, each model plays each of the five games 50 times, with every iteration lasting 1000 rounds (i.e., 5 games, 50 dyads per game, 1000 rounds per dyad). On each iteration, every model starts learning from scratch, with no previous training period. After that, we perform a fifth experiment to study a real-time spatial version of the special cases in which the models encountered problems in the previous setting. So, in the first four experiments, we use the discrete-time version of the games, and for the fifth experiment, a continuous-time version.

In the discrete-time version studied in Experiments 1 to 4, both agents simultaneously choose an action at the beginning of each round. Immediately after that, the outcome of the round is calculated, and each agent receives a reward equal to the value stated in the payoff matrix of the game.
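The discrete-time protocol can be sketched as a simple loop in which both actions are fixed before the outcome is resolved. The policy interface (a callable that receives the opponent's past actions), the function names, and the payoff values are assumptions for illustration; the learning agents in the paper take the previous state of the game as input instead.

```python
OUTCOMES = {
    ("cooperate", "cooperate"): ("R", "R"),
    ("cooperate", "defect"): ("S", "T"),
    ("defect", "cooperate"): ("T", "S"),
    ("defect", "defect"): ("P", "P"),
}


def play_round(action_1, action_2, payoff):
    """Resolve one round from two simultaneously chosen actions."""
    state_1, state_2 = OUTCOMES[(action_1, action_2)]
    return (state_1, state_2), (payoff[state_1], payoff[state_2])


def run_dyad(policy_1, policy_2, payoff, n_rounds=1000):
    """Play n_rounds and return the accumulated score of each agent."""
    history_1, history_2 = [], []
    score_1 = score_2 = 0
    for _ in range(n_rounds):
        # Both choices are made before either history is extended.
        action_1 = policy_1(history_2)
        action_2 = policy_2(history_1)
        _, (reward_1, reward_2) = play_round(action_1, action_2, payoff)
        history_1.append(action_1)
        history_2.append(action_2)
        score_1 += reward_1
        score_2 += reward_2
    return score_1, score_2


def tit_for_tat(opponent_history):
    return opponent_history[-1] if opponent_history else "cooperate"


def always_defect(opponent_history):
    return "defect"


# Hypothetical Prisoner's Dilemma payoffs (T > R > P > S).
PD = {"T": 5, "R": 3, "P": 1, "S": 0}
print(run_dyad(tit_for_tat, always_defect, PD))  # (999, 1004)
```

In this toy dyad, Tit-for-tat is exploited only in the first round and both agents collect the mutual-defection payoff afterwards.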

In the continuous-time version studied in Experiment 5, we implement the games in a 2D simulated robotic environment (see Figure 6). The two agents can choose to go to one of the two equally distant locations that represent the two actions of a 2-action matrix-form game. In this version, agents also choose an action at the beginning of the round, but then they have to navigate towards the selected location until one of them reaches its destination. The real-time dimension of this implementation allows the possibility of changing the action during the course of the round as well as receiving feedback in real-time. Each round of the game begins with both agents in their initial positions, as shown in Figure 6B. When the agents reach a reward location, the round ends, the rewards are distributed according to the payoff matrix, and the game restarts with the agents back in their starting positions. The agents of the continuous version are embodied in virtual ePuck robots (see Figure 6A). They are equipped with two motors (for the left and right wheels) and three pairs of proximity sensors; one pair is specialized in detecting the other agent, and the other two in detecting the two distinct reward locations (‘cooperate’ or ‘defect’).

3. Results

In order to study the performance of the four agents (TD-learning, Rational, Predictive, and Internal), we focus on three aspects of the interaction: Efficacy, Stability, and Prediction Accuracy. Efficacy tells us how competent the models were in obtaining rewards on average and in relation to each other. It is computed by averaging the total score obtained per game by each agent (i.e., the accumulated score over the 1000 rounds of a game for each agent, averaged over the 50 agents of the same kind). Stability measures how predictable the behavior was or, equivalently, whether the models converged to a common strategy or alternated between non-deterministic states. In other words, Stability quantifies how predictable the outcomes of the following rounds are, based on previous results, by using the information-theoretic measure of surprisal (also known as self-information), which can be defined as the negative logarithm of the probability of an event [50]. By computing the average surprisal value of a model over time, we can visually observe how long it took for a model to converge to a stable strategy and for how long it was able to maintain it (for more details on the computation of surprisal, see [24,35]). Finally, Prediction Accuracy gives us a better understanding of how well the predictive models were able to actually predict their opponent’s behavior. It is measured by computing for each agent the accumulated prediction errors (i.e., the number of times the agent fails to predict the opponent’s action) across the 1000 rounds of a game and then calculating the running average over the total sample (n = 50) of each predictive model.
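The three metrics can be approximated with a few helper functions. This is only one possible reading: the exact surprisal computation is described in [24,35], so the use of a base-2 logarithm and of empirical outcome frequencies over the whole sequence are assumptions, as are the function names.

```python
import math
from collections import Counter


def efficacy(total_scores):
    """Mean accumulated score over the sample of agents (n = 50 in the text)."""
    return sum(total_scores) / len(total_scores)


def surprisal(probability):
    """Self-information of an event, in bits (base 2 is an assumption)."""
    return -math.log2(probability)


def mean_surprisal(outcomes):
    """Average surprisal of a sequence, using its own empirical frequencies."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return sum(surprisal(counts[o] / n) for o in outcomes) / n


def prediction_error_rate(predicted, actual):
    """Fraction of rounds in which the predicted opponent action was wrong."""
    errors = sum(p != a for p, a in zip(predicted, actual))
    return errors / len(actual)


print(mean_surprisal(["R", "R", "R", "R"]))  # a fully stable dyad: 0 bits
print(mean_surprisal(["R", "P", "R", "P"]))  # two equally likely outcomes: 1 bit
print(prediction_error_rate(["cooperate", "defect"], ["cooperate", "cooperate"]))
```

Lower mean surprisal corresponds to higher Stability. A strictly alternating sequence still scores 1 bit under this frequency-based estimate, which is one reason the windowed procedure of [24,35] may differ from this sketch.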

In the context of each conducted experiment, the findings are presented in their corresponding figures, wherein the above three key metrics are examined: Efficacy in the left panels, Stability in the center panels, and Prediction Accuracy in the right panels. The reported results show the mean values obtained from a sample size of n = 50 for each game and condition.

3.1. Experiment 1: Versus a Deterministic-Greedy Agent

The results of this experiment show that, on average, all models were equally effective since they obtained a similar amount of rewards (see Figure 7). The Rational model only achieves a slightly better performance than the rest in the Stag Hunt and the Harmony Game, which is a sign that the rest of the models were performing almost optimally. Of course, if we pay attention to the overall score obtained among the five games, we can observe a significant difference between the performance obtained by all models in the Harmony Game and the Stag Hunt when compared to the other three games. This salient difference is not due to a malfunctioning of the predictive models, as we can see from the results in Stability and Prediction Accuracy. It is caused by the constraints imposed by the Greedy opponent’s strategy in those games where $R > T$, since they are the ones in which it always chooses to ‘cooperate’. We see that all models learn to predict or to adapt their strategy to this opponent. In terms of Stability, we can observe how Rational agents outperform all the other models. This result is not surprising, given the simplicity of the opponent’s greedy strategy. Against these simple, predictable agents, we consider the Rational agent a reference point of optimal performance, against which the other models can be compared. Along these lines, it is relevant to note that overall, both Predictive and Internal models converge to a stable state faster than the vanilla TD-learning model, which lacks the ability to predict the opponent’s action. In terms of Prediction Accuracy, no significant difference is observed among predictive models.

3.2. Experiment 2: Versus a Deterministic-Nice Agent

In this experiment, the overall Efficacy of all models increased significantly compared to that of the previous experiment (see Figure 8). The explanation is again that all models were able to successfully learn the optimal policy against a simple deterministic strategy such as the one exhibited by the Nice agent. However, in this case, the best response strategy against a Nice agent is the one that renders the highest possible benefit in all games, as opposed to the best response strategy against the Greedy model, which on many occasions was sub-optimal. In terms of Stability, we see again how the Predictive and the Internal agents perform slightly better than the baseline TD-learning. This time the difference is smaller, but since predicting the opponent’s behavior is almost trivial in this case, we do not find this result unexpected. Again, we can also observe how the TD-learning, the Predictive, and the Internal agents differ from optimal by comparing their convergence rate with that of the Rational agent.

3.3. Experiment 3: Versus a Tit-for-Tat Agent

The results of the models against the Tit-for-tat agent show a significant amount of variability in terms of Efficacy (see Figure 9). The most remarkable result in this respect is the superior Efficacy of the non-predictive TD-learning model in the Battle of the Exes. This model is able to beat the more complex predictive models in this case precisely because the predictive models have more difficulty in accurately predicting the TFT agent. This accumulated prediction error, in turn, drives their behavior toward a more unstable regime than the one achieved by the TD-learning model. Moreover, given that the TD-learning model has a smaller state space than the predictive models and only ‘cares’ about the previous state of the game, it capitalizes better than the rest on the anti-coordination structure of the Battle of the Exes.

In terms of Stability, predictive models perform better than standard TD-learning in the Prisoner’s Dilemma and in the Harmony Game, but this result is reversed in the Stag Hunt, the Hawk-Dove, and the Battle of the Exes, where the TD-learning agent reaches lower levels of surprisal, indicating higher Stability. We can see that when the game has one pure Nash equilibrium (the Prisoner’s Dilemma and the Harmony Game), the predictive models perform as well in terms of Stability as against the simpler deterministic agents studied before. However, in games with a wider variety of equilibria, such as the Stag Hunt or the Hawk-Dove, predictive models have overall more difficulty in settling into a stable state with the TFT opponent.
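The distinction between games with one pure Nash equilibrium and games with several can be verified from Tables 2–6. A pure-strategy Nash equilibrium is an action pair from which neither player can gain by deviating unilaterally. The following is a minimal sketch of that check; the names `GAMES` and `pure_nash_equilibria` are hypothetical and not part of the original study.

```python
from itertools import product

# (row payoff, column payoff) for each (row action, column action), from Tables 2-6.
GAMES = {
    "Prisoner's Dilemma": {("C", "C"): (2, 2), ("C", "D"): (0, 3),
                           ("D", "C"): (3, 0), ("D", "D"): (1, 1)},
    "Stag Hunt":          {("C", "C"): (3, 3), ("C", "D"): (0, 2),
                           ("D", "C"): (2, 0), ("D", "D"): (1, 1)},
    "Hawk-Dove":          {("C", "C"): (2, 2), ("C", "D"): (1, 3),
                           ("D", "C"): (3, 1), ("D", "D"): (0, 0)},
    "Harmony Game":       {("C", "C"): (3, 3), ("C", "D"): (1, 2),
                           ("D", "C"): (2, 1), ("D", "D"): (0, 0)},
    "Battle of the Exes": {("A", "A"): (0, 0), ("A", "B"): (1, 4),
                           ("B", "A"): (4, 1), ("B", "B"): (0, 0)},
}


def pure_nash_equilibria(game):
    """Return all action pairs where neither player benefits from deviating alone."""
    actions = sorted({row for row, _ in game})
    equilibria = []
    for row, col in product(actions, repeat=2):
        row_payoff, col_payoff = game[(row, col)]
        row_is_best = all(game[(alt, col)][0] <= row_payoff for alt in actions)
        col_is_best = all(game[(row, alt)][1] <= col_payoff for alt in actions)
        if row_is_best and col_is_best:
            equilibria.append((row, col))
    return equilibria


if __name__ == "__main__":
    for name, game in GAMES.items():
        print(name, pure_nash_equilibria(game))
```

With these payoffs, the Prisoner's Dilemma and the Harmony Game each have a single pure equilibrium (mutual defection and mutual cooperation, respectively), whereas the Stag Hunt, the Hawk-Dove, and the Battle of the Exes each have two.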

Finally, the most salient feature regarding the results of Prediction Accuracy is the elevated accumulation of errors of the Rational model in the Stag Hunt and the Battle of the Exes, which almost reached 50%. These results may seem counter-intuitive, but they actually highlight one of the weak points of the predictive models and the strength of the tit-for-tat strategy. As stated in the model’s description, the TFT agent selects its action based on the last action performed by its opponent, so attempting to predict the policy of such an agent can become impossible for a predictive agent unless the dyad falls into an equilibrium state very soon or the predictive agent is able to integrate its own action into its prediction algorithm.
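The point about integrating the agent's own action into the prediction can be illustrated with a toy simulation. This is a simplified sketch under stated assumptions and does not reproduce the paper's models: the agent here acts uniformly at random instead of learning, Tit-for-tat is assumed to open by cooperating, and the two predictors (`guess_marginal`, `guess_aware`) and the function `tft_prediction_accuracy` are hypothetical constructs introduced only for this example.

```python
import random


def tft_prediction_accuracy(rounds=1000, seed=0):
    """Return (marginal_accuracy, own_action_aware_accuracy) against Tit-for-tat."""
    rng = random.Random(seed)
    opponent_counts = {"C": 0, "D": 0}  # how often the opponent played each action
    response_to = {}                    # own previous action -> opponent's observed reply
    own_previous = "C"                  # assumption: TFT opens by cooperating
    hits_marginal = 0
    hits_aware = 0

    for _ in range(rounds):
        # Predictor 1 ignores the agent's own behavior and guesses the
        # opponent's most frequent action so far (ties resolve to "C").
        guess_marginal = max(("C", "D"), key=lambda a: opponent_counts[a])
        # Predictor 2 conditions its guess on the agent's own previous action.
        guess_aware = response_to.get(own_previous, "C")

        opponent_action = own_previous  # Tit-for-tat copies the agent's last action
        hits_marginal += guess_marginal == opponent_action
        hits_aware += guess_aware == opponent_action

        opponent_counts[opponent_action] += 1
        response_to[own_previous] = opponent_action
        own_previous = rng.choice(("C", "D"))  # hypothetical randomly exploring agent

    return hits_marginal / rounds, hits_aware / rounds


if __name__ == "__main__":
    marginal, aware = tft_prediction_accuracy()
    print(f"marginal predictor: {marginal:.2f}, own-action-aware predictor: {aware:.2f}")
```

In this toy setting, the predictor that conditions on the agent's own previous action makes at most one mistake, while the predictor that only tracks the opponent's action frequencies stays close to chance as long as the agent keeps varying its behavior. Once the dyad settles into an equilibrium, the agent stops varying its actions and both predictors become accurate, which matches the first exception named in the text.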

3.4. Experiment 4: Versus the TD-Learning Agent

In this experiment, we again observe similar performance across the models in terms of their Efficacy in obtaining rewards, with the exception of the Battle of the Exes, where the Predictive and Internal models achieve the best results (see Figure 10). Remarkably, the Rational model achieves a lower score than the rest. This is because it has fallen into a pure-dominance equilibrium with the TD-learning model, in which its best response is to perform the low-rewarding action. Here the TD-learning model capitalizes on the fact that it is driven solely by the maximization of its own utility, so it took the initiative and moved to the higher-rewarding action faster than the Rational model. The Rational model, for its part, predicts accurately and responds optimally to that accurate prediction. In this case, however, this behavioral strategy makes it fall victim to the pure-dominance equilibrium enforced by its opponent.

Regarding Stability, the results show that, overall, the Predictive and the Internal models converge faster toward an equilibrium than the vanilla TD-learning model. In all cases, we can observe that these two predictive models fall between the ‘optimal’ level of convergence of the Rational model and the more unstable level of the non-predictive TD-learning model. The Prediction Accuracy results show the worst overall performance of the four experiments studied so far. This, however, was an expected outcome, since the predictive agents were facing a non-deterministic learning agent, which is objectively more difficult to predict.
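Stability is discussed throughout these experiments in terms of surprisal, with lower surprisal indicating a more stable dyad. As an illustration only, the sketch below computes the mean Shannon surprisal of a sequence of round outcomes under a first-order Markov estimate with add-one smoothing. This particular estimator is an assumption made for the example and may differ from the definition used in the Methods; the function name `mean_surprisal` and the outcome labels are hypothetical.

```python
import math
import random
from collections import defaultdict


def mean_surprisal(outcomes, alphabet):
    """Average of -log2 P(outcome_t | outcome_{t-1}), estimated online from past rounds."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[previous][current]
    total_bits = 0.0
    steps = 0
    for previous, current in zip(outcomes, outcomes[1:]):
        seen = counts[previous]
        # Add-one (Laplace) smoothed probability from the transitions seen so far.
        probability = (seen[current] + 1) / (sum(seen.values()) + len(alphabet))
        total_bits += -math.log2(probability)
        steps += 1
        seen[current] += 1  # update only after scoring the current round
    return total_bits / steps if steps else 0.0


if __name__ == "__main__":
    alphabet = ["CC", "CD", "DC", "DD"]  # joint outcomes of a round
    rng = random.Random(0)
    stable = ["CC"] * 100                # dyad locked into one outcome
    erratic = [rng.choice(alphabet) for _ in range(200)]
    print("stable dyad :", round(mean_surprisal(stable, alphabet), 3), "bits")
    print("erratic dyad:", round(mean_surprisal(erratic, alphabet), 3), "bits")
```

A dyad that repeats the same outcome yields a mean surprisal close to zero, while a dyad whose outcomes look random over four possible outcomes yields roughly two bits per round, so a lower value corresponds to faster or more complete convergence.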

3.5. Experiment 5: Continuous-Time Effects on Prediction Accuracy

As an additional control experiment, we investigate whether the instances where the Rational agent showed poor Prediction Accuracy against the TFT agent in Experiment 3 are solely attributed to the discrete-time nature of ballistic games or if similar errors persist under continuous-time interactions with real-time feedback.

The findings presented in Figure 11 demonstrate that Prediction Accuracy is indeed restored under real-time conditions. This improvement is attributed to the Rational agent’s ability to update its predictive model more rapidly and adjust its chosen course of action in real-time before the round concludes. Consequently, this reduced prediction error against the TFT opponent allows the Rational agent to adopt a more effective strategy, leading to the establishment of a stable equilibrium within the dyad (see Figure 11, middle panel).

3.6. Comparison against Human Data

Finally, in order to inspect which model is closer to human behavior, we compare the results of our models with human data obtained in a study using a similar experimental setup. We analyzed a human behavioral dataset of participants playing the Battle of the Exes game in dyads, published in [35]. In this behavioral experiment, participants played against each other for 50 rounds, so unfortunately, the amount of available data limits the conclusions we can draw from it. In Figure 12, we reproduce the behavioral results of that study alongside the results of our models playing the Battle of the Exes under a similar payoff structure. When compared with human data, we can see that, of the models presented in this study, the Predictive and the Internal models are the closest to human behavior in terms of Stability. Interestingly, this similarity with human behavior occurs only when the models are tested against a greedy opponent. On the other hand, it is important to note that the models that diverge most from human data are the pure TD-learning and the Rational predictive model.

4. Conclusions

This work presents a novel control-based cognitive modeling approach to autonomous multi-agent learning models, which combines adaptive feedback control with reinforcement learning. Based on a layered-control architecture, we have designed and implemented agent models characterized by seven different behavioral phenotypes. This includes both deterministic as well as probabilistic learning models. The former category includes three phenotypes: Cooperative, Greedy, and Tit-for-tat, whereas the latter includes four: TD-learning, Rational, Predictive, and Internal. We tested these agent models in dyadic games against each other across five different game theory settings (Harmony Game, Stag Hunt, Hawk-Dove, Prisoner’s Dilemma, and Battle of the Exes). From human behavioral experiments, it is known that the scope of these games sufficiently encompasses behavioral phenotypes in human social decision-making [35,37]. Our proposed agent models were designed and implemented with the aim of deepening our understanding of how cognitive agents learn and make decisions when interacting with other agents in social scenarios (in the context of the game-theoretic tasks stated above).

Additionally, we also implemented our agent models as virtual ePuck robots. The simulated robots are Braitenberg vehicles [41] equipped with sensors and fully embodied controllers (which are the proposed control-based learning modules). These agents are situated in the sense that they only have partial observability of the environment and other agents depending on their location and sensors during rounds. We tested these simulated robots within a spatial continuous-time version of the games involving real-time feedback. We showed that under fully embodied and situated conditions, limitations encountered by agents in ballistic scenarios, such as lack of convergence to an optimal solution or an increase in the number of ties, can be overcome.

Our results show that agents using pure reinforcement-learning strategies are not optimal in most of the tested games (as shown by the Efficacy and Stability metrics). On the other hand, ToM models that predict actions or policies of opposing agents showed faster convergence to equilibria and increased performance. Interestingly, the Rational agent (which knows the best action to take in each game and situation if the opponent’s action is foreseen) outperforms others in many games. However, upon examining human data obtained in the Battle of the Exes game by [35], we found that the Stability profiles of human players matched closest to our Predictive and Internal models rather than the Rational or vanilla TD-learning agents. Although further validation with behavioral data is required to draw any significant conclusion, this result indicates that the models that more closely match human behavior are those that deviate slightly from rationality but do not rely solely on reinforcement learning, thus supporting the notion of bounded rationality in decision-making processes [19].

An especially noteworthy outcome is that both the Predictive and Internal ToM agents successfully acquire the ability to accurately predict their opponent’s actions and policies, respectively, across all games. This shows how autonomous, embodied, and situated multi-agent systems can be equipped with simple Theory of Mind capabilities (in the context of the specified tasks). For future work, it would be interesting to extend these agents with meta-learning models so that they may be able to choose which behavioral phenotype to enact in a given social scenario and flexibly switch in changing environments.

Even though the behavioral strategies modeled in this study are some of the simplest building blocks of human behavior, they are by no means a complete list. Additionally, typical human behavior is not rigidly tied to a given model but rather can flexibly shift between behavioral strategies and learning mechanisms [51,52,53]. Model flexibility is an important consideration for further developing artificial systems capable of ToM. Another issue that is relevant for many real-world tasks is the cost or penalty of being engaged too often in less desirable outcomes such as ties or low rewards [54]. In a sense, this could serve as an incentive to improve performance in complex tasks and should be taken into account when modeling ToM. Yet another typical human feature is trust when dealing with repeated social interactions [55]. Humans tend to form social conventions by agreeing on simple rules or deals and forming a trust relation. At the moment, our models lack the ability to form teams that agree on collective goals and work cooperatively to achieve success.

Finally, our control-based reinforcement learning architecture provides an interesting possibility to implement complex social behaviors on multi-agent robotic platforms, such as humanoid robots [56] or other socially competent artificial systems [57]. Another domain that may benefit from the type of agent models described in this paper is the study of evolutionary dynamics; in particular, the evolution of cognitive and intelligent agents. Social interactions involving cooperation and competition are known to play a key role in many evolutionary accounts of biological life and consciousness [58,59]. Lastly, autonomous multi-agent models, of the type described above, with ToM capabilities, might provide useful tools for modeling the dynamics of global socio-political and cultural phenomena such as selective social learning and cumulative cultural evolution [60,61].

Author Contributions

Conceptualization, I.T.F. and X.D.A.; methodology, I.T.F. and X.D.A.; software, I.T.F. and J.-Y.P.; validation, X.D.A., J.-Y.P. and P.V.; investigation, I.T.F. and X.D.A.; resources, P.V.; writing—original draft preparation, I.T.F.; writing—review and editing, all authors; visualization, I.T.F.; supervision, X.D.A., J.-Y.P. and P.V.; project administration, P.V.; funding acquisition, P.V. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement ID:820742, and from the European Union’s Horizon EIC Grants 2021 under grant agreement ID:101071178.

Data Availability Statement

Data for reproducing the results shown in this paper will be made available in a public repository.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ToM	Theory of Mind
DAC	Distributed Adaptive Control
CRL	Control-based Reinforcement Learning
TD-Learning	Temporal-Difference Learning
TFT	Tit-for-tat

References

1. Premack, D.; Woodruff, G. Does the chimpanzee have a theory of mind? Behav. Brain Sci. 1978, 1, 515–526.
2. Baron-Cohen, S.; Leslie, A.M.; Frith, U. Does the autistic child have a “theory of mind”? Cognition 1985, 21, 37–46.
3. Premack, D. The infant’s theory of self-propelled objects. Cognition 1990, 36, 1–16.
4. Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4190–4203.
5. Lerer, A.; Peysakhovich, A. Learning social conventions in markov games. arXiv 2018, arXiv:1806.10071.
6. Zhao, Z.; Zhao, F.; Zhao, Y.; Zeng, Y.; Sun, Y. A brain-inspired theory of mind spiking neural network improves multi-agent cooperation and competition. Patterns 2023, 100775.
7. Rabinowitz, N.C.; Perbet, F.; Song, H.F.; Zhang, C.; Eslami, S.; Botvinick, M. Machine Theory of Mind. arXiv 2018, arXiv:1802.07740.
8. Sclar, M.; Neubig, G.; Bisk, Y. Symmetric machine theory of mind. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 19450–19466.
9. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
10. Yoshida, W.; Dolan, R.J.; Friston, K.J. Game theory of mind. PLoS Comput. Biol. 2008, 4, e1000254.
11. Baker, C.; Saxe, R.; Tenenbaum, J. Bayesian Theory of Mind: Modeling Joint Belief-Desire Attribution. In Proceedings of the Annual Meeting of the Cognitive Science Society, Boston, MA, USA, 20–23 July 2011; Volume 33.
12. Baker, C.L.; Jara-Ettinger, J.; Saxe, R.; Tenenbaum, J.B. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nat. Hum. Behav. 2017, 1, 0064.
13. Lake, B.M.; Ullman, T.D.; Tenenbaum, J.B.; Gershman, S.J. Building machines that learn and think like people. Behav. Brain Sci. 2017, 40, e253.
14. Berke, M.; Jara-Ettinger, J. Integrating Experience into Bayesian Theory of Mind. In Proceedings of the Annual Meeting of the Cognitive Science Society, Toronto, ON, Canada, 27–30 July 2022; Volume 44.
15. Abbeel, P.; Ng, A.Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; p. 1.
16. Jara-Ettinger, J. Theory of mind as inverse reinforcement learning. Curr. Opin. Behav. Sci. 2019, 29, 105–110.
17. Wu, H.; Sequeira, P.; Pynadath, D.V. Multiagent Inverse Reinforcement Learning via Theory of Mind Reasoning. arXiv 2023, arXiv:2302.10238.
18. Ruiz-Serra, J.; Harré, M.S. Inverse Reinforcement Learning as the Algorithmic Basis for Theory of Mind: Current Methods and Open Problems. Algorithms 2023, 16, 68.
19. Kahneman, D.; Slovic, P.; Tversky, A. Judgment under Uncertainty: Heuristics and Biases; Cambridge University Press: Cambridge, UK, 1982.
20. Cuzzolin, F.; Morelli, A.; Cirstea, B.; Sahakian, B.J. Knowing me, knowing you: Theory of mind in AI. Psychol. Med. 2020, 50, 1057–1061.
21. Albrecht, S.V.; Stone, P. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artif. Intell. 2018, 258, 66–95.
22. Wang, Y.; Zhong, F.; Xu, J.; Wang, Y. Tom2c: Target-oriented multi-agent communication and cooperation with theory of mind. arXiv 2021, arXiv:2111.09189.
23. Yuan, L.; Fu, Z.; Zhou, L.; Yang, K.; Zhu, S.C. Emergence of theory of mind collaboration in multiagent systems. arXiv 2021, arXiv:2110.00121.
24. Freire, I.T.; Moulin-Frier, C.; Sanchez-Fibla, M.; Arsiwalla, X.D.; Verschure, P.F. Modeling the formation of social conventions from embodied real-time interactions. PLoS ONE 2020, 15, e0234434.
25. Freire, I.T.; Puigbò, J.Y.; Arsiwalla, X.D.; Verschure, P.F. Limits of Multi-Agent Predictive Models in the Formation of Social Conventions. Artif. Intell. Res. Dev. Curr. Chall. New Trends Appl. 2018, 308, 297.
26. Köster, R.; McKee, K.R.; Everett, R.; Weidinger, L.; Isaac, W.S.; Hughes, E.; Duéñez-Guzmán, E.A.; Graepel, T.; Botvinick, M.; Leibo, J.Z. Model-free conventions in multi-agent reinforcement learning with heterogeneous preferences. arXiv 2020, arXiv:2010.09054.
27. Kleiman-Weiner, M.; Ho, M.K.; Austerweil, J.L.; Littman, M.L.; Tenenbaum, J.B. Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction. In Proceedings of the CogSci, Philadelphia, PA, USA, 10–13 August 2016.
28. Perolat, J.; Leibo, J.Z.; Zambaldi, V.; Beattie, C.; Tuyls, K.; Graepel, T. A multi-agent reinforcement learning model of common-pool resource appropriation. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3643–3652.
29. Peysakhovich, A.; Lerer, A. Prosocial learning agents solve generalized stag hunts better than selfish ones. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, Stockholm, Sweden, 10–15 July 2018; International Foundation for Autonomous Agents and Multiagent Systems: London, UK, 2018; pp. 2043–2044.
30. Freire, I.T.; Puigbò, J.Y.; Arsiwalla, X.D.; Verschure, P.F. Modeling the Opponent’s Action Using Control-Based Reinforcement Learning. In Proceedings of the Conference on Biomimetic and Biohybrid Systems; Springer: Berlin/Heidelberg, Germany, 2018; pp. 179–186.
31. Gaparrini, M.J.; Sánchez-Fibla, M. Loss Aversion Fosters Coordination in Independent Reinforcement Learners. Artif. Intell. Res. Dev. Curr. Chall. New Trends Appl. 2018, 308, 307.
32. Leibo, J.Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; Graepel, T. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, Sao Paulo, Brazil, 8–12 May 2017; International Foundation for Autonomous Agents and Multiagent Systems: London, UK, 2017; pp. 464–473.
33. Peysakhovich, A.; Lerer, A. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv 2017, arXiv:1710.06975.
34. Nash, J.F. Equilibrium points in n-person games. Proc. Natl. Acad. Sci. USA 1950, 36, 48–49.
35. Hawkins, R.X.; Goldstone, R.L. The formation of social conventions in real-time environments. PLoS ONE 2016, 11, e0151670.
36. Hawkins, R.X.; Goodman, N.D.; Goldstone, R.L. The emergence of social norms and conventions. Trends Cogn. Sci. 2018.
37. Poncela-Casasnovas, J.; Gutiérrez-Roig, M.; Gracia-Lázaro, C.; Vicens, J.; Gómez-Gardeñes, J.; Perelló, J.; Moreno, Y.; Duch, J.; Sánchez, A. Humans display a reduced set of consistent behavioral phenotypes in dyadic games. Sci. Adv. 2016, 2, e1600451.
38. Sanfey, A.G. Social decision-making: Insights from game theory and neuroscience. Science 2007, 318, 598–602.
39. Verschure, P.F.; Voegtlin, T.; Douglas, R.J. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature 2003, 425, 620.
40. Moulin-Frier, C.; Arsiwalla, X.D.; Puigbò, J.Y.; Sanchez-Fibla, M.; Duff, A.; Verschure, P.F. Top-Down and Bottom-Up Interactions between Low-Level Reactive Control and Symbolic Rule Learning in Embodied Agents. In Proceedings of the CoCo@ NIPS, Barcelona, Spain, 9 December 2016.
41. Braitenberg, V. Vehicles: Experiments in Synthetic Psychology; MIT Press: Cambridge, MA, USA, 1986.
42. Corbetta, M.; Shulman, G.L. Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci. 2002, 3, 201.
43. Koechlin, E.; Ody, C.; Kouneiher, F. The architecture of cognitive control in the human prefrontal cortex. Science 2003, 302, 1181–1185.
44. Munakata, Y.; Herd, S.A.; Chatham, C.H.; Depue, B.E.; Banich, M.T.; O’Reilly, R.C. A unified framework for inhibitory control. Trends Cogn. Sci. 2011, 15, 453–459.
45. Den Ouden, H.E.; Kok, P.; De Lange, F.P. How prediction errors shape perception, attention, and motivation. Front. Psychol. 2012, 3, 548.
46. Wacongne, C.; Labyt, E.; van Wassenhove, V.; Bekinschtein, T.; Naccache, L.; Dehaene, S. Evidence for a hierarchy of predictions and prediction errors in human cortex. Proc. Natl. Acad. Sci. USA 2011, 108, 20754–20759.
47. Sutton, R.S. Learning to predict by the methods of temporal differences. Mach. Learn. 1988, 3, 9–44.
48. Axelrod, R.; Hamilton, W.D. The evolution of cooperation. Science 1981, 211, 1390–1396.
49. Axelrod, R. Effective choice in the prisoner’s dilemma. J. Confl. Resolut. 1980, 24, 3–25.
50. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
51. Lengyel, M.; Dayan, P. Hippocampal contributions to control: The third way. Adv. Neural Inf. Process. Syst. 2007, 20.
52. Freire, I.T.; Amil, A.F.; Verschure, P.F. Sequential Episodic Control. arXiv 2021, arXiv:2112.14734.
53. Rosado, O.G.; Amil, A.F.; Freire, I.T.; Verschure, P.F. Drive competition underlies effective allostatic orchestration. Front. Robot. AI 2022, 9, 1052998.
54. Sweis, B.M.; Abram, S.V.; Schmidt, B.J.; Seeland, K.D.; MacDonald, A.W., III; Thomas, M.J.; Redish, A.D. Sensitivity to “sunk costs” in mice, rats, and humans. Science 2018, 361, 178–181.
55. Tutić, A.; Voss, T. Trust and game theory. In The Routledge Handbook of Trust and Philosophy; Routledge: London, UK, 2020; pp. 175–188.
56. Moulin-Frier, C.; Puigbo, J.Y.; Arsiwalla, X.D.; Sanchez-Fibla, M.; Verschure, P. Embodied artificial intelligence through distributed adaptive control: An integrated framework. arXiv 2017, arXiv:1704.01407.
57. Freire, I.T.; Urikh, D.; Arsiwalla, X.D.; Verschure, P.F. Machine Morality: From Harm-Avoidance to Human-Robot Cooperation. In Proceedings of the Conference on Biomimetic and Biohybrid Systems; Springer: Berlin/Heidelberg, Germany, 2020; pp. 116–127.
58. Arsiwalla, X.D.; Herreros, I.; Moulin-Frier, C.; Sánchez-Fibla, M.; Verschure, P.F. Is Consciousness a Control Process? In Proceedings of the CCIA, Catalonia, Spain, 19–21 October 2016; pp. 233–238.
59. Arsiwalla, X.D.; Sole, R.; Moulin-Frier, C.; Herreros, I.; Sanchez-Fibla, M.; Verschure, P. The Morphospace of Consciousness. arXiv 2017, arXiv:1705.11190.
60. Gopnik, A.; Meltzoff, A. Imitation, cultural learning and the origins of “theory of mind”. Behav. Brain Sci. 1993, 16, 521–523.
61. Gavrilets, S. Coevolution of actions, personal norms and beliefs about others in social dilemmas. Evol. Hum. Sci. 2021, 3, e44.

Figure 1. Representation of the Control-based Reinforcement Learning (CRL) model. The top red box depicts the Adaptive layer or the original CRL model, composed of an Actor-Critic Temporal-Difference (TD) learning algorithm. The TD learning algorithm is composed of an Actor module that generates the action-selection policy ($P$), and a Critic module that estimates the value ($V$) of a given state. Both the Actor and the Critic are updated by the TD-error signal ($e$) between the estimated state value and the actual reward obtained in that state. The bottom green box represents the Reactive layer, with its three sets of sensors, one for the ‘cooperate’ location ($s^{C}$), one for the ‘defect’ location ($s^{D}$), and one for the other agent ($s^{A}$); the two reactive behaviors, ‘approach cooperate location’ ($f^{C}$) and ‘approach defect location’ ($f^{D}$); and the two motors, one for the left wheel ($m_{l}$) and one for the right wheel ($m_{r}$). Between the two layers, the inhibitory function ($i$) regulates which reactive behaviors will be active depending on the action received from the Adaptive layer, while the error monitoring function ($pe$) manages the mismatch between the opponent’s predicted behavior and the actual observation in real-time.

Figure 2. Representations of the TD-learning model. (A) shows a detailed representation of the algorithm components, as implemented in the Adaptive layer of the CRL model [24]. (B) shows a simplified representation showing only the inputs and outputs of that same model.

Figure 3. Representation of the Rational Model. This model is composed of a predictive module (RL) that learns to predict the opponent’s future action and a utility maximization function (U) that computes the action that yields the highest reward based on the opponent’s predicted action. At the end of each round, the RL module is updated based on its prediction error.

Figure 4. Representation of the Predictive Model. This model is composed of a predictive module (RL) that learns to predict the opponent’s future action and a TD learning module (RL) that uses that prediction along with the previous state to learn the optimal policy. At the end of the round, the predictive-RL (green) is updated according to its error in the prediction.

Figure 5. Representation of the Internal Model. This model is composed of two TD learning algorithms. The first one (left, blue) learns to predict the opponent’s future action, while the second (right, red) uses that prediction to learn the optimal policy. The first algorithm is updated with the opponent’s reward and the second with the agent’s own reward.

Figure 6. (A) Top-down visualization of the agents used in the continuous version of the games. The green circles represent the location-specific sensors. The green lines that connect the location sensors with the wheels represent the Braitenberg-like excitatory and inhibitory connections. (B) Image of initial conditions of the dyadic games. The blue circles are the two agents facing each other (representing two ePuck robots viewed from the top). The big green circle represents the ‘cooperate’ reward location; the small green circle represents the ‘defect’ reward. The white circles around each reward spot mark the threshold of the detection area.

Figure 7. Main results of the four behavioral models (TD-learning, Rational, Predictive, and Internal) against a Greedy agent across five games. Panel rows show the results of the models at each game, starting from the top row: Prisoner’s Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes. Panel columns show the results across three metrics: Efficacy (left), Stability (center), and Prediction Accuracy (right).

Figure 8. Main results of the four behavioral models (TD-learning, Rational, Predictive, and Internal) against a Nice agent across five games. Panel rows show the results of the models at each game, starting from the top row: Prisoner’s Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes. Panel columns show the results across three metrics: Efficacy (left), Stability (center), and Prediction Accuracy (right).

Figure 9. Main results of the four behavioral models (TD-learning, Rational, Predictive, and Internal) against a TFT agent across five games. Panel rows show the results of the models at each game, starting from the top row: Prisoner’s Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes. Panel columns show the results across three metrics: Efficacy (left), Stability (center), and Prediction Accuracy (right).

Figure 10. Mean results of the four behavioral models (TD-learning, Rational, Predictive, and Internal) against the TD-learning agent across five games. Panel rows show the results of the models at each game, starting from the top row: Prisoner’s Dilemma, Hawk-Dove, Stag Hunt, Harmony Game, and Battle of the Exes. Panel columns show the results across three metrics: Efficacy (left), Stability (center), and Prediction Accuracy (right).

Figure 11. Comparative results of the Hawk-Dove game played in the continuous-time version of the game versus the classical discrete-time.

Figure 12. Results of the predictive models against human data. Model results showed the best fit to human data when tested against a greedy deterministic agent as an opponent. Human data reproduced from [35].

Table 2. Prisoner’s Dilemma.

	Cooperate	Defect
Cooperate	2, 2	0, 3
Defect	3, 0	1, 1

Table 3. Stag Hunt.

	Cooperate	Defect
Cooperate	3, 3	0, 2
Defect	2, 0	1, 1

Table 4. Hawk-Dove.

	Cooperate	Defect
Cooperate	2, 2	1, 3
Defect	3, 1	0, 0

Table 5. Harmony Game.

	Cooperate	Defect
Cooperate	3, 3	1, 2
Defect	2, 1	0, 0

Table 6. Battle of the Exes.

	A	B
A	0, 0	1, 4
B	4, 1	0, 0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

MDPI and ACS Style

Freire, I.T.; Arsiwalla, X.D.; Puigbò, J.-Y.; Verschure, P. Modeling Theory of Mind in Dyadic Games Using Adaptive Feedback Control. Information 2023, 14, 441. https://doi.org/10.3390/info14080441
