Article

Reinforcement Learning Methods for Emulating Personality in a Game Environment †

School of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
*
Author to whom correspondence should be addressed.
This article is a revised and expanded version of a paper published in Liapis, G.; Vordou, A.; Vlahavas, I. Machine Learning Methods for Emulating Personality Traits in a Gamified Environment. In Proceedings of the 13th Conference on Artificial Intelligence (SETN 2024), Piraeus, Greece, 11–13 September 2024.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(14), 7894; https://doi.org/10.3390/app15147894
Submission received: 19 June 2025 / Revised: 4 July 2025 / Accepted: 11 July 2025 / Published: 15 July 2025

Abstract

Reinforcement learning (RL), a branch of artificial intelligence (AI), is becoming more popular in a variety of application fields such as games, workplaces, and behavioral analysis, due to its ability to model complex decision-making through interaction and feedback. Traditional systems for personality and behavior assessment often rely on self-reported questionnaires, which are prone to bias and manipulation. RL offers a compelling alternative by generating diverse, objective behavioral data through agent–environment interactions. In this paper, we propose a Reinforcement Learning-based framework in a game environment, where agents simulate personality-driven behavior using context-aware policies and exhibit a wide range of realistic actions. Our method, which is based on the OCEAN Five personality model—openness, conscientiousness, extroversion, agreeableness, and neuroticism—relates psychological profiles to in-game decision-making patterns. The agents are allowed to operate in numerous environments, observe behaviors that were modeled using another simulation system (HiDAC) and develop the skills needed to navigate and complete tasks. As a result, we are able to identify the personality types and team configurations that have the greatest effects on task performance and collaboration effectiveness. Using interaction data derived from self-play, we investigate the relationships between behaviors motivated by the personalities of the agents, communication styles, and team outcomes. The results demonstrate that in addition to having an effect on performance, personality-aware agents provide a solid methodology for producing realistic behavioral data, developing adaptive NPCs, and evaluating team-based scenarios in challenging settings.

1. Introduction

Since AI allows systems to learn, adapt, and make decisions with minimal assistance from humans, it has revolutionized a wide range of industries. Its uses span from robotics and behavioral analysis to healthcare and finance, providing creative answers to challenging issues. Of these, gaming has become one of the most active areas for the use and development of AI methods. It is an area that has experienced rapid development, both in the entertainment industry and in its application to domains such as education and human resource management. The use of game-based methods to improve learning, engage participants, and meaningfully analyze behaviors is growing [1]. The use of virtual escape rooms (ER), which have become a popular gamified approach that promotes teamwork, communication, and problem-solving under time constraints, is a noteworthy example [2]. Companies now use these settings to evaluate both individual and group performance and foster teamwork.
The development of “Game AI,” which refers to the application of AI techniques to improve and automate various aspects of video games, coincides with the rise of gamification [3]. By enabling intelligent agent behavior to produce dynamic content and model player behaviors and interactions, artificial intelligence (AI) plays a crucial role in producing immersive experiences. Game AI has changed how games are created and played, moving away from conventional rule-based systems and toward contemporary machine learning algorithms. Creating agents that can play games, analyzing user behavior, and, more recently, debugging and detecting cheating are some of the primary uses of game AI. Personality modeling can be incorporated into these AI systems to produce agents that not only behave intelligently but also display unique behavioral profiles.
In our system, we developed a digital escape room (ER) simulation with the goal of collecting and examining the actions of individual agents in order to investigate the possibilities of AI-driven personality modeling in gamified settings. Our framework’s first stage focuses on single-agent deep reinforcement learning (DRL), in which each agent is trained to mimic particular personality traits based on the OCEAN Five personality model using a unique reward function. This makes it possible to comprehend in great detail how various traits—like extroversion, agreeableness, or conscientiousness—affect decision-making and problem-solving styles separately.
The framework was expanded into a multi-agent system to model team dynamics in a 3D digital escape room after individual behavioral patterns were identified. Each agent still exhibits a unique personality trait in this scenario, but they now work together to solve problems. We are able to investigate both the emergent characteristics of team performance based on different personality combinations as well as the individual efficacy of each agent by using self-play to generate diverse gameplay data.
Personality-driven agents created using this framework can be used in a variety of real-world scenarios in addition to analysis and simulation. They can improve immersion and variability in game development by acting as intelligent non-player characters (NPCs) with realistic, trait-based behavior. We utilized another simulator, HiDAC [4], as a basis for our system and more specifically for the behaviors the agents need to emulate. Furthermore, by mimicking various player types, these agents can serve as automated testers or debugging tools that assess system robustness, game balance, and usability under various behavioral scenarios.
The contributions of this work are as follows:
  • A scalable framework that supports both single-agent personality modeling and a multi-agent system for analyzing their performance in a gamified escape room environment.
  • A new reward mechanism that enables deep reinforcement learning agents to emulate distinct human personality traits, grounded in the OCEAN model, through their behavior and decision-making.
  • An analysis of how different combinations of personality traits impact team efficiency, coordination, and problem-solving effectiveness for assessment/debugging or NPC creation.

2. Literature Review

In this section, we present the background and context of our research. Next, we provide a review of the related work in the field, and lastly, we showcase the architecture of the reinforcement learning (RL) system.

2.1. Background

In this context, personality refers to consistent patterns of behavior or decision-making, whereas reinforcement learning (RL) provides a framework for agents to learn optimal behaviors through interaction with their environment. Modeling agents with human-like characteristics requires an understanding of how these ideas interact, which is defined in the following sections.

2.1.1. Reinforcement Learning

Reinforcement learning (RL) [5] is a branch of artificial intelligence (AI) that focuses on training agents, autonomous decision-making systems, to interact with dynamic environments. At each step, the agent observes the current state of the environment, uses an RL algorithm to select an action, and receives a reward based on the outcome; this feedback may be immediate or delayed. The objective is to learn a policy that maximizes cumulative rewards over time, with the rewards received at each stage helping the agent improve its approach to decision-making. The process takes place over a number of episodes, interactions that begin in a starting state and conclude when a particular objective or condition is met. To maximize long-term results, the agent gradually strikes a balance between exploitation (using learned strategies) and exploration (trying new actions) [6]. Through repeated interaction and feedback, the agent improves its decisions and becomes more effective at solving the given task [7]. In this way, RL serves the broader goal of AI, which is to develop intelligent systems, software or hardware, that replicate aspects of human intelligence, such as learning and decision-making.
Usually, an RL model is expressed as a Markov decision process (MDP), represented by the five-tuple M = (S, A, p, γ, R). S represents the state space (all possible states the environment can be in at any given time), A is the action space (all possible actions the agent can take in the environment), p is the environment dynamics function (the rules or probabilities that define how the environment transitions between states), γ is the discount factor (weighing future rewards relative to immediate rewards), and R is the reward function (the feedback mechanism that the environment uses to indicate the success or failure of the agent's actions) [6].
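To make these components concrete, the following minimal sketch shows how a generic agent–environment loop accumulates the discounted return over one episode; `env` and `policy` are hypothetical stand-ins for any environment and learning policy, not part of our Unity/ml-agents implementation.

```python
# Illustrative only: a generic episode loop showing how the discounted return
# G = sum_k gamma^k * r_k is accumulated. `env` and `policy` are hypothetical
# interfaces (reset/step and select_action/update), not the paper's actual code.

def run_episode(env, policy, gamma=0.99):
    state = env.reset()
    done = False
    episode_return = 0.0
    discount = 1.0
    while not done:
        action = policy.select_action(state)          # exploration vs. exploitation handled by the policy
        next_state, reward, done = env.step(action)   # environment dynamics p and reward function R
        episode_return += discount * reward           # accumulate the discounted return
        discount *= gamma
        policy.update(state, action, reward, next_state, done)
        state = next_state
    return episode_return
```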
We specify the world context in which our RL agents function as an environment. As determined by the ml-agents [8] package and our configuration files, we employ a centralized critic for the environment dynamics function p and include a discount factor γ of 0.99 in our agent system. The following sections provide a detailed discussion of the observation space, action space, and rewards.
In RL environments, autonomous agents must observe, interact with, and adapt to their surroundings to optimize performance over time. To support this, we employ different reinforcement learning strategies depending on the setting: Proximal Policy Optimization (PPO), Posthumous Credit Assignment (POCA), and Soft Actor-Critic (SAC) for single-agent scenarios, and Multi-Agent Posthumous Credit Assignment (MA-POCA) [9] for multi-agent systems. These methods consistently outperform conventional algorithms such as Deep Q-Learning, Trust Region Policy Optimization (TRPO), Independent Q-Learning, and Multi-Agent Advantage Actor–Critic (MA-A2C) [10,11].
PPO is preferred due to its effectiveness and stability. It avoids the high computational cost of TRPO by optimizing a clipped surrogate objective, while still being robust against excessively large policy updates. PPO is scalable across multiple agents in complex environments and well-suited to continuous control problems because of its balance between exploration and exploitation. By adding post-episode credit assignment based on agent contributions, POCA, which is intended for sparse reward settings, expands on PPO. It is particularly helpful in goal-driven single-agent tasks with limited intermediate feedback because it facilitates better learning when rewards are shared or delayed.
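For reference, the clipped surrogate objective that PPO maximizes takes the standard textbook form

L^CLIP(θ) = E_t[ min( r_t(θ) · Â_t, clip(r_t(θ), 1 − ε, 1 + ε) · Â_t ) ], with r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t),

where Â_t is the advantage estimate and ε the clipping range; we quote the general formulation here rather than any detail specific to our configuration.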
Entropy maximization and actor-critic architecture are combined in SAC, a model-free off-policy algorithm, to encourage more varied exploration. It is a good option for single-agent environments that need fine-grained control because of its sample efficiency and capacity to manage continuous action spaces.
By adding a counterfactual baseline to separate individual agent contributions from group outcomes and a centralized value function that estimates returns across all agents, MA-POCA expands on POCA for multi-agent settings. This facilitates more precise credit assignment and encourages lifelong learning, enabling agents who have met their goals to keep improving their policies.
This framework showed strong performance in high-dimensional, dynamic environments by utilizing MA-POCA in cooperative multi-agent systems and PPO, POCA, and SAC in single-agent contexts. It successfully avoids the drawbacks of other approaches, especially those that struggle with scalability or synchronization in complicated tasks.

2.1.2. Personality Traits

In this study, we adopt the OCEAN Five personality trait model, an acronym representing openness, conscientiousness, extroversion, agreeableness, and neuroticism. This model, also known as the Five-Factor Model (FFM), is among the most extensively validated and widely utilized frameworks in personality psychology [12,13].
These traits, each of which can be expressed in a positive or negative direction, strongly influence how a person interacts with other people and reacts to different circumstances.
  • Openness: Shows imagination, interest, and a readiness to try new things. People with high levels of openness make excellent team players who welcome creative approaches to problem-solving.
  • Conscientiousness: Shows organization, focus, and dedication, all of which are necessary for reaching long-term objectives. Conscientious people are excellent at organized work and teamwork.
  • Extraversion: Characterizes assertiveness and sociability. Group dynamics are frequently led by extroverts, but introverts may be reluctant to express their opinions, which can affect team performance.
  • Agreeableness: Stands for empathy and collaboration. Through mutual respect and understanding, highly agreeable people promote smooth teamwork.
  • Neuroticism: Consists of emotional instability and stress vulnerability. High levels of neuroticism can make it more difficult to adjust and work well with others.
A wide range of psychological concepts, such as temperament, affective disposition, and cognitive orientation, are included in personality assessments. The exact definition and list of personality traits are still up for debate among academics because of their multifaceted nature. Several scholars have responded by putting forth substitute models that build upon or alter the original OCEAN framework. For instance, the Five-Factor structure is expanded upon by the Psychopathic Personality Inventory (PPI) [14], which also introduces improvements in the conceptualization of specific subtraits.
Furthermore, depending on the type and direction of their influence, particular behaviors may be linked to several personality traits. For example, impatience is usually considered to be a negative quality in people who have low agreeableness or conscientiousness scores [15]. Similarly, people with high levels of extroversion and conscientiousness are frequently seen to have leadership tendencies [16]. The intricate and interconnected character of personality-driven behaviors is highlighted by this multifaceted mapping.
The OCEAN model is used in this work because of its broad applicability across disciplines, theoretical stability, and substantial empirical validation. OCEAN, one of the most widely used frameworks in personality psychology [17], provides a solid and thorough basis for simulating individual differences. Its characteristics have proven useful in a variety of settings, from computational modeling to behavioral analysis, which makes it especially well-suited for incorporating personality aspects into adaptive systems and agent-based simulations.
In order to define agent behaviors and reward structures, we used the High-Density Autonomous Crowds (HiDAC) simulation system [18] as the foundational environment. HiDAC, a high-fidelity crowd simulation framework designed to model pathfinding and behavioral dynamics in crowded environments such as museums, serves our goals well.
According to the personality trait that they are supposed to imitate, the agents in our framework are rewarded at the end of each simulation episode. The agent’s performance is guaranteed to reflect the anticipated trait-driven behaviors since the reward assignment is in line with the behavioral patterns specified by HiDAC. Through the mapping of behavior informed by personality to quantitative feedback signals, this method facilitates trait-specific reinforcement learning.
The overall behavior β of an individual agent is a function of the different behaviors that it shows in the game and is defined as follows:
β = (β_1, β_2, …, β_n), where β_j = f(π), for j = 1, …, n
Due to the variability of each personality attribute, the resulting personality score, denoted as Ψ_i, may assume either positive or negative values.
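As a loose illustration of how such a behavior vector and Gaussian influence factors can be combined into a signed score, consider the following sketch; the function names, Gaussian parameters, and example behavior measurements are our own placeholders rather than the exact HiDAC formulation.

```python
import math

# Illustrative sketch (not the exact HiDAC formulation): the overall behavior is a
# vector of per-behavior measurements beta_j, each weighted by a Gaussian
# trait-influence factor G, yielding a personality score Psi_i that can be
# positive or negative.

def gaussian_influence(trait_value, mean=0.5, sigma=0.15):
    """Hypothetical Gaussian weighting of a trait value in [0, 1]."""
    return math.exp(-((trait_value - mean) ** 2) / (2 * sigma ** 2))

def personality_score(behaviors, weights):
    """behaviors: behavior name -> signed measurement beta_j; weights: name -> influence G."""
    return sum(weights[name] * value for name, value in behaviors.items())

# Example with made-up measurements for a strongly extroverted agent profile:
beta = {"push_count": -3.0, "gestures": 5.0, "exploration": 2.0}
G = {name: gaussian_influence(0.8) for name in beta}
psi_i = personality_score(beta, G)
```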

2.2. Related Work

A number of alternative models have been investigated [19], but the OCEAN model is still the most popular personality framework in AI applications because of its simplicity and empirical robustness. Despite doubts about its scientific validity, the MBTI has been used in adaptive interfaces and conversational agents [20]. Early affective computing used Eysenck’s PEN model [21], which highlights the biological foundations of personality. Although it has garnered attention for modeling ethical behavior, the HEXACO model—an extension of OCEAN with an additional honesty–humility dimension—is still not widely used in AI [22]. Finally, the DISC model has been used on occasion, particularly in business applications, in gamified personality simulations and AI coaching [23]. While none of these models challenge OCEAN’s hegemony in empirical AI research, they do demonstrate the variety of methods for incorporating personality into intelligent systems.
Previous research has explored the development of adaptive agents in serious game environments. Notably, a multi-agent system was integrated into the SIMFOR project, a serious game designed for crisis management training [24]. The agents in this system are based on the belief–desire–intention (BDI) deliberation model and include configurable parameters that support scenario design and crisis event simulation. This allows designers to define agent behaviors in accordance with specific crisis management scenarios. While this approach demonstrates sufficient flexibility to be adapted to various serious gaming contexts, its primary aim is scenario construction rather than establishing a standardized methodology for non-player character (NPC) development.
The development of DreamerV3, a general algorithm that outperforms specialized methods in over 150 diverse tasks with a single configuration, is a recent accomplishment of impressive use of DRL for video game play [25]. Using robustness techniques based on normalization, balancing, and transformations to establish learning in multiple domains, DreamerV3 learns a model of the environment and enhances its behavior by imagining future scenarios. When used directly, DreamerV3 is the first algorithm capable of gathering diamonds in Minecraft without the need for human data or curriculum. This has been presented as a major AI challenge that calls for investigating long-term strategies in an open world using sparse rewards and pixels.
Even in intricate 3D maps found in an Ubisoft AAA game created with the Unity engine, a novel method has shown promise [26]. More precisely, a complex architecture was used in conjunction with the SAC algorithm, which minimizes the actor loss function, to train a model-free RL model. A 3D occupancy map, a 2D depth map, absolute goal and agent positions, and additional state variables like relative goal position, speed, acceleration, and prior action are among the various input types that are part of the model’s architecture. To produce the final embedding that is shared by both, each of these input layers is passed through separate feature extraction layers, and the combined output is then fed through multiple linear layers and an LSTM.
Additionally, all NPCs in the prototype roguelike game DeepCrawl are managed by policy networks that have been trained using DRL algorithms [27]. Each player has a different experience in this game as agents learn to keep the player from winning while remaining realistically competitive and still beatable.
In parallel, other works have utilized game environments to simulate and analyze agent behaviors. For instance, [28] examines how a single agent’s movement within a simplified virtual room can reflect behavioral tendencies aligned with the Openness personality trait. Similarly, Refs. [29,30] investigate the potential of escape room gameplay as a means of inferring player personality through interactions with puzzles and riddles. In contrast, our work introduces a flexible agent-based framework that supports single-agent environments, teams of individual agents, and fully multi-agent systems. While personality-driven behavior remains a core focus, the framework also accommodates the development of intelligent NPCs and facilitates debugging and analysis through controlled simulations. Within this gamified environment, agents exhibit complex behaviors informed by multiple personality traits, allowing for the evaluation of both individual and collective performance, as well as system-level dynamics and interactions.

3. Materials and Methods

3.1. Game Mechanics and Environments

We start by describing the main tools used in this project, Unity version 6 [31] and ml-agents version 1 [8], in order to give a thorough grasp of the implementation process. Both tools have been extensively used in the field of deep reinforcement learning (DRL), enabling researchers to load and experiment with various learning environments or create custom ones tailored to specific needs. Unity serves as the simulation environment, while ml-agents is used to integrate machine learning algorithms, allowing for the development and training of intelligent agents. We began the implementation by utilizing assets from the example environments provided in the ml-agents extension package. This initial setup involved creating a simple scene consisting of a single agent game object and a target. The environment is designed to train a doll-like agent to learn to walk towards a dynamically positioned target, as seen in Figure 1.
The next step in creating an agent capable of interacting with the environment is to establish a basic learning environment where the agent can learn fundamental mechanics such as movement, rotation, and object interaction (e.g., doors and targets). This environment consists of a Unity scene containing the agent and various interactive objects. Following this, we developed a more complex environment that simulates a preliminary level design, closely resembling a scenario from RealEscape [32] as seen in Figure 2. This added complexity allows the agent to interact with more dynamic elements, testing its ability to navigate and solve tasks in a more realistic setting. We have set some checkpoints (1 through 4) as can be seen in the image, so that we can quantify the agents’ ability to train and reach the final goal.
Our multi-agent system was implemented as a 3D environment in the Unity platform. Focusing on creating an engaging yet accessible ER experience, we harnessed Unity’s assets to construct a captivating setting.
Our designed environment consists of two buttons (on the walls) that, when activated, unveil a key required for unlocking the final door to escape, as can be seen in Figure 3. The team, consisting of four agents, must navigate through the pillars and past the columns to locate and press buttons.

3.1.1. Action Space

An action in reinforcement learning is a command that the policy generates for the agent to follow in order to change the environment. In our implementation, ml-agents supports discrete, continuous, and hybrid action spaces [33]. Discrete actions are better suited to tasks such as object interaction, whereas continuous spaces are best for controlling physical attributes such as velocity. Multi-discrete spaces are also supported by ml-agents for concurrent categorical decision-making. For the learning algorithm to perform at its best, continuous actions must be normalized within the range −1 to 1. Action masking can be used to increase training efficiency and rule out invalid actions.
We set our agent’s actions as discrete and selected during gameplay, while its rewards are based on the personality trait it emulates. The action space consists of several key capabilities, including movement (forward/backward), rotation (left/right), speed adjustment (normal/fast), and gesture actions (pointing to objects of interest). The agent can also push other agents naturally through physical movement, which results from physical contact rather than being a discrete action. These actions enable the agent to navigate and interact with the environment and other agents, fostering complex behaviors and cooperation.
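For concreteness, the discrete action space described above can be thought of as a set of multi-discrete branches, sketched below; the branch names and option lists are an illustrative encoding, not a dump of our Unity project’s actual configuration.

```python
# Illustrative multi-discrete encoding of the action space described above; the exact
# branch layout and option names in our Unity/ml-agents project may differ.

ACTION_BRANCHES = {
    "move":    ["none", "forward", "backward"],
    "rotate":  ["none", "left", "right"],
    "speed":   ["normal", "fast"],
    "gesture": ["none", "point_to_object"],
}

def decode_action(action):
    """action: a tuple of integer indices, one per branch, as produced by the policy."""
    return {branch: options[index]
            for (branch, options), index in zip(ACTION_BRANCHES.items(), action)}

# Example: (1, 2, 1, 0) -> move forward, rotate right, fast speed, no gesture.
print(decode_action((1, 2, 1, 0)))
```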

3.1.2. State Space

In our implementation, we utilized multiple sensor types, each with unique advantages and trade-offs: grid-based observations, Raycast sensors, and camera sensors, complemented by vector-based data, to give agents a comprehensive understanding of their environment.
Grid-based observations are ideal for capturing a structured 2D spatial representation of nearby objects. Using the ml-agents grid sensor component in Unity, the environment is sampled into a grid format where each cell is assigned a one-hot encoded vector corresponding to the detected object type. This results in a 3D tensor that is well-suited for processing by convolutional neural networks (CNNs), enabling the agent to reason spatially about the object distribution around it. This sensor is particularly valuable for spatial layout awareness and debugging, as it can help identify untagged objects by scanning game levels.
In contrast, Raycast observations provide high-precision detection of specific objects within defined angular ranges and distances. Through the RayPerceptionSensor, rays are projected in multiple directions (15 in total), allowing the agent to detect intersections with tagged objects such as walls, doors, buttons, and other agents. This method gives the agent directional awareness, which is particularly useful for navigation, obstacle avoidance, and interaction tasks requiring precise object localization. Raycasts are computationally inexpensive, making them suitable for baseline agents or simple scenarios (e.g., 1–2 rooms with basic puzzles), particularly when combined with vector observations (e.g., agent position, rotation). However, due to their limited spatial awareness, they are not robust enough for solving complex environments alone.
Camera sensors provide the highest-dimensional and richest input, with pixel-based visual information. This facilitates the development of intelligent agents, such as NPCs, with the same view of the world as human players. Camera-based inputs can help the agent learn more advanced behaviors and solve problems with optimal solutions but slower convergence due to the higher complexity of visual data and increased experience collection requirement.
While Raycasting provides more detailed and direction-sensitive information, which is useful for simulating behaviors, grid-based perception excels at providing spatial layout awareness for debugging purposes. When combined, they provide a strong basis for environmental sensing and have been shown to be the most successful approach. The strengths of each modality—the visual richness of cameras, the directional accuracy of rays, and the spatial context of grids—are advantageous to an agent that employs grid, ray, and camera inputs. Compared to any single-sensor configuration, this multimodal setup produced the best results and the fastest convergence. However, it also led to much longer training times (e.g., more than 2 million steps) and larger model sizes (e.g., .onnx files).
Complementing these spatial techniques, vector observations encode abstract and high-level information as floating-point arrays. These vectors may include the following:
  • One-hot encoded categorical data, where each category is represented by a binary vector (e.g., detecting a “Door” from [Target, Door, Wall, Obstacle] results in [0, 1, 0, 0]);
  • Normalized numerical values, scaled to ranges like [0, 1] or [−1, 1] to ensure consistent contribution to learning;
  • Stacked temporal data, which aggregates past observations to simulate short-term memory, aiding in the detection of dynamics like motion and acceleration.
Together, these three observation types—grid-based, Raycast, and vector—allow agents to perceive their surroundings at multiple levels of detail and abstraction, supporting more sophisticated behavior and decision-making. Rewards during training are assigned based on agent-environment interactions, as detailed in the following section.
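The following sketch illustrates how the three vector-observation ingredients listed above (one-hot categories, normalized values, and stacked temporal data) can be assembled; the object categories, value ranges, and stack depth are examples, not our exact sensor configuration.

```python
import numpy as np

# Illustrative vector-observation builder combining one-hot categorical data,
# normalized numerical values, and a short stack of past observations.
# Categories, ranges, and stack depth are placeholders, not our exact setup.

CATEGORIES = ["Target", "Door", "Wall", "Obstacle"]
STACK = 3

def one_hot(category):
    vec = np.zeros(len(CATEGORIES), dtype=np.float32)
    vec[CATEGORIES.index(category)] = 1.0          # e.g., "Door" -> [0, 1, 0, 0]
    return vec

def normalize(value, low, high):
    """Scale a raw value to [-1, 1] so it contributes consistently to learning."""
    return 2.0 * (value - low) / (high - low) - 1.0

def build_observation(detected, speed, history):
    current = np.concatenate([one_hot(detected), [normalize(speed, 0.0, 10.0)]])
    history.append(current)                        # stacked temporal data as short-term memory
    stacked = history[-STACK:]
    while len(stacked) < STACK:                    # zero-pad at the start of an episode
        stacked = [np.zeros_like(current)] + stacked
    return np.concatenate(stacked)

obs = build_observation("Door", speed=4.2, history=[])
```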

3.1.3. Rewards

One of the central challenges in many Markov decision processes (MDPs) is the problem of sparse rewards, where agents receive non-zero feedback only in a limited number of states. This issue is particularly prevalent in escape room (ER) scenarios, where significant rewards are typically granted only upon successful completion of the task (i.e., when all agents escape). As a result, agents may struggle to associate their actions with long-term outcomes, hindering effective policy learning.
To address this, our reward structure is decomposed into two components: (1) task-oriented rewards associated with solving the ER game collaboratively and (2) individual behavior-based rewards that encourage personality-driven or trait-aligned behaviors. These two reward streams are monitored independently. The former captures the overall team performance, assessing whether the agents collectively manage to escape the room, while the latter provides individualized feedback based on each agent’s exhibited behaviors, as elaborated in subsequent sections.
Simple rewards are assigned throughout the training process to reinforce interactions with critical environmental elements. For example, positive rewards are given for actions such as unlocking doors or pressing buttons within an optimal time frame, while negative rewards are assigned for inefficient navigation. The reward system is specifically time-based, encouraging agents to act efficiently and make prompt, effective decisions to maximize their cumulative rewards as seen in Table 1.
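A minimal sketch of such a time-based shaping term is given below; the event names and constants are placeholders chosen for illustration, not the exact values listed in Table 1.

```python
# Illustrative time-based task reward: positive feedback for key interactions completed
# within an optimal time window, small penalties for inefficient navigation.
# Event names and constants are placeholders, not the exact values of Table 1.

def task_reward(event, elapsed_time, optimal_time=60.0, step_penalty=-0.001):
    if event in ("door_unlocked", "button_pressed"):
        # Larger reward the sooner the interaction happens, floored at a small bonus.
        return max(0.1, 1.0 - elapsed_time / optimal_time)
    if event == "collision_with_wall":
        return -0.05                                # discourage inefficient navigation
    return step_penalty                             # per-step cost encourages prompt decisions
```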
Within our system, we created trait-aligned reward functions that connect directly to the behavior representation defined by HiDAC, building on its architecture, which formalizes an agent’s overall behavior as a vector β. Our reward functions translate these discrete behavioral elements into scalar feedback signals, weighted by a personality influence factor G that is represented as a Gaussian. Thanks to this direct mapping, agents are rewarded for trait-specific behavioral patterns that align with HiDAC’s behavioral taxonomy. By adjusting parameters such as G_E for extroversion or G_A for agreeableness, we can tilt the reinforcement learning process toward the emergence of the desired personality-aligned behaviors. As a result, agents are rewarded at the conclusion of each simulation episode based on how closely their displayed behavior vector β matches the desired personality traits. By integrating HiDAC’s theoretical behavior model with our custom reward framework, this method ensures that the learned agent policies not only maximize task performance but also exhibit the subtle behavioral signatures expected from their assigned personality. A compact sketch of this reward combination follows the formal definitions below.
Previously, we defined a multi-agent Markov decision process (MDP) as a tuple (S, A, T, R, γ), where
  • S is the set of states;
  • A is the set of actions;
  • T(s, a, s′) is the transition function;
  • γ ∈ [0, 1] is the discount factor;
  • R_i(s, a) = R_i^Task(s, a) + R_i^τ(s, a) is the reward function for agent i;
and R_i^τ(s, a) is a trait-conditioned reward shaped by the agent’s dominant personality trait τ ∈ {Openness, Extraversion, Agreeableness}.
  • Trait-Dependent Reward Definitions:
  • Openness
R_i^Openness(s, a) =
    α_1,                          if the agent discovers a new state
    α_2 · CorrectExplore(s, a),   for correct exploration behavior
    0,                            otherwise
  • Extraversion
Let G_E ∈ [0, 1] be the extroversion Gaussian score.
R_i^Extraversion(s, a) =
    0.3 · MeanSpeed(s) · G_E,                                            if CommActions(s, a) ≥ G_E and G_E ≥ 0.5
    0.6 · G_E,                                                           if G_E > 0
    1,                                                                   if PushCount(s, a) ≥ 0.3 · G_E and G_E ≥ 0.5
    WalkSpeed(s) + (1/10) · CorrectGestures(s, a) − ColliderPenalty(s)
    0,                                                                   otherwise
  • Agreeableness
Let G_A ∈ [0, 1] be the agreeableness Gaussian score.
R_i^Agreeableness(s, a) =
    0.3 · (1 − G_A),                                        if PushCount(s, a) ≥ 0.3 · (1 − G_A) and G_A ≤ 0.5
    0.3 · (RightSideUse(s) / t) · G_A − ColliderPenalty(s)
    0,                                                      otherwise
  • Final Reward
R_i(s, a) = R_i^Task(s, a) + R_i^τ(s, a)
where R_i^Task captures general task success (e.g., escape performance), and R_i^τ incentivizes personality-aligned behaviors.
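To make the combination concrete, the sketch below shows how the dual reward R_i = R_i^Task + R_i^τ could be assembled from episode-level behavior metrics; only the openness term is spelled out, and its coefficients, metric names, and registry structure are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch of the dual reward R_i = R_task + R_trait, assuming the agent's
# dominant trait selects one trait-specific shaping term. Coefficients (alpha1, alpha2)
# and metric names are illustrative placeholders.

def openness_shaping(metrics, alpha1=0.5, alpha2=0.1):
    if metrics.get("new_state_discovered", False):
        return alpha1
    if metrics.get("correct_explore", 0.0) > 0.0:
        return alpha2 * metrics["correct_explore"]
    return 0.0

TRAIT_SHAPING = {
    "openness": openness_shaping,
    # "extraversion" and "agreeableness" would register their own piecewise terms here.
}

def total_reward(task_reward, trait, metrics):
    shaping = TRAIT_SHAPING.get(trait, lambda m: 0.0)
    return task_reward + shaping(metrics)
```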

3.1.4. Training Methodology

As stated in the previous section, we first developed a single agent that attempts specific tasks, such as pressing buttons and opening doors, with the goal of escaping, in both a simple and a complex environment, and we varied the sensor configurations to see how the results would change.
Then, using the best-performing sensors, we developed a multi-agent system, first without the custom-made reward system and then with it: we implemented reward functions specific to certain behaviors, which enabled us to train the agents to emulate distinct behaviors and personality traits.
To preserve interpretability and separate the impact of individual traits on behavior, each agent was given a single dominant personality trait. Even though real people usually combine a variety of personality traits, simulating several traits per agent would have greatly increased complexity and decreased analytical clarity. We were better able to see how a particular trait affects decision-making and group dynamics by concentrating on just one dominant trait per agent.
Based on their high behavioral expressiveness in group interactions, we chose four trait profiles: agreeable, non-agreeable, introverted, and extroverted. The traits of neuroticism and conscientiousness were purposely left out because they are more inwardly focused and less closely related to the observable, interactive behaviors we sought to model and test. We assessed the behaviors typically linked to each personality trait rather than measuring the traits directly, which allowed us to test their effects in team-based situations in a methodical manner. Because each team had four agents and the number of trait combinations scales quickly, we present a representative subset of team configurations that captures the most instructive behavioral variations.
We began by training four baseline teams, each of which had an agent with the same personality. Before examining mixed-trait dynamics, this provided us with a solid basis for comprehending the performance of homogeneous personality groups.
Each agent is given a personality trait before training, which establishes the parameters of a matching Gaussian distribution used to shape its inclinations. Agents are immersed in the game environment during training and receive rewards at the end of each episode according to performance metrics that match their individual personality-driven goals. The process follows these steps (a compact sketch is given after the list):
  • We choose the specific trait we want the agent to emulate (e.g., extroversion);
  • We set the Gaussian of each trait so that only the selected one is active (G_O = 0, G_C = 0, G_E = 1, G_A = 0, G_N = 0);
  • During training, at the end of each episode:
    • We collect behavior action metrics (e.g., push behavior = number of push actions);
    • We reward the agent by multiplying each behavior metric with the corresponding Gaussian (e.g., R = PushCount × (G_E + G_A)).
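A compact sketch of this episode-end procedure is shown below; the profile encoding and the push-based example mirror the steps above, while the dictionary keys and helper names are our own illustrative choices.

```python
# Sketch of the episode-end trait reward assignment described above. The trait profile
# is one Gaussian weight per OCEAN dimension; the push-based example mirrors
# R = PushCount * (G_E + G_A). Names are illustrative.

def trait_profile(dominant):
    """Activate only the selected trait's Gaussian (e.g., 'E' for extroversion)."""
    profile = {"O": 0.0, "C": 0.0, "E": 0.0, "A": 0.0, "N": 0.0}
    profile[dominant] = 1.0
    return profile

def episode_end_reward(behavior_metrics, profile):
    return behavior_metrics["push_count"] * (profile["E"] + profile["A"])

agent_profile = trait_profile("E")                              # emulate extroversion
reward = episode_end_reward({"push_count": 4}, agent_profile)   # -> 4.0
```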
To find configurations that maximize agent performance during training, we carried out extensive experiments across a wide range of hyperparameter values, as indicated in Table 2. For example, learning rates varied from 0.003 to 0.005, buffer sizes from 64,000 to 516,000, and batch sizes from 64 to 2048. Likewise, we experimented with different hidden unit sizes (256 to 512), numbers of layers (2 to 3), and training epochs (3 to 5) in order to vary the model’s complexity. This thorough sweep let us map the performance landscape in both low- and high-capacity settings while keeping a consistent tuning approach: lean, efficient training for the simpler environments and more capacity with cautious updates for the complex, interactive ones.
We found that single-agent setups and multi-agent teams displaying personality traits required significantly different optimal configurations. This disparity makes sense, since multi-agent systems, especially those that simulate social dynamics such as agreeableness or extroversion, inevitably entail more intricate interactions, coordination behaviors, and emergent dynamics; they therefore need larger networks and more substantial training buffers to learn efficiently, whereas simpler agents converged effectively with lighter configurations.
More specifically, moderate parameters, such as a batch size of 28, a buffer size of 128,000, and a learning rate of 0.005, worked best for single agents, as seen in Table 2, indicating that a balance between learning stability and exploration was adequate for isolated behavior modeling.
On the other hand, more aggressive and resource-intensive configurations were needed for the multi-agent system. With a slightly lower learning rate of 0.004 and a significantly larger batch size of 2048 and buffer size of 256,000, optimal performance was achieved. This modification most likely reflects the emergent coordination in socially diverse teams as well as the increased complexity and non-stationarity brought about by inter-agent interactions. Furthermore, multi-agent teams benefited from deeper architectures with three layers, 512 hidden units, and extended training over five epochs, whereas single agents performed well with two hidden layers and three training epochs. The intuition that multi-agent environments, particularly those that incorporate personality traits like agreeableness or extroversion, require more training iterations and richer representations to capture the dynamics of social coordination is supported by these findings.
Since the single agents’ task demands were simpler and the learning space less complicated, these setups, with their less complex behavior requirements, converged effectively with smaller networks and smaller replay buffers. Multi-agent teams, on the other hand, had to learn not only how to complete their individual tasks but also how to react adaptively to the actions of others in the environment, which increases the complexity of the learning problem. Larger neural network architectures (i.e., more hidden units) helped these teams better capture non-linear dependencies and social behavior patterns, while larger batch sizes and replay buffer capacities stabilized training by exposing the agents to a wider variety of experiences; this enhanced generalization across a range of social contexts and prevented overfitting to recent experiences. The need to capture and optimize the complex, socially influenced behavior patterns inherent to personality-driven multi-agent teams therefore justified the larger training parameters and the additional computational resources.
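For reference, the best-performing settings reported above can be summarized as follows; we list them here as plain Python dictionaries for readability, whereas in practice they live in the ml-agents YAML trainer configuration, whose exact layout we omit. The single-agent hidden-unit count is assumed to sit at the lower end of the 256–512 sweep.

```python
# Best-performing trainer settings as reported in the text (see Table 2), summarized as
# dictionaries. In practice these values are set in the ml-agents YAML configuration.

single_agent_config = {
    "batch_size": 28,            # as reported for single-agent runs
    "buffer_size": 128_000,
    "learning_rate": 0.005,
    "hidden_units": 256,         # assumed: lower end of the 256-512 sweep
    "num_layers": 2,
    "num_epoch": 3,
    "max_steps": 2_000_000,      # two million steps per single-agent run
}

multi_agent_config = {
    "batch_size": 2048,
    "buffer_size": 256_000,
    "learning_rate": 0.004,
    "hidden_units": 512,
    "num_layers": 3,
    "num_epoch": 5,
    "max_steps": 10_000_000,     # ten million steps per multi-agent team
}
```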
All single agents were trained for two million steps in each two-hour training session, and multi-agent teams were trained for ten million steps in eight hours. In order to evaluate the agents’ performance and gain a better understanding of how they were interacting and learning in the environment, we looked at (group) rewards, average episode durations, and escape success rate.

4. Results

In the following section, the results from the training of both the single-agent and multi-agent team environments will be shown.

4.1. Single-Agent Results

The results of the single-agent experiments in a simple and complex environment, as shown in Table 3, demonstrate how sensor configurations have a major influence on task effectiveness and success rate (the escape of the agent).
We tested agents with different sensor configurations in our experiments on the simple map. With a mean escape time of 410 s and a 53% success rate, the baseline Raycast agent needed 400,000 training steps. Performance was greatly enhanced by adding a grid sensor: the Raycast + Grid agent converged after 500,000 steps, completed 90% of the trials successfully, and cut the average completion time to 280 s. These findings demonstrate the advantages of spatial context, even in settings that are comparatively simple. Visual input led to further improvements. Despite the increased complexity of visual processing, the Raycast + Camera agent outperformed the Raycast-only baseline, finishing tasks in 331 s with a 66% success rate after 600,000 training steps. The Raycast + Grid + Camera setup produced the best results on the simple map. With only 200,000 steps needed, this agent maintained a 73% success rate, an average escape time of 309 s, and a strong convergence speed-performance balance. These results highlight how crucial multi-sensor fusion is for speeding up learning and enhancing task performance, even in simpler contexts.
Combining sensors allows the agent to make more informed decisions by providing it with richer, more accurate information about the surroundings (such as obstacles, exits, and teammates). Raycasts by themselves, for instance, are directional and sparse. A camera provides visual semantics, and the addition of a grid sensor offers spatial awareness (such as an overhead view of adjacent tiles). When combined, they assist the agent in deciphering situations (e.g., where to move, what is an obstacle, and who is a teammate). The agent’s state representation is more in line with the actual dynamics of the environment when they have a deeper comprehension of it, which increases the significance and learnability of the reward signal. Therefore, the agent learns more quickly and explores more efficiently with more informative inputs, which directly enhances performance and rewards. It also eliminates the need for the agent to “guess” as much.
Escape failures can frequently result from a limited field of view and insufficient environmental feedback in configurations with little sensory input, such as Raycast-only agents, which can make it difficult to locate exits precisely. Agents may repeat ineffective behaviors, become stuck in loops, or overlook important paths as a result of inadequate exploration techniques or subpar early learning. Furthermore, conflicting or unclear sensory inputs can cause indecision or delayed actions if they are not properly fused, even when using richer sensors like cameras or grid detectors. These problems may be made worse by structural difficulties in the surroundings, such as blocked exits or labyrinthine layouts. We tried to mitigate some of these problems by imposing a time limit per episode, after which the environment and the agents’ positions are reset.
We moved to the more complicated environment to continue the agent’s learning after it had finished its training in the simplified one. Only the Raycast and grid sensors were kept in order to maximize training stability and efficiency; graphical rendering was left out to cut down on computational overhead. By redefining checkpoints as target objectives, which correspond to reaching and opening doors in the complex environment, the task was restructured to help the agent navigate increasingly complex spatial and interactive challenges.
The agent learns through a series of lessons that become harder over time. In each lesson, it repeats parts of what it has already learned, which helps strengthen its skills and improve gradually. For example, in Lesson 1, the episode ends when the agent reaches the first checkpoint. To move to the next lesson, the agent must meet a randomized reward goal within the last 100,000 training steps until the next checkpoint.
The success criteria for each lesson are as follows (a simple advancement check is sketched after the list):
  • Lesson 1: Passed if the agent succeeds in over 90% of the last 100,000 steps.
  • Lesson 2: Same as Lesson 1—over 90% success.
  • Lesson 3: Passed if success rate is over 50% in the last 100,000 steps.
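The advancement rule can be expressed as a simple rolling-window check, sketched below; the bookkeeping is simplified and the class and method names are illustrative, not part of the ml-agents curriculum API.

```python
from collections import deque

# Simplified sketch of the lesson-advancement rule above: advance when the success rate
# over the most recent 100,000 training steps clears the lesson's threshold.

LESSON_THRESHOLDS = {1: 0.90, 2: 0.90, 3: 0.50}
WINDOW_STEPS = 100_000

class CurriculumTracker:
    def __init__(self):
        self.lesson = 1
        self.window = deque()                      # (step, success) pairs in the rolling window

    def record_episode(self, step, success):
        self.window.append((step, int(success)))
        while self.window and self.window[0][0] < step - WINDOW_STEPS:
            self.window.popleft()                  # drop episodes outside the window
        threshold = LESSON_THRESHOLDS.get(self.lesson)
        if threshold is not None and self.window:
            rate = sum(s for _, s in self.window) / len(self.window)
            if rate >= threshold:
                self.lesson += 1                   # move on to the next, harder lesson
```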
The last lesson marked a key turning point, since even though the agent’s performance was lower, it still met the success requirement. It also began earning higher rewards by regularly reaching Checkpoint 3 in the final room. This shows that steady learning and repetition led to long-term improvement.
The agent’s goal changed to finishing the entire room in the last lesson, with no clear success criteria. At this point, learning was instead driven by ongoing interaction and reward optimization, leading to continuous gains in performance. With an average completion time of 320 s and a success rate of 79%, performance improved on the complex map.
These results suggest that these types of agents are highly suitable for application as NPCs in complex environments, such as serious games or simulations. Using the combination of different sensor modalities, grid-based, Raycast and camera inputs, these agents can be programmed for diverse behavior, personality, and decision-making logic that model real-world dynamics very effectively. Their presence not only contributes to the realism and engagement of the game world but also provides a useful debugging tool throughout development.
Grid sensors, for instance, can reveal missing or misconfigured object tags by scanning the environment, and agent behaviors can expose edge cases or unexpected interactions. This dual role, as both an actor in the game world and a diagnostic instrument, enables developers to iterate faster and build more robust systems. In addition, the agents generate valuable information on NPC interaction, environment response, and performance metrics, informing both gameplay balance and system-level research.

4.2. Multi-Agent Results

Table 4 summarizes the results of our evaluation of different agent team configurations for the multi-agent environment. First, we trained teams of only agents with and without openness, as well as a default team of agents with no particular personality traits. These were used as baselines to determine reference points for subsequent personality-based assessments as well as to evaluate the relative effectiveness of our training methods.
We then trained teams to display particular personality traits, emphasizing extroversion and agreeableness. These characteristics were specifically picked because they are highly relevant to collaborative dynamics and team-oriented behaviors, which makes them especially useful for examining multi-agent cooperation.
Lastly, we trained teams of agents with a variety of personality types, including extroverted, introverted, agreeable, and non-agreeable agents, to better represent real-world personality distributions, such as a 25% introvert to 75% extrovert ratio. This gave us the opportunity to look into how various trait combinations affect the dynamics and performance of the team as a whole.
Figure 4 shows the results of the optimal default multi-agent team (without simulating behaviors). The majority of the time, it learns to solve the room, but occasionally, as the reward drops show, some agents were unable to escape. This is related to the ever-changing environment.
When compared to the default agents, the team with the highest openness scores lower rewards, as seen in Table 4. The main cause of this performance discrepancy is the high openness agents’ inclination to conduct more thorough exploration. Their insatiable curiosity and yearning for novel experiences cause them to spend more time exploring their surroundings, which may be beneficial for learning but leads to less effective task completion.
Both the extrovert-only (orange) and introvert-only (pink) teams, as shown in Figure 5, learn to solve the room rather quickly, but they show occasional drops in reward (i.e., not all agents were able to escape). Their sometimes impatient behavior and poor cooperation may be the cause of this. Looking at their average episode duration, we observe that introverts (pink) take longer to complete the room because, according to the literature, they tend to move more slowly (Figure 6).
The non-agreeable agents (cyan) were a little less efficient but learned to solve the room slightly earlier, showing some fluctuation in results related to their impatience. In Figure 7, the team with only agreeable (red) agents takes some time to learn to solve the room but also has some drops in reward. We must also highlight that teams of agreeable (red) and non-agreeable (cyan) agents take almost the same amount of time to solve the room, with a fluctuation after a while, as illustrated in Figure 8. This seems to happen because their actions are associated with patience rather than speed.
As demonstrated in Figure 9, the results indicate that although the three extrovert and one introvert (green) team initially collaborates effectively and is highly stable, they eventually cease doing so. This occurs because introverts are more patient and avoid interacting with other players, whereas extroverts are constantly rushing to push each other and receive the rewards. The outcomes of a team consisting of three introverts and one extrovert (gray) that employs reverse personality traits are nearly the same as those of a team with only one introvert agent, albeit with a little less success. This implies that one extrovert has a negative impact on three introverts, but not as much as one introvert among three extroverts. This occurs because if the extroverts ran away and waited for the introvert, the team would slow down. However, if the extrovert stays, they will help the introverted person finish faster and more effectively.
In the same manner, as seen in Table 4, the results show a complex trade-off between teamwork and individual productivity when comparing agreeable and non-agreeable personality traits. Agreeable agents had a higher success rate of 37%, suggesting more frequent team escapes, but they also typically received slightly lower rewards (1819) and took longer to escape (440 s). Non-agreeable agents, on the other hand, demonstrated a propensity for self-serving behavior that compromises team coordination, as evidenced by their faster performance (430 s) and higher reward accumulation (2003), but their much lower success rate of 27%.
In mixed team configurations, this dynamic is further demonstrated. In contrast to its inverse setup with three non-agreeable and one agreeable agent (30%), a group consisting of three agreeable agents and one non-agreeable agent performed better (35% success rate). According to these findings, a cooperative team may be marginally hampered by one non-agreeable agent, but the group’s collaborative tendencies partially offset or compensate for their behavior. Cooperative behavior does not spread, and the team’s coordination suffers when one agreeable agent is paired with a largely uncooperative group. In the end, these results highlight how cooperation can be brittle when outnumbered, but it still stabilizes mixed dynamics.
Having collected a large amount of data from the agents (about 10 million steps per team) and relying on the Central Limit Theorem’s assumption of normality, we performed a number of parametric statistical tests to assess the effect of personality composition on team performance. Our analysis focused on three primary metrics: success rate (the percentage of episodes in which all agents escaped), mean reward, and mean escape time.
A one-way ANOVA analysis showed that personality had a significant impact on rewards, and post-hoc tests showed that teams with openness had higher rewards but lower success rates (20% vs. 30% for extroversion). Mixed teams performed better: although agreeableness mixtures did not significantly improve performance (p > 0.05), three introverts and one extrovert produced 48% success compared to 27% for default teams. Despite the fact that all comparisons were statistically significant (p < 0.05 after Bonferroni correction), we emphasize practical effects. The three introverts and one extrovert configuration showed the strongest improvement (risk ratio = 1.78 for success rates), despite the fact that openness teams displayed a reward–reliability trade-off (higher mean rewards but 10% lower success compared to extroversion). These results suggest that introvert-dominated compositions maximize escape outcomes, whereas homogeneous openness teams may prioritize short-term gains over mission completion.
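As an illustration of the kind of test used here, the snippet below runs a one-way ANOVA over per-episode rewards grouped by team composition with SciPy; the reward arrays are synthetic stand-ins for the logged metrics, not the study’s actual data, and the choice of SciPy is our assumption.

```python
import numpy as np
from scipy import stats

# Illustrative one-way ANOVA on per-episode rewards grouped by team composition.
# The arrays below are synthetic placeholders, not the study's logged data.

rng = np.random.default_rng(0)
rewards_by_team = {
    "default":                 rng.normal(1900, 150, size=200),
    "openness":                rng.normal(2000, 180, size=200),
    "3_introvert_1_extrovert": rng.normal(1950, 140, size=200),
}

f_stat, p_value = stats.f_oneway(*rewards_by_team.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# Post-hoc pairwise comparisons would then be Bonferroni-corrected, e.g. by dividing the
# significance level alpha by the number of pairwise tests performed.
```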
The dual-reward structure employed in this study and the intricate dynamics of multi-agent coordination are responsible for the comparatively low success rates across the majority of personality-based teams. Each agent in our system is rewarded from two different sources: one that is connected to completing a task together (e.g., escaping the room as a team) and another that is connected to displaying behaviors that are specific to their personalities. Although this design is helpful for examining the influence of personality traits, it presents a trade-off: agents may favor trait-consistent behaviors over those that would result in group success, even if they are not the best for teamwork. For instance, agreeable agents may over-yield or fail to assert themselves in crucial situations, whereas extroverted agents may act more independently or aggressively (e.g., pushing), potentially upsetting coordinated strategies.
Even minor teamwork errors, such as blocking, hesitation, or poor route selection, can reduce the overall success rate because a trial cannot be deemed successful until all four agents escape. Because common behavioral biases (like a high level of exploration but a low level of follow-through) may worsen rather than counteract these tendencies, the negative effect is higher in teams with uniform personalities. In the end, the interplay between personality-driven incentives and group-level objectives makes full-team escapes especially challenging, especially in teams with limited coordination or trait diversity.
The multi-agent results suggest that the personality traits do significantly influence the group’s effectiveness and their ability to escape quickly. This suggests that reward systems and their setup can, in principle, replicate the traits and behaviors seen in real life.

5. Discussion

We used various perceptual configurations, specifically Raycast and grid-based sensors, to assess agent performance in escape room scenarios during the single-agent phase of our experiments. The agent’s navigation strategy, decision-making ability, and overall task efficiency were all significantly impacted by the sensor selection.
Even in single-agent scenarios, our results highlight how crucial sensor modality and agent design are in determining task performance. Every kind of sensor has special benefits that are matched with particular gameplay mechanics and environmental requirements.
Adding to this, there is a great deal of room for intelligent NPC design when agents are integrated into intricate settings like ERs. It is possible to program agents with multiple sensors (grid, ray, and camera) to mimic a wide range of characteristics, behaviors, and decision-making models. In addition to improving immersion and realism, this yields useful interaction data that can guide system assessment and gameplay adjustments.
Furthermore, deep RL agents were used to simulate human-like, personality-driven decision-making in order to assess the game’s efficacy. These agents relied on advanced techniques such as MA-POCA to refine their strategies through trial and error as they interacted with their surroundings and adjusted their behaviors.
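For reference, the sketch below assembles an ML-Agents-style trainer configuration for the MA-POCA runs as a Python dictionary and dumps it to YAML. The batch size, buffer size, learning rate, network width and depth, and epoch count are taken from Table 2 (multi-agent column); the behavior name, max_steps, and the overall file layout are assumptions based on the standard ML-Agents configuration format, not the authors’ exact file.

```python
# Sketch of a trainer configuration for the multi-agent MA-POCA experiments.
# Values marked "Table 2" come from the paper; everything else is assumed.
import yaml

config = {
    "behaviors": {
        "EscapeRoomAgent": {             # assumed behavior name
            "trainer_type": "poca",      # MA-POCA trainer in ML-Agents
            "hyperparameters": {
                "batch_size": 2048,      # Table 2 (multi-agent)
                "buffer_size": 256_000,  # Table 2 (multi-agent)
                "learning_rate": 0.005,  # Table 2 (multi-agent)
                "num_epoch": 5,          # Table 2 (multi-agent)
            },
            "network_settings": {
                "hidden_units": 512,     # Table 2
                "num_layers": 3,         # Table 2 (multi-agent)
            },
            "max_steps": 2_000_000,      # assumed training budget
        }
    }
}

print(yaml.safe_dump(config, sort_keys=False))
```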
Therefore, in the gamified environment where agents must cooperate to complete a task, there is evidence that the behaviors and characteristics of each team member affect team efficiency. The results also show that the metrics and incentives we configured can influence the agents’ behavior and help them adopt a different personality. These findings can inform new reward systems that produce agents with similar behaviors and set benchmarks for other types of games.

6. Conclusions

In this work, we introduced a reinforcement learning-based framework for modeling personality-driven behavior in a gamified 3D escape room setting, applicable to both single-agent and multi-agent scenarios. The objective of our method was to explore custom reward functions derived from the OCEAN Five personality model for training agents with distinct behavioral patterns that correspond to specific personality profiles. Rather than being manually scripted, the agents learn complex behaviors that emerge from their personality-aligned incentives through trial-and-error interaction with the environment.
We used Raycast, grid-based, and camera sensors, each providing a different degree of environmental awareness, to train agents in the single-agent setup. Raycast sensors allowed rapid, reactive navigation with high-resolution directional input, well suited to avoiding obstacles and close-range interactions, while grid-based sensors supported structured exploration and long-term planning through a discretized spatial view. Rich visual input from camera sensors enabled more complex, context-aware behaviors but required longer training because of its complexity. These findings demonstrate that an agent’s sensory configuration has a significant impact on its learning effectiveness, strategy, and task suitability, with each modality performing best in different situations.
To build on this, we extended the system to a multi-agent environment in which teams of RL agents, each representing a distinct personality trait, had to cooperate to solve the escape room. The agents’ behavior was shaped by personality-aligned reward functions, with particular focus on traits closely related to communication and teamwork, such as agreeableness and extroversion. Through extensive self-play, the agents improved their coordination and learned to adapt their behavior to the roles and styles of their teammates. Our findings showed that team composition had a noticeable impact on overall performance: agreeable agents promoted quicker, better-coordinated teamwork, while introverted agents were competent but completed tasks more slowly. Although mixed-trait teams initially struggled to collaborate, they eventually developed emergent synergies in certain situations.
These results support the use of reinforcement learning not only as a tool for optimizing agent behavior but also as a technique for modeling complex social dynamics and personality-driven interaction, making personality emulation more realistic [34].
Our experiments showed that agents with richer perceptual input performed noticeably better, especially those that combined visual data, grid sensors, and Raycasts. In contrast to the Raycast-only baseline, which took twice as long (400 k steps) and only achieved 53% success, a Raycast + grid + camera agent converged in just 200 k steps on the basic map, achieving a 73% success rate. This demonstrates how multi-modal sensor fusion can speed up learning and increase task completion rates.
Personality-specific team configurations further demonstrated how behavioral traits impact group performance in the multi-agent setup. Uniform teams made up of agents with the same personality frequently performed worse than mixed ones, because imbalanced tendencies make cooperation difficult. For instance, a mixed configuration of three introverts and one extrovert achieved a 48% success rate, compared to 27% for the all-default team. Interestingly, teams with one agreeable agent and three non-agreeable agents earned higher rewards (1991) than their inverse (1740), yet the latter had a higher success rate, indicating that maximizing reward does not always coincide with coordinated team escape. These findings highlight the tension in multi-agent systems between task-oriented collaboration metrics and rewards based on individual behavior.
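As a quick check on the effect size quoted earlier, the risk ratio for the best mixed team is simply the ratio of the two success rates; the two-line calculation below reproduces the reported value of 1.78.

```python
# Success rates from Table 4: 3 introverts + 1 extrovert vs. the all-default team.
success_mixed, success_default = 0.48, 0.27
print(round(success_mixed / success_default, 2))  # -> 1.78
```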
By using reward functions to shape behavior, we can simulate human traits without relying on biased or manually tuned scripts. Practical applications include the creation of intelligent agents for automated testing, behavior simulation, and even support for decision-making in human resource management.
Despite the encouraging outcomes, several limitations should be acknowledged. First, while real human personalities are complex and context-sensitive, personality modeling in this work is simplified by assigning each agent a single dominant trait. Each trait’s behavior is determined by custom reward functions, which, although useful, can limit generalization and introduce design bias. The environments, such as basic grid-based escape maps, also lack the complexity of real-world scenarios, further limiting the richness of social interaction and situational variability. Even though agents display trait-aligned behaviors, these interactions remain task-bound and low-level, without support for higher-order social reasoning skills such as empathy, communication, or trust.
Additionally, although the results can be interpreted through aggregate metrics such as reward or success rate, they have not been validated by human observers to determine whether the intended personality traits are perceived as realistic or consistent. Training multi-agent systems with rich sensory input and trait-driven objectives also imposes nontrivial computational demands, which limit scalability and require substantial resources. These limitations call for future research on more complex social tasks, human-in-the-loop evaluation, and more nuanced personality blending.
In future research, this framework can be extended to incorporate additional OCEAN traits and behaviors or other personality models, more intricate action spaces, and environments that more accurately represent the limitations and unpredictability of the real world. Increasing the expressiveness of reward systems will also enable more precise regulation of emergent behaviors. In addition, incorporating information from actual psychological tests into the training procedure may improve the validity and realism of personality emulation. In order to enhance ecological validity, future research may also investigate the use of real-world behavioral datasets, such as sensor data from actual environments or observational logs of human interactions. To gain a deeper understanding of group dynamics and artificial social intelligence, the framework can also be expanded to competitive or adversarial multi-agent scenarios, where agent personality influences cooperation and conflict.
In conclusion, this work demonstrates how reinforcement learning can be used to create personality-aware agents that can both learn to solve tasks and model intricate human-like behaviors. By combining structured learning, multi-agent interaction, and behavioral emulation into a single framework, we provide a scalable approach to studying and applying personality in intelligent systems.

Author Contributions

Conceptualization, G.L. and I.V.; methodology, G.L., I.V., S.N. and A.V.; software, G.L., I.V., S.N., and A.V.; validation, G.L., I.V., S.N., and A.V.; formal analysis, G.L., I.V., S.N., and A.V.; investigation, G.L., I.V., S.N., and A.V.; resources, G.L., I.V., S.N., and A.V.; data curation, G.L., I.V., S.N., and A.V.; writing—original draft preparation, G.L. and I.V.; writing—review and editing, G.L., I.V., S.N., and A.V.; visualization, G.L. and I.V.; supervision, I.V.; project administration, G.L. and I.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
ER: Escape Room
RL: Reinforcement Learning
MDP: Markov Decision Process
MA-POCA: Multi-Agent Posthumous Credit Assignment
OCEAN: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism
MA-A2C: Multi-Agent Advantage Actor–Critic

References

  1. Dicheva, D.; Dichev, C.; Agre, G.; Angelova, G. Gamification in Education: A Systematic Mapping Study. Educ. Technol. Soc. 2015, 18, 75–88. [Google Scholar]
  2. Bakar, M.H.A.; McMahon, M. Collaborative problem solving and communication in virtual escape rooms: An experimental study. J. Interact. Learn. Res. 2021, 32, 135–152. [Google Scholar]
  3. Yannakakis, G.N.; Togelius, J. Artificial Intelligence and Games; Springer: New York, NY, USA, 2025. [Google Scholar]
  4. Kapadia, M.; Shoulson, A.; Durupinar, F.; Badler, N.I. Authoring Multi-actor Behaviors in Crowds with Diverse Personalities. In Modeling, Simulation and Visual Analysis of Crowds; Ali, S., Nishino, K., Manocha, D., Shah, M., Eds.; The International Series in Video Computing, Vol. 11; Springer: New York, NY, USA, 2013; pp. 113–132. [Google Scholar] [CrossRef]
  5. Ghasemi, M.; Ebrahimi, D. Introduction to Reinforcement Learning. arXiv 2024, arXiv:2408.07712. [Google Scholar] [PubMed]
  6. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; Adaptive Computation and Machine Learning Series; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  7. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed.; Pearson: Boston, MA, USA, 2020. [Google Scholar]
  8. Juliani, A.; Berges, V.-P.; Teng, E.; Cohen, A.; Harper, J.; Elion, C.; Goy, C.; Gao, Y.; Henry, H.; Mattar, M.; et al. Unity: A General Platform for Intelligent Agents. arXiv 2020, arXiv:1809.02627. [Google Scholar]
  9. Cohen, A.; Teng, E.; Berges, V.-P.; Dong, R.-P.; Henry, H.; Mattar, M.; Zook, A.; Ganguly, S. On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning. In Proceedings of the Reinforcement Learning in Games Workshop at AAAI 2022, Vancouver, BC, Canada, 28 February–1 March 2022. [Google Scholar]
  10. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.V.; Lanctot, M.; de Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015. [Google Scholar]
  11. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.P.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016. [Google Scholar]
  12. Jang, K.L.; Livesley, W.J.; Vernon, P.A. Heritability of the big five personality dimensions and their facets: A twin study. J. Personal. 1996, 64, 577–591. [Google Scholar] [CrossRef] [PubMed]
  13. Lo, M.T.; Hinds, D.A.; Tung, J.Y.; Franz, C.; Fan, C.C.; Wang, Y.; Chen, C.H. Genome-wide analyses for personality traits identify novel loci and pathways. Nat. Commun. 2021, 12, 1–11. [Google Scholar]
  14. Uzieblo, K.; Verschuere, B.; Van den Bussche, E.; Crombez, G. The validity of the Psychopathic Personality Inventory-Revised in a community sample. Assessment 2010, 17, 334–346. [Google Scholar] [CrossRef] [PubMed]
  15. Jiang, N.; Shi, M.; Xiao, Y.; Shi, K.; Watson, B. Factors Affecting Pedestrian Crossing Behaviors at Signalized Crosswalks in Urban Areas in Beijing and Singapore. In Proceedings of the ICTIS 2011, Wuhan, China, 30 June–2 July 2011; p. 1097. [Google Scholar] [CrossRef]
  16. Hirschfeld, R.R.; Jordan, M.H.; Thomas, C.H.; Feild, H.S. Observed leadership potential of personnel in a team setting: Big five traits and proximal factors as predictors. Int. J. Sel. Assess. 2008, 16, 385–402. [Google Scholar] [CrossRef]
  17. Durupinar, F.; Allbeck, J.M.; Badler, N.I.; Guy, S.J. Navigating performance: Surfing on the OCEAN (Big Five) personality traits. J. Econ. Bus. Account. 2024, 7, 767–783. [Google Scholar]
  18. Durupinar, F.; Pelechano, N.; Allbeck, J.; Gudukbay, U.; Badler, N. How the Ocean Personality Model Affects the Perception of Crowds. IEEE Comput. Graph. Appl. 2011, 31, 22–31. [Google Scholar] [CrossRef] [PubMed]
  19. Du, H.; Huhns, M.N. Determining the Effect of Personality Types on Human-Agent Interactions. In Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Atlanta, GA, USA; 2013; Volume 2, pp. 239–244. [Google Scholar] [CrossRef]
  20. Liu, S.; Rizzo, P. Personality-aware virtual agents: Advances and challenges. IEEE Trans. Affect. Comput. 2021, 12, 1012–1027. [Google Scholar]
  21. DeYoung, C.G.; Krueger, R.F. Understanding personality through biological and genetic bases. Annu. Rev. Psychol. 2021, 72, 555–580. [Google Scholar]
  22. Ashton, M.C.; Lee, K.; De Vries, R.E. The HEXACO Model of Personality Structure and the Importance of Agreeableness. Eur. J. Personal. 2020, 34, 3–19. [Google Scholar]
  23. Kumar, S.; Singh, V. Behavior modeling for personalized virtual coaching using contemporary personality theories. Int. J. Hum.-Comput. Stud. 2021, 146, 102557. [Google Scholar]
  24. Oulhaci, M.; Tranvouez, E.; Fournier, S.; Espinasse, B. A MultiAgent Architecture for Collaborative Serious Game applied to Crisis Management Training: Improving Adaptability of Non Played Characters. In Proceedings of the 7th European Conference on Games Based Learning (ECGBL 2013), Porto, Portugal, 3–4 October 2013. [Google Scholar]
  25. Hafner, D.; Pavsukonis, J.; Ba, J.; Lillicrap, T.P. Mastering Diverse Domains through World Models. arXiv 2023, arXiv:2301.04104. [Google Scholar]
  26. Alonso, E.; Peter, M.; Goumard, D.; Romoff, J. Deep Reinforcement Learning for Navigation in AAA Video Games. In Proceedings of the International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020. [Google Scholar]
  27. Sestini, A.; Kuhnle, A.; Bagdanov, A.D. DeepCrawl: Deep Reinforcement Learning for Turn-based Strategy Games. arXiv 2019, arXiv:2012.01914. [Google Scholar]
  28. Liapis, G.; Lazaridis, A.; Vlahavas, I. Escape Room Experience for Team Building Through Gamification Using Deep Reinforcement Learning. In Proceedings of the 15th European Conference of Games Based Learning, Online, 23–24 December 2021. [Google Scholar]
  29. Liapis, G.; Zacharia, K.; Rrasa, K.; Liapi, A.; Vlahavas, I. Modelling Core Personality Traits Behaviours in a Gamified Escape Room Environment. Eur. Conf. Games Based Learn. 2022, 16, 723–731. [Google Scholar] [CrossRef]
  30. Liapis, G.; Vordou, A.; Vlahavas, I. Machine Learning Methods for Emulating Personality Traits in a Gamified Environment. In Proceedings of the 13th Conference on Artificial Intelligence (SETN 2024), Piraeus, Greece, 11–13 September 2024; pp. 1–7. [Google Scholar] [CrossRef]
  31. Unity Technologies. Unity Real-Time Development Platform | 3D, 2D VR & AR Engine. Unity, version 6000.1.11f1; Game Development Platform. 2025. Available online: https://unity.com/ (accessed on 18 December 2024).
  32. RealMINT ai. 2024. RealEscape. Available online: https://store.steampowered.com/app/3301090/RealEscape/ (accessed on 10 December 2024).
  33. Delalleau, O.; Peter, M.; Alonso, E.; Logut, A. Discrete and Continuous Action Representation for Practical RL in Video Games. arXiv 2019, arXiv:1912.11077. [Google Scholar]
  34. Justesen, N.; Bontrager, P.; Togelius, J.; Risi, S. Deep learning for video game playing. IEEE Trans. Games 2019, 12, 1–20. [Google Scholar] [CrossRef]
Figure 1. Simple single-agent environment.
Figure 2. Complicated single-agent environment (the numbers 1–4 indicate the four parts of the map that correspond to the agent’s training lessons).
Figure 3. Multi-agent environment (the boxes on the left depict the initial positions of the agents and buttons, while the arrows on the right show how the agents press the buttons and the box marks the agent that takes the key).
Figure 4. Default (without simulating behaviors) multi-agent team.
Figure 5. Team rewards for introverts (pink) and extroverts (orange).
Figure 6. Episode length for introverts (pink) and extroverts (orange).
Figure 7. Team rewards for agreeable (red) and non-agreeable (cyan) agents.
Figure 8. Episode length for agreeable (red) and non-agreeable (cyan) agents.
Figure 9. Team rewards for three extroverts and one introvert (green) and three introverts and one extrovert (gray).
Table 1. Simple Rewards.
Behavior | Reward
Pressed button | 0.4 × time
Pick up key | 0.4 × time
Unlock door | 0.4 × time
Escaped | 1 × time
Escaped all | 10 × time
Collided (other agent or obstacle) | −0.01 × time
Table 2. Hyperparameters and best values.
Hyperparameter | Single-Agent Value | Multi-Agent Value
batch size | 128 | 2048
buffer size | 128,000 | 256,000
learning rate | 0.004 | 0.005
hidden units | 512 | 512
number of layers | 2 | 3
epochs | 3 | 5
Table 3. Single-agent results effectiveness.
Map | Sensors | Optimal Steps ¹ | Mean Escape Time | Success Rate
Simple | Raycast | 400 k | 410 | 53%
Simple | Raycast + Grid | 500 k | 280 | 90%
Simple | Raycast + Camera | 600 k | 331 | 66%
Simple | Raycast + Grid + Camera | 200 k | 309 | 73%
Complex | Raycast + Grid | 900 k | 320 | 78%
¹ Elbow method.
Table 4. Uniform personality teams and effectiveness.
id | Personality (+/−) | Mean Reward ¹ | Mean Escape Time ² | Success Rate (All Agents Escaped)
1 | Default | 1970 | 410 | 27%
2 | Openness | 1695/2154 | 505/411 | 37/20%
3 | Extraversion | 1776/1891 | 428/409 | 29/30%
4 | Agreeableness | 1819/2003 | 440/430 | 37/27%
5 | 3 Extroverts + 1 Introvert | 1150 | 380 | 39% (+10%)
6 | 3 Introverts + 1 Extrovert | 1318 | 403 | 48% (+18%)
7 | 3 Agreeable + 1 Non-agreeable | 1740 | 435 | 35% (−2%)
8 | 3 Non-agreeable + 1 Agreeable | 1991 | 445 | 30% (+2%)
¹ Final cumulative reward based on agent actions. ² Mean play time in seconds.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
