PokerOWL: A Multi-Agent Poker Environment for Benchmarking Open-World Learning

Chiu, Min-Hsueh; Nananukul, Navapat; Kejriwal, Mayank

doi:10.3390/app16115458


by Min-Hsueh Chiu, Navapat Nananukul and Mayank Kejriwal *

Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5458; https://doi.org/10.3390/app16115458

Submission received: 26 April 2026 / Revised: 17 May 2026 / Accepted: 22 May 2026 / Published: 31 May 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

In complex task environments in both nature and human society, structural violations of expectation (VoE) occur with non-trivial frequency. Agents that are designed to operate in such environments must be capable of open-world learning (OWL), defined as the ability to detect and accommodate out-of-distribution inputs, as well as more complex structural VoEs, without requiring extensive and offline re-training. Until recently, OWL research was relatively constrained and limited to areas such as anomaly detection and concept drift. More recently, agent-based OWL research has witnessed much interest from across the community. To support this research, not just for developing OWL algorithms, but also evaluating them, there is a need for multi-agent environments where structural VoEs can be generated, and controlled experiments can be run with relative ease. To address this need, we propose a resource called PokerOWL, a platform that is supported on the Gymnasium infrastructure, which is extensively used in the reinforcement learning and AI game-playing communities. PokerOWL supports both a rich VoE generator and a graphical interface for facilitating development and evaluation of OWL methods. Using an extensive set of experiments and a Poker-playing agent based on Deep Q-Networks, we use PokerOWL to demonstrate how even state-of-the-art agents can struggle to generalize to novel situations without additional OWL capabilities.

Keywords:

open-world learning; Poker; deep reinforcement learning; benchmarking platform; violation of expectation

1. Background

Recent years have witnessed significant advances in both deep reinforcement learning and generative AI [1,2,3]. Most of these systems are trained to handle real-world scenarios, such as game-playing, question answering, and dialog generation, where the task distribution remains relatively stable and predictable. However, emerging paradigms such as Open-World Learning (OWL) place adaptation to novelty at their core. This raises a fundamental question: can these models learn the underlying principles required to adapt to Violations of Expectation (VoEs) and other forms of novelty in open-world contexts [4,5]?

In agent-based environments, VoEs (more colloquially, novelty) may be defined as states or situations that violate implicit or explicit assumptions about agents, the environment, or agent–agent and agent–environment interactions in a given task environment. Both in nature and in complex real-world situations arising in human life, such VoEs occur more frequently (i.e., follow a fatter-tailed distribution) than a normal distribution would suggest [6,7]. Famous examples of such novelty are known to arise in domains of varying complexity [8,9,10,11,12].

The ability to accommodate VoEs in the environment is a hallmark of an area of research called open-world learning (OWL) [13,14]. A recent perspective [15] on OWL has argued that, despite their tremendous success, AI models still tend to lack robustness on OWL problems. A simple and illustrative example can be found in the game of chess [16], which boasts numerous variants like double chess, featuring modified rules such as two full armies per side on a 12×16 board, with pawns advancing up to four steps on their first move. In such cases, AI systems like AlphaZero [17], initially trained on the standard version of chess, may require re-training to adapt to these new rules [18], as opposed to generalizing from the initial learned model by drawing on fundamental principles. The openness in real-world OWL environments may originate from stochasticity or hidden information, which may not be known during training. Alternatively, adversarial attacks and human creativity can also alter the operating environment in radically unexpected ways. For example, events like animals crossing the street and traffic lights melting during a heat wave [19] demonstrate the stochasticity of real-world OWL environments, which leads to openness in the training of self-driving systems. Without safeguards or additional learning mechanisms, openness can prove detrimental to AI systems, as the assumption that the test distribution resembles the training distribution is problematic in practice [20,21,22], despite its theoretical and analytical advantages.

Given the rising importance of building OWL-capable AI agents and evaluation infrastructures, there has been progress on developing AI agents capable of detecting and adapting to novelties in near real-time, especially in single-agent task environments (like foraging and problem-solving in a 3D environment) [23,24]. Unfortunately, multi-agent platforms for supporting OWL research are few and far between [25]. Moreover, recent successful OWL environments [26,27,28]—such as 2D grid-based games and 3D environments like Minecraft—primarily focus on agent-controlled interactions where objects exhibit consistent and predictable behavior. Novelty in these environments arises from positional factors or the introduction of new objects. While useful as a first step, an open-source infrastructure for profiling and benchmarking computational behavior of agents for more complex VoEs, including changes in game rules and agent behaviors, is lacking. Such infrastructure would enable systematic evaluation of agent robustness to structural VoEs and accelerate development of more adaptive models.

One such environment that would be particularly useful for supporting OWL research is Poker, owing to its potential for testing decision making under conditions of uncertainty [29], and its close connections to game theory [30,31]. In the Poker tournament circuit, players frequently engage in deception, change their tactics (both within and across days and tournaments), and leverage considerable ‘soft’ knowledge that they glean through behavioral observation at the table. Furthermore, there are many variants of Poker RL agents [32,33,34,35], and expert players are usually able to adapt to new rules fairly quickly. However, a Poker-playing OWL agent that accommodates structural VoEs does not yet exist.

To address the lack of open-source infrastructure for evaluating agents under complex VoEs, we propose PokerOWL, a Poker-based platform for developing and evaluating OWL agents that accommodate structural VoEs of varying complexity. PokerOWL is built on the widely-used Gymnasium infrastructure and is meant to support OWL research as a first-class citizen. It provides a modular implementation of limit Texas Hold’Em, a famous variant of Poker. The configurable environment enables users to simulate games based on the original rules, with customizable gaming parameters, logging, and visualization capabilities. In addition to the original version of limit Texas Hold’Em, PokerOWL also contains an implementation of a novelty generator that allows VoEs across different categories to be injected. Currently, the novelty generator can be used to inject 25 ‘atomic’ VoEs with a combinatorially explosive set of possibilities, and is designed to allow external developers to expand the generator with their own novelties to facilitate a rich suite of OWL experiments.

PokerOWL provides a testbed for investigating how agents adapt to structural VoEs of varying complexity. To showcase PokerOWL’s computational facilities, we design and conduct an extensive series of experiments using a Deep Q-Network (DQN) implementation of a Poker-playing agent [36]. We benchmark this agent in the original setting, as well as simulations with VoEs that the agent did not originally train on, demonstrating how even state-of-the-art models like DQN struggle to generalize to novel situations without additional OWL capabilities. In prior work, conducting such experiments required significant ad-hoc implementation; however, with PokerOWL, we are able to conduct thousands of OWL-based computational experiments and analyze the collected data with relative ease. The remainder of this paper is organized as follows: we detail the PokerOWL framework, its environment, novelty primitives, and agent implementations; describe our experimental protocol for benchmarking agent performance; present empirical results demonstrating how DQN responds to structural changes; and discuss implications for open-world learning research and identify directions for future work.

2. PokerOWL Architecture

In this section, we present PokerOWL, a framework to evaluate the robustness of Poker-playing agents across both original and VoE settings. Figure 1 provides an overview of the framework. PokerOWL is implemented as a Gymnasium-style environment and draws upon an object-oriented approach to structure key Poker utilities, such as dealer, phase, board, player, card, agent, and action components. Gymnasium provides a standardized Application Programming Interface (API) for interaction between the user and various environments, including initially published ones such as object controlling, MuJoCo, and Atari [37]. The object-oriented approach improves the framework’s modularity and extensibility, and allows us to implement the novelty generator in a plug-and-play fashion. PokerOWL is installable via https://pypi.org/project/gym-open-poker/ and is available at https://github.com/minhsueh/gym-open-poker, all accessed on 10 October 2024.

2.1. Environment

Original Texas Hold’em Environment: Before a game begins, seats and relative positions are predefined. The player in the dealer position, also called the button, is granted the privilege of acting last in the betting. The player immediately to the left of the button (facing the table), is known as the small blind, while the player immediately to the left of the small blind is known as the big blind. When a game begins, the players in these two positions must respectively place predefined (small and big) blind amounts into the pot to start the action.

A typical game of Poker involves four distinct rounds, unless the game ends in an earlier round: pre-flop, flop, turn, and river. Each round starts with cards being dealt, followed by players betting until the round concludes. Note that Poker games are played out over numerous tournaments to evaluate an agent’s strategic capabilities against different sets of opponents by the end of the tournament.

Environment with Violations of Expectation (VoE): The environment is designed specifically to test an agent’s performance in the OWL setting. We implement VoE using an open-world novelty generator to introduce new and unexpected scenarios. It is implemented using a wrapper module sourced from Gymnasium [38]. To ensure sufficient diversity of novelty, we define five primary novelty categories, each linked to a module within PokerOWL: agent, action, card, concludeGame, and gameElement. Each category contains multiple primitives that modify some aspect of the original game at an elemental level (see Supplementary Materials). Primitives range from simple parameter variations in existing game functions (e.g., BuyIn modifies the buy-in amount), to more sophisticated logical modifications (e.g., SeatChanging alters the player’s seat each game).

The use of independent primitives that can be combined allows us to generate a combinatorially expansive set of novelties, while preserving the underlying game structure by disallowing primitives that involve syntactic changes to the observation or action space, e.g., through the introduction of an unknown action. Primitives implemented within the same category are mutually exclusive, as they are designed to alter the same module. This systematic approach also allows us to maintain precise control over the novelty injection process.
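The mutual-exclusivity rule described above can be illustrated with a minimal sketch. This is not PokerOWL's own code: the function name `is_valid_combination` and the lookup table are hypothetical, and the category assigned to each primitive is an assumption inferred from the identifier prefixes used later in the paper (e.g., ac1, c2, cg4, ag3).

```python
from typing import Dict, Iterable

# Hypothetical lookup table. The five category names come from the text;
# the primitive-to-category assignment is an assumption inferred from the
# identifier prefixes reported in the results (ac*, c*, cg*, ag*).
PRIMITIVE_CATEGORY: Dict[str, str] = {
    "GameFoldRestrict": "action",
    "NoFreeLunch": "action",
    "CardDistLow": "card",
    "CardDistHigh": "card",
    "LuckySeven": "concludeGame",
    "AddAgentConservative": "agent",
    "AddAgentAggressive": "agent",
}


def is_valid_combination(primitives: Iterable[str]) -> bool:
    """Return True if no two primitives belong to the same category."""
    seen_categories = set()
    for name in primitives:
        category = PRIMITIVE_CATEGORY[name]
        if category in seen_categories:
            # Two primitives would alter the same module, so reject.
            return False
        seen_categories.add(category)
    return True
```

Under these assumptions, `is_valid_combination(["LuckySeven", "CardDistLow"])` is accepted, whereas a pair such as `["GameFoldRestrict", "NoFreeLunch"]` is rejected because both primitives target the action module.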

2.2. Agents

2.2.1. Background Agents

As Poker is a multi-agent game, we use three background agents adapted to different testing scenarios. These agents (each embodying a strategy) are denoted as AgentR, AgentD, and AgentP. AgentR operates by executing random actions from the allowable actions. In contrast, AgentD adopts a passive approach, consistently participating in each hand by either calling or checking to match the highest bet, without ever opting to fold or raise. Finally, AgentP is equipped with a rule-based strategy that dictates participation based on specific conditions. For instance, it is programmed to raise when holding a high pair in the pre-flop or when possessing hands with a high-enough probability of obtaining combinations superior to a three-of-a-kind hand ranking. This strategic approach gives AgentP a competitive edge by allowing it to make calculated decisions based on the current state of the board.
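As a rough illustration of the two simplest background strategies, the sketch below implements AgentR and AgentD as plain functions over a list of allowable action names. The function names and the string-based action representation are assumptions for illustration and do not reflect PokerOWL's actual agent interface; AgentP is omitted because its rules depend on hand-strength estimates that are not fully specified here.

```python
import random
from typing import Optional, Sequence


def agent_r_policy(allowable_actions: Sequence[str],
                   rng: Optional[random.Random] = None) -> str:
    """AgentR: pick uniformly at random among the allowable actions."""
    rng = rng or random.Random()
    return rng.choice(list(allowable_actions))


def agent_d_policy(allowable_actions: Sequence[str]) -> str:
    """AgentD: always stay in the hand by checking or calling.

    Checking is preferred when there is no bet to match; otherwise the
    agent calls. It never folds or raises.
    """
    for action in ("check", "call"):
        if action in allowable_actions:
            return action
    # Assumption: this sketch does not define AgentD's behavior when
    # neither check nor call is offered by the environment.
    raise ValueError("neither 'check' nor 'call' is allowable")
```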

2.2.2. Deep Q-Network Agent

We use a deep reinforcement learning-based agent specifically designed for PokerOWL, but also applied in earlier work in other multi-agent games like Monopoly [39]. Drawing inspiration from seminal work such as that of Mnih et al. [40], the agent’s architecture embodies the principles of Deep Q-Networks (DQN). While this architecture is well established and is known to be state-of-the-art in many domains [36], its robustness to VoEs and its performance in OWL settings have not been empirically established. We demonstrate the utility of PokerOWL by showing how it can be used to conduct such experiments using the DQN agent as the player-1 ‘user.’

Observation Space: AgentDQN uses observations directly from PokerOWL, which includes the game index, round, pot amount, community cards, dealer and player positions, hole cards, player’s bankroll, and the previous action. These observations collectively form a 32-dimensional vector, constituting the agent’s observation space representation.

State Space: Given the sequential nature of Poker actions and the requirement for consecutive actions to achieve rewards, four consecutive observations are concatenated to form a single state representation. This approach enables the agent to capture the game’s dynamics across multiple time steps.
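A minimal sketch of this frame-stacking step is shown below, using only the standard library. The class name `FrameStacker` is hypothetical, and padding the history with copies of the first observation at the start of a game is an assumption, since the text does not say how the first few steps are handled.

```python
from collections import deque
from typing import Deque, List, Sequence

OBS_DIM = 32   # observation size stated in the text
N_FRAMES = 4   # number of consecutive observations per state


class FrameStacker:
    """Concatenate the most recent N_FRAMES observations into one state."""

    def __init__(self, obs_dim: int = OBS_DIM, n_frames: int = N_FRAMES) -> None:
        self.obs_dim = obs_dim
        self.n_frames = n_frames
        self.frames: Deque[List[float]] = deque(maxlen=n_frames)

    def reset(self, first_obs: Sequence[float]) -> List[float]:
        # Assumption: fill the history with copies of the first observation.
        self.frames.clear()
        for _ in range(self.n_frames):
            self.push(first_obs)
        return self.state()

    def push(self, obs: Sequence[float]) -> List[float]:
        if len(obs) != self.obs_dim:
            raise ValueError(f"expected {self.obs_dim} values, got {len(obs)}")
        self.frames.append(list(obs))  # the oldest frame is dropped automatically
        return self.state()

    def state(self) -> List[float]:
        # Oldest frame first, newest frame last.
        return [value for frame in self.frames for value in frame]
```

With four 32-dimensional observations, the resulting state has 4 × 32 = 128 entries.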

Action Space: The action space of the agent comprises six fundamental actions: call, bet, raise_bet, check, fold, and all_in. These actions encompass the strategic choices open to the agent during gameplay to optimize performance.

Reward Function: Poker presents a sparse reward environment, wherein rewards are only received at the end of each game. The reward is determined by the difference in the player’s bankroll between consecutive games. Additionally, the step rewards are calculated retroactively from the final reward at the end of the game. Specifically, each earlier step reward is assigned as the final reward multiplied by a decay factor (delta) raised to the power of the number of steps remaining until the final action. For example, if the final reward is 100, the delta is 0.9, and the player performs four actions in this game (call, check, check, check), the step rewards would be 72.9, 81, 90, and 100, respectively.
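The retroactive assignment can be written as a one-line formula: with $n$ actions in a game, the reward for step $t$ (0-indexed) is the final reward times delta raised to the power $n - 1 - t$. The helper below is a sketch of that arithmetic only; the function name `step_rewards` is hypothetical.

```python
from typing import List


def step_rewards(final_reward: float, delta: float, n_steps: int) -> List[float]:
    """Spread a terminal reward backwards over the steps of one game.

    The last step receives the full final reward; each earlier step is
    discounted by one additional factor of delta.
    """
    return [final_reward * delta ** (n_steps - 1 - t) for t in range(n_steps)]


# Example from the text: final reward 100, delta 0.9, four actions.
rewards = step_rewards(100.0, 0.9, 4)
print([round(r, 1) for r in rewards])  # [72.9, 81.0, 90.0, 100.0]
```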

Architecture: The agent’s neural network architecture is designed to process the environment state efficiently. It takes a 128-dimensional vector obtained by concatenating four frames of 32-dimensional observations as input. The network comprises two hidden layers containing 512 neurons with sigmoid activation functions. Dropout layers with rate 20% are incorporated after each hidden layer to mitigate overfitting. The output layer has a dimension of 6, with each element representing the Q-value for one of the available actions. During training, the agent uses replay memory to effectively improve learning stability and counteract catastrophic forgetting. Operating with a replay memory size of 50,000 and a sampling size of 256, this mechanism allows for smoother learning by offering a diverse pool of past experiences. This is especially advantageous in addressing the challenges of delayed rewards and complex gameplay dynamics. Additionally, the target network is updated every 32 steps to stabilize the training process. The AgentDQN is trained in the original setting to establish a baseline performance. Following this, the agent is evaluated in an OWL environment containing VoEs.
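The replay-memory and target-update settings described above can be sketched with the standard library alone. The class and function names are hypothetical and the network itself is omitted; only the capacity (50,000), the sampling size (256), and the target update interval (32 steps) are taken from the text.

```python
import random
from collections import deque
from typing import Any, Deque, List, Optional, Tuple

REPLAY_CAPACITY = 50_000    # replay memory size from the text
BATCH_SIZE = 256            # sampling size from the text
TARGET_UPDATE_EVERY = 32    # target network update interval from the text

Transition = Tuple[Any, int, float, Any, bool]


class ReplayMemory:
    """Fixed-size buffer that discards the oldest transitions when full."""

    def __init__(self, capacity: int = REPLAY_CAPACITY) -> None:
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def push(self, state: Any, action: int, reward: float,
             next_state: Any, done: bool) -> None:
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = BATCH_SIZE,
               rng: Optional[random.Random] = None) -> List[Transition]:
        # Uniform sampling without replacement from the stored transitions.
        rng = rng or random.Random()
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


def should_sync_target(step: int) -> bool:
    """True on the training steps at which the target network is refreshed."""
    return step > 0 and step % TARGET_UPDATE_EVERY == 0
```

In a training loop, one would typically wait until `len(memory) >= BATCH_SIZE` before sampling, and copy the online network's weights into the target network whenever `should_sync_target(step)` holds.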

The performance and action profile of AgentDQN at different learning rates is illustrated in Figure 2. Each blue point represents AgentP’s performance when competing against AgentDQN at a specific learning rate. They are plotted on the same x-axis for better comparison. For each data point, the corresponding action profile is also plotted. AgentP performs consistent actions, most likely folding. AgentDQN exhibits more complex actions at a learning rate of $10^{-4}$, in contrast to other learning rates where it performs more monotonous actions.

2.3. Visual Interface

PokerOWL also includes a visual interface (Figure 3), implemented with PyGame, that lets both developers and human users observe and track gameplay directly. The top left panel displays the game parameters, including the following: small blind amount, big blind amount, game index, and betting structure. In the example shown, the game is in the pre-flop round of the first game. Following the placement of the small blind (player_3) and big blind (player_6) forced bets into the pot and the dealing of hole cards to each player, the betting commences with player_2 (the player seated to the left of the big blind). In this instance, player_2 folded, while player_5 called and contributed $10 to the pot. Now, it is the user’s turn to take action, with the available actions displayed in the bottom right panel. In this scenario, the user can choose to call, raise, or fold. After the user inputs their desired action into PokerOWL, the environment responds accordingly and prompts the user for the next move. This process repeats each time a user action is required.

An envisaged use of the interface is to promote efficient debugging of the novelty generator as the set of primitives is extended further. However, we emphasize here that PokerOWL can function independently of the visual interface, with action and observation spaces accessible in symbolic form. The interface is also meant to support human experiments involving OWL: for example, for assessing which novelties humans are most adept at detecting and how quickly they can identify them as gameplay extends over several betting rounds. Such experiments are important for establishing human performance baselines for novelty detection and adaptation, which are currently lacking in the literature.

3. Experimental Setup

In this section, we explain the experimental design and data collection protocol that we use to evaluate the agents. We also introduce the metrics used to assess AgentDQN’s performance. The experimental code and the corresponding results are available at https://github.com/minhsueh/open-world-expt (accessed on 10 October 2024). Experiments were run locally on an Apple MacBook with an M1 chip and 16 GB of unified memory; no GPU was required. Each tournament takes between 5 and 10 min and generates approximately 12 MB of summarization data and around 100 MB of game logs.

3.1. Experimental Design and Data Collection Protocol

As stochasticity is inherent in Poker, we design and conduct an extensive set of experiments to control for randomness. A game $G$ consists of four rounds, involving a set of $N$ players $P = \{p_{1}(m_{1}), p_{2}(m_{2}), \dots, p_{N}(m_{N})\}$, and with each player $p_{i}$ associated with its own strategy model $m_{i}$. In this setup, $p_{1}$ is always the PokerOWL user (i.e., $p_{1}(m_{1}) = p_{1}(\mathrm{DQN})$), and in our experiments, corresponds to a deep reinforcement learning-based agent called AgentDQN, the details of which were provided earlier. The other players are selected from among AgentR, AgentD, and AgentP, depending on the specific experiment being conducted. Note that the same agent can be associated with multiple players, as an agent here just represents the strategy that the player adopts. For example, both $p_{2}$ and $p_{3}$ can be associated with AgentR ($= m_{2} = m_{3}$), which simply means that both players are choosing to act randomly. However, all players are still acting independently (in a non-coordinated fashion) of each other, as no player has a priori knowledge of the assigned strategies of the other players.

The total player count $N$ can vary from 2 to 10 in a typical Poker setting. In these experiments, we assume a total of $N = 8$ players. A tournament $T$ is defined as a sequence of games $(G_{1}, \dots, G_{|T|})$, beginning with a fixed buy-in amount for each player and concluding when one of three conditions is met: (i) the user $p_{1}$ loses, (ii) one player (which may or may not be $p_{1}$) wins against all others, or (iii) the game reaches the maximum game limit (currently set at 30). A tournament might have different game counts ($|T|$) depending on which of the above three conditions triggers, but it can never exceed the maximum game limit. We define each experiment $EX$ to consist of 100 independent tournaments $T$ (each initiated with a different random seed), where the first 20 tournaments $(T_{1}, \dots, T_{20})$ adhere to the original non-novel setting (called the pre-novelty phase), while the subsequent 80 tournaments $(T_{21}, \dots, T_{100})$ inject a predefined novelty-ruleset (the post-novelty phase). The same novelty $\mathcal{N}$ is always introduced at the beginning of each game in the post-novelty phase. Finally, we define an evaluation $EV$ as comprising three independent experiments $EX$, each with different player sets $P$. As discussed earlier, strategies for $P$ are combinations of (AgentR, AgentD, AgentP). The combinations used in each evaluation are set as (7,0,0), (0,7,0), and (0,0,7), respectively.
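The protocol above can be summarized in a short sketch. All names are hypothetical, and two details are assumptions made for illustration: a player is treated as having lost once their cash balance is no longer positive, and the winner-takes-all condition is detected by counting players with positive cash.

```python
from typing import Dict

N_TOURNAMENTS = 100     # tournaments per experiment EX
N_PRE_NOVELTY = 20      # T_1..T_20 use the original rules
MAX_GAMES = 30          # maximum number of games per tournament

# Background-agent counts (AgentR, AgentD, AgentP) for the three
# experiments that make up one evaluation EV.
BACKGROUND_SETS = [(7, 0, 0), (0, 7, 0), (0, 0, 7)]


def phase(tournament_index: int) -> str:
    """Map a 1-based tournament index to its phase."""
    if not 1 <= tournament_index <= N_TOURNAMENTS:
        raise ValueError("tournament index out of range")
    return "pre-novelty" if tournament_index <= N_PRE_NOVELTY else "post-novelty"


def tournament_over(cash: Dict[str, float], games_played: int,
                    user: str = "p1") -> bool:
    """Check the three stopping conditions for a tournament."""
    if cash[user] <= 0:                               # (i) the user loses
        return True
    if sum(1 for c in cash.values() if c > 0) == 1:   # (ii) a single winner
        return True
    return games_played >= MAX_GAMES                  # (iii) game limit reached
```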

Using the logging facilities in PokerOWL, we log each player’s cash balance and behavioral summary throughout the tournament. For a game $G$ in tournament $T$ and player $p_{i}$, we systematically collect the cash balance at the game’s conclusion, denoted as $c(p_{i})$, along with the count of actions executed during the game, formally represented as $a_{j}(p_{i}, G)$, where $j$ denotes the action index ranging from 1 to 6, corresponding to the action space. Consequently, we construct a cash vector of size $(N, |T|)$ and action vector of size $(N, |T|, 6)$ for each tournament, where $N$ indicates the total number of players and $|T|$ denotes the total number of games in $T$.

3.2. Evaluation Metrics

To measure the effect of novelty on AgentDQN, we profile the overall performance and behavior of both AgentDQN and the background agents, comparing differences between pre-novelty and post-novelty conditions. Given the different time granularities (game, tournament, experiment) involved in the experimental design, we aggregate and pool results accordingly. We consider two metrics: cash balance and action profile. Cash balance reflects performance outcomes, while action profile captures behavioral changes in response to novelty. These metrics are computed at the game, tournament, and experiment levels, as outlined in Table 1. We focus on two main aggregation levels, the experiment level and the tournament level, each providing distinct insights into agent performance and behavior. At the experiment level, we focus on overall trends, calculating AgentDQN’s mean cash balance and its standard error across the 80 post-novelty tournaments. In contrast, the tournament-level metrics offer a more granular perspective by calculating each agent’s final cash balance and action profile, which consists of the normalized frequencies of each of the six possible actions.
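A minimal sketch of the two metrics follows. The function names are hypothetical, and the standard error is computed here from the sample standard deviation (with the $n - 1$ denominator), which is an assumption because the text does not specify the estimator.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple

ACTIONS = ("call", "bet", "raise_bet", "check", "fold", "all_in")


def action_profile(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize raw action counts into frequencies over the six actions."""
    total = sum(counts.get(a, 0) for a in ACTIONS)
    if total == 0:
        return {a: 0.0 for a in ACTIONS}
    return {a: counts.get(a, 0) / total for a in ACTIONS}


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of, e.g., final cash over tournaments."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values for a standard error")
    # Assumption: sample standard deviation (n - 1 denominator).
    return mean(values), stdev(values) / sqrt(n)
```

For an experiment-level summary, `mean_and_standard_error` would be applied to the final cash balances of the 80 post-novelty tournaments.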

4. Results

We begin by baselining AgentDQN’s performance in pre-novelty scenarios by conducting experiments with several combinations of background agents (1–9 instances of AgentR, AgentD, and AgentP). The results, shown in Figure 4a, reveal that AgentDQN consistently outperforms AgentR and AgentP for every tested number of opponents. Moreover, AgentDQN’s cash balance correlates positively with the number of these two background agents in a game. As AgentDQN has already ‘learned’ how to defeat these two agent models, more background agents increase the probability of winning. However, it struggles to defeat AgentD, primarily due to AgentD’s tendency to call or check, thereby revealing limited information about its hand. Nevertheless, AgentDQN exhibits resilience in such scenarios, maintaining its position without succumbing to rapid losses. These results illustrate AgentDQN’s strengths and limitations across different opponent compositions, confirming the assumption that it is an appropriate baseline that can be further tested in OWL experiments (that include post-novelty tournaments).

Figure 5 shows the performance of four agents (measured by the cash balance metric) when AgentDQN competes against different combinations of background agents. AgentDQN was trained with nine AgentP, leading to slightly better performance than other agents in the combinations dominated by AgentP, such as 1R1D5P, 2R2D3P, 1R1D7P, and 2R2D5P. Similar trends are observed when competing against single-type agents: AgentDQN significantly outperforms AgentR but performs similarly to AgentD.

To further evaluate AgentDQN’s performance within a novel environment, we set the number of background players to seven, resulting in an eight-person Poker table, which mirrors a common size in real-world Poker settings. Next, we test it in an environment with a single novelty injected. AgentDQN was evaluated against seven instances each of AgentR, AgentP, and AgentD. The results for these background agents are shown in Figure 4b,c,d, respectively. Poker is a zero-sum game, and while some novelties might disrupt this zero-sum assumption (for example, the external incentive in the LuckySeven novelty, represented as cg4 on the x-axis; see the Supplementary Materials), an opposite cash trend between AgentDQN and background agents is observable. We found that about half of the novelties have negatively impacted AgentDQN. However, novelties have a muted effect on AgentR compared to the other two agents, suggesting that a relatively strong strategy might mitigate the novelty’s effect. Moreover, the influence of novelty varies across different agents’ strategies. For instance, CardDistLow (c2) ranks as the third-most influential novelty when playing against AgentP; yet it has diminished impact when playing against AgentD and AgentR.

Understanding that novelty could have a negative impact, we further used PokerOWL’s logging facilities to investigate the different agents’ action profiles to see how players respond to novelty. We selected GameFoldRestrict (ac1) as an example novelty, as it ranks among the top five novelties with the strongest negative impacts when AgentDQN competes against AgentR and AgentP, while not being as affected when competing against AgentD. In addition to AgentDQN, we implemented a simple AgentOWE to evaluate the difference in novelty handling. Specifically, AgentOWE applies the heuristic strategy that detects novelty when the current tournament cash falls below the moving average over a 10-tournament window. Upon detection, it switches to an alternative DQN trained under the same procedure but within the GameFoldRestrict environment.
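The detection heuristic used by AgentOWE can be sketched as follows. The class name is hypothetical, and two details are assumptions: the detector waits for a full window of 10 past tournaments before it can fire, and the moving average excludes the current tournament. Switching to the alternative DQN after detection is left out.

```python
from collections import deque
from typing import Deque


class MovingAverageNoveltyDetector:
    """Flag novelty when tournament cash drops below its moving average."""

    def __init__(self, window: int = 10) -> None:
        self.window = window
        self.history: Deque[float] = deque(maxlen=window)
        self.detected = False

    def update(self, tournament_cash: float) -> bool:
        """Record one tournament's cash and return the detection flag."""
        # Assumption: compare only once a full window of history exists.
        if not self.detected and len(self.history) == self.window:
            moving_average = sum(self.history) / self.window
            if tournament_cash < moving_average:
                self.detected = True  # detection is sticky once triggered
        self.history.append(tournament_cash)
        return self.detected
```

Because the comparison uses only the agent's own recent cash, an ordinary run of bad luck can trip the detector well before any novelty is injected, so a heuristic of this kind should be expected to produce false alarms.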

The results are depicted in Figure 6. In the pre-novelty tournaments (tournament numbers up to 20), AgentDQN adopts a more conservative approach against AgentR, frequently folding its hand. However, it displays a more aggressive stance against AgentP and AgentD, with betting frequencies of 40% and 60% respectively among all actions—all of which fall within the allowable action set defined by the environment. Interestingly, adopting a similar aggressive approach does not result in the same performance; we only observe that it works well against AgentP. For AgentOWE, the heuristic novelty detection strategy tends to trigger early, detecting novelty at tournament 11 (against AgentP) and at 12 (against AgentR and AgentD). Following switching to an alternative DQN after detection, the agent exhibits extremely conservative behavior during the remaining pre-novelty phase, folding in the majority of actions. This indicates that AgentOWE has not yet learned the “no folding allowed” rules, as it continues to fold in pre-novelty tournaments. Overcoming this limitation will require more advanced learning or inference methods. In contrast, during the post-novelty phase, AgentOWE exhibits different action profiles compared to the pre-novelty phase as a result of the changed environment constraints. For example, when playing against AgentD, the agent tends to adopt a more aggressive strategy. In terms of cash balance, when AgentOWE competes against AgentR and AgentP, both agents experience a significant performance drop; however, AgentOWE consistently performs slightly better than AgentDQN. This shows the complexity of OWL, especially in a multi-agent setting: the appropriate adaptation depends both on the novelty injected by the VoE Novelty Generator, as well as the other competing strategies in play.

In the post-novelty phase, GameFoldRestrict was introduced, which prevented all players from folding during the tournament. This is reflected in the absence of black dots in any post-novelty tournaments. As a result, AgentDQN’s performance dropped to half when competing against AgentR and to one-third when competing against AgentP, but it increased by 20% when competing against AgentD. The consequence of this novelty is that allowable actions are primarily limited to (check, bet) and (call, raise), effectively forcing players to adopt an aggressive stance.

This effect is particularly pronounced in AgentR, where a halved probability of adopting an aggressive strategy exists. With seven AgentR agents at the table, each betting round typically reaches the maximum raise count, leading to the tournament ending in a single game and resulting in no error bars in the action profile. On the other hand, AgentP continues to adhere to a conservative approach, opting to check or call whenever possible. AgentD, already committed to an aggressive strategy, remains unaffected by this novelty. However, the novelty prompts AgentDQN to become more aggressive, resulting in a slightly improved performance.

In addition to comparing three single-type background agents, we also provide the action distributions of AgentDQN against the mixed agent combinations. Figure 7 shows the action profile of four selected players in an environment with the novelty GameFoldRestrict, where AgentDQN competes against three AgentP, two AgentD, and two AgentR. The average cash balance of AgentDQN dropped from 235.12 ± 69.58 to 214.0 ± 54.44 from pre-novelty to post-novelty, while AgentP’s performance changed from 242.63 ± 55.03 to 197.20 ± 51.92, AgentD’s from 278.39 ± 94.97 to 197.19 ± 54.54, and AgentR’s performance changed from 39.62 ± 9.72 to 200 ± 53.5. As described above, folding is prohibited in the post-novelty tournaments. Interestingly, AgentR, which performs random actions, outperforms the other three types of agents in the post-novelty tournaments, indicating that the original “knowledge or strategies” are not applicable to this novel situation.

Figure 8 shows the action profiles of four selected players in an environment with the novelty CardDistHigh, where AgentDQN competes against three AgentP, two AgentD, and two AgentR. The average cash balance of AgentDQN dropped from 287.5 ± 86.01 to 134.65 ± 28.29 from pre-novelty to post-novelty, while AgentP’s changed from 203.92 ± 37.45 to 295.29 ± 41.34, AgentD’s from 259.50 ± 83.95 to 226.78 ± 43.42, and AgentR’s from 89.5 ± 30.63 to 62.48 ± 10.74. This novelty shifts the card distribution toward high-value cards, which benefits AgentP because its strategy is to bet on good hands. The corresponding actions are shown in the top-right sub-figure, where we observe an increase in raise frequency in the post-novelty tournaments.

Further analysis shows that combining different novelties can have a non-linear impact on performance. In Figure 9, we illustrate performance of AgentDQN when 250 depth-two composite novelties, constructed from allowed combinations of 25 primitive novelties, are injected. Figure 5 shows that, when facing AgentP, AgentDQN is most likely to suffer from novelties related to the card distribution (c*), agent strategies (ag*), and action restrictions (ac*). In contrast, the card distribution is not the primary concern when playing against AgentR. The effects of composite novelty can correlate with those of individual primitives. For example, when AgentDQN plays against AgentP under the combination of the AddAgentConservative (ag3) primitive and another primitive related to game element modifications (g*) and game-concluding modifications (cg*), performance varies significantly, ranging from 95.78 to 775.09. When playing against AgentR, AgentDQN generally performs well since AgentR selects actions randomly. However, certain combinations of primitives, such as AddAgentAggressive (ag4) and NoFreeLunch (ac2), still significantly deteriorate AgentDQN’s performance to a low range of 42.81 ± 20.41. Interestingly, composite novelties can also significantly enhance performance without requiring any adaptation. For example, LuckySeven (cg4) encourages players to remain in the game if they hold hole cards valued below 7. Additionally, CardDistLow (c2) modifies the deck distribution to favor lower-numbered cards. Consequently, players who identify this incentive and continue playing are more likely to benefit.
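The count of 250 depth-two composites can be reproduced with a small enumeration. The sketch below assumes that the 25 primitives are split evenly across the five category prefixes mentioned in the text (c, ag, ac, g, cg) and that only cross-category pairs are allowed; the even split and the generated identifiers are assumptions for illustration, not the library's actual listing.

```python
from itertools import combinations

# Assumed layout: five categories with five primitives each (25 in total).
CATEGORIES = ["c", "ag", "ac", "g", "cg"]
primitives = [f"{cat}{i}" for cat in CATEGORIES for i in range(1, 6)]


def category(primitive_id):
    """Strip the trailing index, e.g. 'cg4' -> 'cg'."""
    return primitive_id.rstrip("0123456789")


def depth_two_composites(ids):
    """All unordered pairs whose primitives come from different categories."""
    return [
        (a, b) for a, b in combinations(ids, 2) if category(a) != category(b)
    ]


composites = depth_two_composites(primitives)
print(len(primitives), len(composites))  # prints "25 250"
```

Of the 300 unordered pairs of 25 primitives, 50 fall within a single category under this assumed split, leaving 250 allowed composites, which matches the number reported above.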

5. Discussion and Limitations

Our results suggest that even well-trained reinforcement learning agents, such as the Deep Q-Network (DQN) baseline used here, can perform strongly under standard, in-distribution conditions yet exhibit marked vulnerability when confronted with structural Violations of Expectation (VoEs). These findings align with prior observations in both single-agent and multi-agent OWL research: robustness under novelty is not an automatic byproduct of strong baseline performance. Instead, robust performance requires explicit mechanisms for novelty detection, rapid characterization, and adaptive policy adjustment.

Several promising directions could be explored. At the simplest level, novelty detection could rely on performance monitoring that flags when an agent’s cash balance falls below a moving average of recent tournaments, while more sophisticated approaches could train auxiliary models to analyze environmental observations or employ heuristic-based or meta-learning methods. Proactive methods such as continual learning architectures could further enable agents to explore new behaviors while preserving prior knowledge. PokerOWL’s systematic novelty injection framework is ideally suited for evaluating such mechanisms across novelties of varying complexity. Future work should focus on implementing and benchmarking explicit OWL mechanisms within PokerOWL, moving toward agents that can detect, respond to, and generalize from VoEs in open-world settings.
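As a rough illustration of the simplest option, the sketch below flags a tournament when the agent's final cash balance falls well below a moving average of recent tournaments. It is not part of PokerOWL: the class name, the window size, the tolerance factor, and the sample balances are all hypothetical choices made for this example.

```python
from collections import deque


class MovingAverageDetector:
    """Flag tournaments whose cash balance falls far below the recent mean."""

    def __init__(self, window=5, tolerance=0.7):
        self.window = window          # number of recent tournaments to average
        self.tolerance = tolerance    # flag if balance < tolerance * average
        self.history = deque(maxlen=window)

    def update(self, cash_balance):
        """Record one tournament result; return True if it looks anomalous."""
        flagged = False
        if len(self.history) == self.window:
            baseline = sum(self.history) / self.window
            flagged = cash_balance < self.tolerance * baseline
        if not flagged:
            # Keep flagged results out of the baseline so that it continues
            # to describe pre-novelty performance.
            self.history.append(cash_balance)
        return flagged


# Made-up balances: five ordinary tournaments, then a sharp drop.
detector = MovingAverageDetector(window=5, tolerance=0.7)
flags = [detector.update(b) for b in [280, 300, 290, 310, 295, 140, 130]]
print(flags)  # [False, False, False, False, False, True, True]
```

A single threshold like this will produce false alarms, since tournament outcomes are noisy; requiring several consecutive flags before declaring a novelty is one possible refinement.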

The PokerOWL environment provides a controlled yet richly variable testbed for investigating these mechanisms at scale. Unlike ad-hoc novelty injection approaches, PokerOWL supports repeatable and systematic experiments under varying agent populations, VoE types, and novelty depths. This capability enables researchers to quantify not just the magnitude of performance degradation under novelty, but also the interaction effects between novelty type, opponent strategy profiles, and the agent’s adaptation capabilities. For instance, our experiments reveal that some novelties (e.g., GameFoldRestrict) can drastically alter the action space dynamics, neutralizing previously advantageous strategies, while others (e.g., CardDistHigh) disproportionately benefit specific opponent types. Such effects are often non-linear and cannot be inferred from evaluating individual novelty primitives in isolation, underscoring the need for composite novelty testing.

Importantly, PokerOWL’s black-box treatment of agents—requiring only API-based action/observation exchanges—means that it can evaluate a wide spectrum of agent architectures, from hand-coded heuristic players to large, pretrained policy models. This modularity is particularly relevant for studying adaptation in real time. In principle, agents could be augmented with novelty detection subsystems, meta-reasoning layers, or policy re-weighting modules that respond dynamically within a tournament, without re-training. Future work could exploit PokerOWL to systematically benchmark such approaches against diverse novelty classes.
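The black-box contract can be pictured as a single method that maps an observation to an action. The sketch below is a hypothetical illustration of that idea, not PokerOWL's actual API: the `BlackBoxAgent` protocol, the `act` signature, and the two toy agents are all assumed names.

```python
import random
from typing import Protocol, Sequence


class BlackBoxAgent(Protocol):
    """Anything that can turn an observation into one of the legal actions."""

    def act(
        self, observation: Sequence[float], legal_actions: Sequence[int]
    ) -> int: ...


class RandomAgent:
    """Chooses uniformly among the legal actions."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def act(self, observation, legal_actions):
        return self._rng.choice(list(legal_actions))


class AlwaysFirstAgent:
    """Deterministic stand-in for a hand-coded heuristic player."""

    def act(self, observation, legal_actions):
        return legal_actions[0]


def collect_actions(agent: BlackBoxAgent, observations, legal_actions):
    """Query an agent through the shared interface only."""
    return [agent.act(obs, legal_actions) for obs in observations]


observations = [[0.0, 1.0]] * 3
print(collect_actions(AlwaysFirstAgent(), observations, [4, 5]))  # [4, 4, 4]
```

Because the caller never inspects the agent's internals, a heuristic player, a DQN, or a larger pretrained policy could be swapped in without changing the evaluation loop.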

5.1. Limitations

Several limitations of the present study should be noted. First, the current implementation focuses exclusively on the limit Texas Hold’em variant. The constrained betting structure reduces action space complexity, which is advantageous for isolating novelty effects but may omit key strategic dimensions present in no-limit formats or other poker variants. Extending PokerOWL to these richer settings would test whether adaptation mechanisms generalize to environments with larger, more continuous action spaces and higher-variance payoff structures.

Second, we evaluated only a small set of opponent archetypes—three fixed-strategy heuristic agents and one learning agent—chosen to highlight specific contrasts in behavior and susceptibility to novelty. While these baselines serve to demonstrate PokerOWL’s functionality, they do not represent the diversity of agents used in current multi-agent RL research, such as model-based agents, opponent-modeling strategies, or agents with explicit meta-learning capabilities. Incorporating a more heterogeneous population of agents could better capture the ecological validity of open-world multi-agent settings.

Third, the VoE library, though already extensive with 25 primitives, is necessarily incomplete. Our primitives target structural novelties that preserve action/observation syntax, but in real-world OWL settings, violations can involve shifts in sensory modalities, abrupt rule changes that alter the observation encoding, or adversarial perturbations. Expanding PokerOWL to support syntax-level VoEs in a controlled manner, potentially via a safe fallback mechanism to avoid agent crashes, would further broaden the research scope.

Fourth, the present work does not yet integrate human baselines for novelty detection and adaptation in poker. Human-agent comparison is essential for understanding the gap between machine and human OWL performance, especially in multi-agent games with hidden information. PokerOWL’s visual interface provides the necessary scaffolding for such studies, but human-subject experiments remain as future work.

Finally, our experiments did not implement or compare explicit OWL strategies beyond the baseline DQN. As such, while we demonstrate the magnitude of degradation possible under novelty, we do not yet offer prescriptive guidance on which algorithmic adaptations are most effective. Follow-on studies could benchmark meta-learning, policy-switching, and continual learning architectures within PokerOWL, and assess their computational costs versus adaptation benefits.

5.2. Future Directions

Future research should explore (i) extending PokerOWL to additional poker variants and action spaces, (ii) scaling up agent diversity to include adaptive and meta-reasoning agents, (iii) expanding the VoE library to capture multi-modal and syntax-level novelties, and (iv) conducting mixed human–AI experiments to establish adaptation baselines. Moreover, the environment could be coupled with automated VoE curriculum generation, allowing agents to be trained in progressively more complex novelty landscapes. Such studies will help bridge the gap between current narrow policies and truly open-world-capable multi-agent systems.

In sum, PokerOWL offers a flexible, systematic, and extensible environment for studying open-world learning under controlled yet challenging conditions. By enabling reproducible novelty injection and fine-grained logging, it provides a foundation for both diagnosing the brittleness of existing agents and accelerating the development of adaptive, resilient AI systems.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16115458/s1.

Author Contributions

Conceptualization, M.-H.C. and M.K.; methodology, M.-H.C.; software, M.-H.C. and N.N.; validation, M.-H.C. and N.N.; formal analysis, M.-H.C. and N.N.; data curation, M.-H.C.; writing—original draft preparation, M.-H.C. and M.K.; writing—review and editing, N.N. and M.K.; visualization, M.-H.C. and N.N.; supervision, M.K.; project administration, M.K.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by DARPA grant number W911NF2020003.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Schematic illustration of the PokerOWL framework, including the game environment, VoE or novelty generator, logs, and statistical OWL-based analysis based on the logs.

Figure 2. Performance and action profile of AgentDQN under varying learning rates.

Figure 3. Illustration of the PokerOWL visualization. The red text and the red box are used to explain the parameters in this paper, which are not visible in PokerOWL.

Figure 4. (a) Pooled estimates of the performance (using the cash balance metric) of AgentDQN (red) against varying numbers of AgentR (green), AgentD (blue), and AgentP (purple) ‘background’ agents. Each error bar represents one standard error. For example, 7 P on the x-axis indicates that AgentDQN was playing against 7 AgentP players. The star symbol indicates that AgentDQN’s cash balance is significantly larger than the other models’. The remaining plots show the pooled estimates of the performance of AgentDQN against (seven other) (b) AgentR, (c) AgentD, and (d) AgentP players, respectively, in post-novelty tournaments. The leftmost column (with a gray background) summarizes the pre-novelty estimate against the same set of players, with the pooled average serving as the baseline and plotted as a black dashed horizontal line. In (b–d), the x-axis represents the identifier of the novelty injected into the tournament (with descriptions enumerated in the Supplementary Materials), sorted by declining performance of the AgentDQN for ease of visualization.

Figure 5. Cash performance comparisons between AgentDQN (stars) and different background agent combinations (colored dots).

Figure 6. Cash balance and action profiles of agents when the novelty GameFoldRestrict (ac1) is injected. The first, second, and third rows represent experiments against AgentR, AgentP, and AgentD, respectively. The first column shows the cash balance; the left (gray background) and right panels show the average bankroll for pre- vs. post-novelty and pre vs. post-novelty detection tournaments, respectively. The second and third columns detail AgentOWE’s and AgentDQN’s action profiles, and the fourth column presents the background agents’ action profiles.

Figure 7. Action distribution comparison of AgentDQN with AgentP, AgentD, and AgentR under the novelty GameFoldRestrict.

Figure 8. Action distribution comparison of AgentDQN with AgentP, AgentD, and AgentR under the novelty CardDistHigh.

Figure 9. Performance of AgentDQN against (a) AgentP and (b) AgentR when encountering depth-two novelties (represented by the two novelties in the row and column), and with each cell representing average cash balance in post-novelty tournaments. The upper triangle shows AgentDQN’s cash balance, while the lower triangle shows the background agents’ cash balance. The row displays the performance of AgentDQN, and the column displays the background agents’ performance, in the depth-one novelty scenarios. The diagonal entries are prohibited due to the limitation that primitives from the same novelty category cannot be combined. Because ag1 represents the novelty AgentExchange, where the original agents are replaced with other agent types, it is not reported in this specific experiment.

Table 1. Summary of evaluation metrics, including formulae for cash balance and action profiles, at aggregation levels of experiment and tournament. The action profile metric captures behavioral response across all games in a tournament.

Metric and Aggregation Level	Formula/Description
AgentDQN Cash Balance (Experiment Level)	Mean over 80 post-novelty tournaments: $\mu_{\mathrm{DQN}}^{\mathrm{post}} = \frac{1}{80} \sum_{T_t \in \mathrm{post}} c(p_1(\mathrm{DQN}), \|T_t\|)$; standard deviation: $s_{\mathrm{DQN}}^{\mathrm{post}}$; standard error: $se_{\mathrm{DQN}}^{\mathrm{post}} = \frac{s_{\mathrm{DQN}}^{\mathrm{post}}}{\sqrt{80}}$
Background Agent Cash Balance (Experiment Level)	Mean of background averages: $\mu_{B}^{\mathrm{post}} = \frac{1}{80} \sum_{T_t \in \mathrm{post}} \mu_{B}(\|T_t\|)$; pooled standard error: $se_{B}^{\mathrm{post}} = \sqrt{\frac{1}{80} \sum_{T_t \in \mathrm{post}} \left(s_{B}(\|T_t\|)\right)^{2}}$
Cash Balance (Tournament Level)	Agent’s final balance: $c(p_i(m_i), \|T\|)$; background average: $\mu_{B}(T) = \frac{1}{N-1} \sum_{m_i \in B} c(p_i(m_i), \|T\|)$; standard deviation: $s_{B}(T) = \sqrt{\frac{1}{N-2} \sum (c - \mu)^{2}}$
Action Profile (Tournament Level)	Normalized frequency of each action $j \in \{1, \dots, 6\}$: $a_j(p_i(m_i), T_t) = \frac{\sum_{G \in T_t} a_j(p_i, G)}{\sum_{G \in T_t} \sum_{a \in A} a(p_i, G)}$
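The metrics in Table 1 can be sketched in a few lines of Python. This is an illustrative reading of the formulas rather than the authors' implementation: the function names and the toy data are ours, the sample standard deviation is assumed where the table does not give a formula, and the number of tournaments is taken from the data instead of being fixed at 80.

```python
import math
from statistics import mean, stdev


def experiment_level_dqn(final_balances):
    """Mean, sample standard deviation, and standard error over tournaments."""
    mu = mean(final_balances)
    s = stdev(final_balances)  # assumed: sample standard deviation
    return mu, s, s / math.sqrt(len(final_balances))


def tournament_level_background(background_balances):
    """Average and standard deviation over the N - 1 background agents.

    stdev divides by (count - 1), which equals N - 2 for N - 1 agents.
    """
    return mean(background_balances), stdev(background_balances)


def pooled_standard_error(tournament_sds):
    """Square root of the mean squared per-tournament standard deviation."""
    return math.sqrt(sum(s ** 2 for s in tournament_sds) / len(tournament_sds))


def action_profile(games):
    """Normalized action frequencies across all games in one tournament.

    `games` is a list of dicts mapping an action name to its count in a game.
    """
    totals = {}
    for counts in games:
        for action, n in counts.items():
            totals[action] = totals.get(action, 0) + n
    grand_total = sum(totals.values())
    return {action: n / grand_total for action, n in totals.items()}


# Toy data (made up for illustration only).
profile = action_profile([{"call": 3, "raise": 1}, {"call": 1, "fold": 3}])
print(profile)  # {'call': 0.5, 'raise': 0.125, 'fold': 0.375}
```

The action profile always sums to one within a tournament, so profiles from tournaments with different numbers of games can be compared directly.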
