Article

Combinatorial Game Theory and Reinforcement Learning in Cumulative Tic-Tac-Toe via Evaluation Functions

1 Department of Applied Mathematics and Statistics, State University of New York at Stony Brook, Stony Brook, NY 11794, USA
2 Institute for Advanced Computational Science, State University of New York at Stony Brook, Stony Brook, NY 11794, USA
3 AI Innovation Institute, State University of New York at Stony Brook, Stony Brook, NY 11794, USA
* Authors to whom correspondence should be addressed.
Stats 2026, 9(2), 28; https://doi.org/10.3390/stats9020028
Submission received: 9 February 2026 / Revised: 4 March 2026 / Accepted: 6 March 2026 / Published: 10 March 2026

Abstract

We introduce cumulative tic-tac-toe, a novel variant of the classic 3 × 3 tic-tac-toe game in which play continues until the board is completely filled. Each player’s final score is determined by the total number of three-in-a-row sequences they form. Using combinatorial game theory (CGT), we establish that under optimal play, the game is a draw, and we characterize its theoretical properties. To empirically validate and optimize practical play, we develop a reinforcement learning (RL) framework based on temporal-difference (TD) learning, which is enhanced with a domain-informed evaluation function to accelerate convergence. The experimental results show that our triplet-coverage difference (TCD) evaluation function reduces the average number of training episodes by approximately 23.1% compared with a random-initialization baseline, a statistically significant improvement at the 5% significance level. These results demonstrate the efficiency of our CGT–RL approach for cumulative tic-tac-toe and suggest that similar methods may be useful for analyzing related combinatorial games. We also discuss potential analogies in domains such as competitive resource allocation and coalition formation, illustrating how cumulative-scoring games connect abstract game-theoretic ideas to practical sequential decision problems.

1. Introduction

Games are ubiquitous as both entertainment and scholarly objects of investigation. Researchers in many disciplines study both the formal structure of games and how people play them, contributing to the interdisciplinary field of game studies. The continued rise of game studies reflects the cultural importance of games in all formats, media, and interaction modes [1]. A notable example is the multi-volume series of mathematical strategies for a variety of games [2,3,4,5]. Insights from game studies now inform political science [6], psychology [7], sociology [8], linguistics [9], education [10], and computer science [11]. Methodologies span qualitative, quantitative, and mixed-methods approaches [12]. Historically, games of chance shaped the early development of probability theory [13], whereas many enduring puzzles have deep roots in graph theory and other branches of applied mathematics [14].
Two-player games such as Hex [15], chess [16], and checkers [17] pose well-defined challenges governed by explicit rules and social interactions between opponents. Understanding optimal play, which refers to how rational agents should act to maximize their chances of winning, is a central concern. Game theory offers mathematical models of strategic interaction and has shaped research in education [18], behavioral science [19], economics [20], politics [21], and evolutionary biology [22]. Important concepts in game theory include zero-sum games, where players’ payoffs sum to zero, and perfect information games, where every prior move is observed by all players. Classical tools such as minimax search and backward induction remain fundamental for analyzing such games.
However, traditional game theory often addresses settings with incomplete information, where key details are hidden from the players. For example, an iterative algorithm for computing an exact Nash equilibrium for two-player zero-sum games with imperfect information has been proposed [23]. Combinatorial game theory (CGT) complements this perspective by focusing on two-player, zero-sum, perfect-information games of pure skill, whose outcomes are restricted to win, loss, or draw. Several benchmark games have been completely solved via CGT techniques [24]. Nevertheless, most board games remain theoretically challenging because the exponential growth in the number of possible move sequences renders exhaustive case analysis computationally infeasible [25]. Contemporary approaches therefore turn to modern machine learning methods, especially reinforcement learning, for tractable approximations.
Reinforcement learning (RL), one of the three core branches of machine learning alongside supervised and unsupervised learning, studies how agents learn to act in dynamic environments via trial and error [26]. RL has advanced in autonomous driving [27], career planning [28], robotics [29], neuroscience [30], quantitative finance [31], computational applied mathematics [32], and many other domains [33]. In recent years, RL methods have achieved superhuman performance in many games by combining tabular techniques, such as dynamic programming (DP), Monte Carlo (MC), and temporal-difference (TD) learning, with scalable function approximation [34]. These algorithms yield fast, high-accuracy estimates even in large state spaces once considered intractable.
Tic-tac-toe typifies a two-player, zero-sum, deterministic, perfect-information game. Players alternately mark empty cells on a 3 × 3 grid, aiming to align three symbols horizontally, vertically, or diagonally. If all nine cells are filled without a completed line, the game is a draw, which is the outcome guaranteed under optimal play. Despite its simplicity, tic-tac-toe underpins classroom demonstrations in computer science [26], mathematics education [35], and game theory [36]. Because tic-tac-toe is fully solved theoretically, researchers routinely use it to test new algorithms and evaluate decision-making under uncertainty [37]. In particular, RL algorithms have been successfully applied to tic-tac-toe [38,39,40,41,42]. Building on these results, RL solutions for tic-tac-toe have also been adapted to challenges such as enabling hypermedia agents to perform transfer learning in web environments [43], exploring chain-of-thought reasoning in large language models [44], robotic control tasks [45], and integrating language understanding with gameplay [46].
Numerous variants of tic-tac-toe broaden the original rules, altering board size, introducing imperfect information, or modifying victory conditions. For example, optimal strategies for variants of the dice game Pig have been analyzed using RL [47,48]. These derivatives matter for three reasons. First, they furnish tractable models of real-world scenarios where sequential, irreversible choices accumulate toward long-term rewards. Second, they reveal mathematical structures that lend themselves to generalization. Third, progress on complex variants often translates into advances in artificial-intelligence capabilities with practical impact far beyond games [49].
This paper introduces cumulative tic-tac-toe, a new variant inspired by practical application scenarios. We first formalize the game and discuss its potential relevance to practical decision problems. Next, we analyze the theoretical properties within the CGT framework and cast the game as a Markov decision process (MDP) to facilitate RL-based verification. We integrate a standard tabular RL algorithm with a domain-informed evaluation function that encodes expert priors to accelerate learning. Extensive experiments compare convergence with and without our evaluation function, demonstrating a notable improvement in training efficiency. We conclude by discussing implications for future research.
In summary, the main contributions of this paper are the following. We introduce cumulative tic-tac-toe, a scoring variant of 3 × 3 tic-tac-toe in which play continues until the board is filled and payoffs are determined by the total number of completed three-in-a-row sequences. We analyze cumulative tic-tac-toe within a combinatorial game-theoretic framework and prove that, under optimal play, the game is a draw (Theorems 1 and 2). We then formulate cumulative tic-tac-toe as a finite MDP and develop a practical one-step TD learning procedure seeded with a novel CGT-motivated evaluation function, the triplet-coverage difference (TCD). While one-step TD is a standard RL technique, the novel element of this paper is the CGT-guided initialization: we seed the tabular value table using the TCD evaluation function, thereby leveraging combinatorial structure to accelerate TD learning in cumulative tic-tac-toe. Our experiments show that TCD initialization reduces episodes-to-convergence by approximately 23.1% compared with random initialization (one-sided Student’s t test, p-value = 0.0203), while preserving the draw outcome predicted by CGT. Finally, we discuss how this CGT–RL pipeline can inform the study of other cumulative-scoring decision problems and outline directions for scaling the approach to larger or more complex games.

2. Cumulative Tic-Tac-Toe

In this section, we first formalize the mechanics of cumulative tic-tac-toe by detailing its rules and scoring conventions. We then demonstrate its broader relevance by mapping the game’s cumulative-scoring framework onto six representative real-world decision problems drawn from engineering, economics, and the social sciences. Finally, we consider the variant’s extensibility to higher-dimensional boards and discuss how deviations from perfect play can lead to empirical outcomes that differ from the game’s theoretical determinacy.

2.1. Rules of the Game

Numerous extensions of classical tic-tac-toe have been proposed, including natural generalizations to $n \times n = n^2$ boards and, more generally, to $n \times \cdots \times n = n^d$ hypercubes [37,50,51,52]. Another notable variant is quantum tic-tac-toe, which is based on concepts from quantum physics [53].
Cumulative tic-tac-toe modifies the standard rules by allowing play to continue until all 3 × 3 = 9 cells are filled rather than ending when the first three-in-a-row sequence, referred to as the first triplet, is formed. The outcome is determined by the total number of triplets secured by each player throughout the play. Thus, instead of a binary win/loss decision at the moment a triplet is formed, players compete to accumulate as many triplets as possible over the entire course of play. Each mark placed may contribute to multiple overlapping triplets, increasing the strategic depth of the game.
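To make the scoring convention concrete, the following minimal Python sketch (our own illustration, not code from the paper; cells are indexed 0–8 in row-major order, and the example board is hypothetical) counts each player's completed triplets on a filled board:

```python
# The eight canonical triplets of the 3x3 board, cells indexed 0-8 row-major.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonal and antidiagonal

def triplet_counts(board):
    """Count completed triplets for 'X' (Player 1) and 'O' (Player 2)."""
    tx = sum(all(board[i] == "X" for i in line) for line in LINES)
    to = sum(all(board[i] == "O" for i in line) for line in LINES)
    return tx, to

# Example final board:  X X X / O O X / X O O
print(triplet_counts("XXXOOXXOO"))  # -> (1, 0): Player 1 wins 1-0
```

Because triplets overlap, a single mark (especially the center, which lies on four triplets) can contribute to several potential lines at once.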
Importantly, cumulative tic-tac-toe preserves the spatial structure of the classical version, retaining the eight canonical triplets: three rows, three columns, one diagonal, and one antidiagonal. Table 1 provides a detailed comparison between the classic tic-tac-toe and cumulative tic-tac-toe. We will examine further theoretical insights and properties of this variant in the context of CGT in a later section.
Cumulative tic-tac-toe introduces more strategic depth than the standard 3 × 3 game while remaining fully analyzable on the small board. The rules can be extended to larger boards, potentially allowing exploration of more complex sequential decision processes. Thus, cumulative tic-tac-toe may serve as a tractable model for studying strategic behavior with cumulative scoring.

2.2. Real-World Scenarios

Cumulative tic-tac-toe features sequential decisions with cumulative payoffs that can arise in various strategic settings. We outline six example scenarios (Figure 1) to illustrate potential analogies in engineering, economics, and social-science applications. These illustrative examples suggest how similar decision dynamics may appear in practice, though they are meant as analogies rather than direct implementations. Cumulative tic-tac-toe serves as a minimalist abstract model that isolates a specific structural property common to these complexities so that we can rigorously analyze the combinatorial nature of cumulative scoring in a controlled, fully observable environment.
Competitive resource allocation (e.g., dynamic spectrum access [54,55]) involves agents sequentially claiming resources that remain occupied for a duration. The cumulative throughput depends on these sequential allocations, analogous to accumulating triplets in the game. In such settings, strategies must balance immediate resource gains against long-term positioning, similar to the trade-offs in cumulative tic-tac-toe.
Iterative coverage tasks (e.g., unmanned aerial vehicle mapping [56]) involve agents sequentially securing grid cells to maximize total coverage. Each secured cell contributes to the overall payoff, much like each triplet contributes to a player’s score in cumulative tic-tac-toe. Effective coverage strategies must balance immediate area gains with preserving future coverage opportunities, reflecting trade-offs analogous to those in the game.
Sequential coalition-building (e.g., forming majority coalitions across successive votes [57]) involves accumulating influence over rounds to meet decision thresholds. Overlapping coalition possibilities can arise as actors join multiple coalitions, analogous to how central cells in the cumulative tic-tac-toe grid participate in multiple triplets. The process of sequentially assembling these coalitions resembles building multiple triplets in the game.
Adversarial spatial games like Go or Hex involve long-term strategic control on a board [58,59]. Cumulative tic-tac-toe is a simpler, fully observable analog that preserves key spatial–strategic dynamics. It provides a tractable testbed for developing and evaluating value-based RL methods or curricula before applying them to more complex games.
Iterative auctions (e.g., spectrum auctions [60]) involve bidders placing sequential bids to assemble complementary bundles. Each bid influences future opportunities, similar to how placing a marker in the game opens or closes potential triplets. The problem of ordering bids to assemble winning bundles is analogous to choosing moves that maximize triplet formation in cumulative tic-tac-toe.
Sequential market-entry models involve firms iteratively expanding into new markets or related sectors [61,62], creating overlapping strategic positions. Building these complementary activities is analogous to forming multiple triplets in cumulative tic-tac-toe, where each move contributes to several potential scoring lines. As in the game, each expansion shapes future opportunities, and success depends on the sequence of moves.
These six scenarios demonstrate how the cumulative tic-tac-toe paradigm captures long-term reward dynamics in strategic decision problems. This framework can be extended to higher-dimensional settings or adapted to reflect more complex real-world conditions. On the other hand, because these real-world scenarios are often too complex for an exact combinatorial solution, practitioners rely on heuristics. The use of RL mirrors how real-world agents learn to approximate optimal strategies in these cumulative-payoff environments when exact calculation is impossible.
We also note that classical game-theoretic analysis typically assumes perfect rationality. That is, every participant is fully informed, never makes errors, and, therefore, selects the strategy that maximizes their eventual payoff in every contingency [20,36]. Real decision-makers are fallible. Slips, misperceptions, and bounded cognitive resources frequently derail optimal play. For example, in classic tic-tac-toe, such departures open the door for the second player to capture a win, even though under flawless play, the best the second player can guarantee is a draw [37]. These empirical upsets do not contradict the formal theoretical results in CGT. They simply underline the behavioral gap between theory and practice.

3. Empirical Methodology

In game theory, an extensive-form representation is a tree-based structure that graphically models sequential decision-making processes [20,36]. The nodes of the tree represent decision points, such as a player’s move, and branches correspond to available actions. These diagrams conventionally depict nodes as solid circles and branches as connecting arrows. Given that cumulative tic-tac-toe is a sequential game, it is natural to illustrate its early moves via such a representation. A complete game tree captures all possible plays, with each branch encoding a unique sequence of decisions. A partial game tree captures only the possible plays at a certain decision point. Computer science has also explored the discrete structure and algorithms associated with game trees [63,64].
While traditional game theory suggests techniques such as backtracking or backward induction to analyze game trees, these approaches become infeasible for complex games due to exponential time requirements. Simulating random play is straightforward, but deriving optimal play is computationally intractable and typically lacks empirical grounding. RL offers a data-driven alternative for approximating or determining optimal strategies. Even when the outcome class (e.g., draw or first-player win) of a game is theoretically established, uncovering optimal strategies remains an open problem. For instance, although Hex is known to be a first-player win [65,66], a complete winning strategy is still unknown [37]. By allowing both players to engage in repeated play through an RL algorithm, we can empirically infer optimal decisions at each decision point. This exemplifies the synergy between CGT and RL in the study of novel game variants.

3.1. Combinatorial Game Theory

Building directly on the extensive-form tree, CGT provides the mathematical foundation for analyzing two-player zero-sum games of perfect information and no chance elements. The CGT methodology can be broadly divided into two areas: the theory of impartial games (e.g., Nim) and the theory of partizan games (e.g., tic-tac-toe) [2,37,67]. In impartial games, all players have the same available moves from any given position. These games are often decomposed into independent subgames, and their values are analyzed using the Sprague–Grundy theorem, which assigns nimbers (values derived from the single-heap version of Nim) to positions in order to characterize outcomes and simplify combinations of games [67,68,69].
Cumulative tic-tac-toe is inherently a partizan game, as the available moves and resulting outcomes depend on which player acts. Consequently, it cannot be analyzed via the Sprague–Grundy framework and instead falls within the scope of partizan or positional games, a subclass of combinatorial games.
In the CGT formulation, a positional game involves two players, Player 1 (first mover) and Player 2, who alternately claim positions on a finite board [52]. The game is governed by a defined set of rules, including an initial position and deterministic transitions based on each move. A move entails selecting an unclaimed position on the board, and a play is a complete sequence of alternating moves terminating in a terminal state. The game’s outcome is fully determined by this terminal state according to fixed rules. Importantly, these games exhibit perfect information, a core assumption in CGT.
Within this framework, the objective is to identify optimal strategies: strategies that allow a player to force a win or guarantee a draw in the absence of a winning path. However, calculating such strategies directly is typically infeasible because of the exponential size of the game tree. Fortunately, the RL approach presented after CGT offers a practical path to discovering these strategies through a data-driven approximation.

3.1.1. Formal Hypergraph Representation

We model cumulative tic-tac-toe as a finite hypergraph game $G = (V, F)$, where
$$V = \{1, 2, \ldots, 9\}$$
labels the nine board cells, and
$$F = \big\{ \{1,2,3\}, \{4,5,6\}, \{7,8,9\}, \{1,4,7\}, \{2,5,8\}, \{3,6,9\}, \{1,5,9\}, \{3,5,7\} \big\} \subseteq 2^V$$
is the collection of the eight triplets (rows, columns, and diagonals) relevant to determining final scores at terminal states.
A complete play of cumulative tic-tac-toe can be viewed as a sequence of moves
$$(x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4, x_5),$$
where $x_i$ (for $i = 1, 2, 3, 4, 5$) is the $i$th move by Player 1 and $y_j$ (for $j = 1, 2, 3, 4$) is the $j$th move by Player 2. Similarly, we index the decision times by $t = 0, 1, \ldots, 8$, so that at time $t$, exactly $t$ moves have already been made. Then, Player 1 moves when $t$ is even ($t = 0, 2, 4, 6, 8$), choosing $x_{k+1}$ at $t = 2k$, and Player 2 moves when $t$ is odd ($t = 1, 3, 5, 7$), choosing $y_\ell$ at $t = 2\ell - 1$.
We denote by $H$ the set of all finite sequences of distinct elements of $V$, i.e.,
$$H = \big\{ (v_1, v_2, \ldots, v_t) : t \ge 0,\ v_i \in V,\ v_i \ne v_j\ \forall\, i \ne j \big\},$$
and for each $t \in \{0, 1, \ldots, 8\}$, let $h_t \in H$ be the history (sequence) of length $t$ just before the decision at time $t$. In particular,
$$h_0 = (\,),\quad h_1 = (x_1),\quad h_2 = (x_1, y_1),\quad h_3 = (x_1, y_1, x_2),\quad \ldots,\quad h_8 = (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4).$$
Thus, Player 1 observes $h_{2k}$ when making their $(k+1)$th move at time $2k$ ($k = 0, 1, 2, 3, 4$), and Player 2 observes $h_{2\ell-1}$ when making their $\ell$th move at time $2\ell - 1$ ($\ell = 1, 2, 3, 4$).
We define the special time $t = 9$ to mark the conclusion of the game. Although the decision times run from $t = 0$ to $t = 8$, the terminal time $t = 9$ serves as a bookkeeping point at which the result of the full play is evaluated. At the terminal history $h_9 = (x_1, y_1, \ldots, x_4, y_4, x_5)$, one defines
$$T_1 = \{ e \in F : e \subseteq \{x_1, x_2, x_3, x_4, x_5\} \}, \qquad T_2 = \{ e \in F : e \subseteq \{y_1, y_2, y_3, y_4\} \},$$
i.e., $T_1$ ($T_2$) is the collection of triplets fully occupied by Player 1 (Player 2). These sets determine the final play outcome under the cumulative scoring rule.
A pure strategy for a player is a function
$$\mathrm{Str} : H_{\mathrm{turn}} \to V,$$
where $H_{\mathrm{turn}} \subseteq H$ is the subset of histories at which it is that player's turn to move. The requirement is that for each $h \in H_{\mathrm{turn}}$,
$$\mathrm{Str}(h) \in V \setminus \{\text{entries of } h\},$$
so that $\mathrm{Str}(h)$ is a legal move not already played.
A particularly useful type of pure strategy in CGT is a pairing strategy. A pairing is a bijection $\rho : M \to N$, where $M, N \subseteq V$, $|M| = |N| = \lfloor |V|/2 \rfloor = 4$, and $M \cap N = \emptyset$. A pairing strategy $\mathrm{PStr}(\cdot)$ then replies to an opponent's move by playing the paired cell, subject to legality. To formalize it, at time $t$, suppose that the opponent's most recent move (in history $h_t$) is $v \in V$. If $v \in M$, reply with $\rho(v)$; if $v \in N$, reply with $\rho^{-1}(v)$; if the paired cell is already occupied or $t = 0$ (first move), choose arbitrarily among the remaining legal moves. Mathematically, one may define
$$\mathrm{PStr}(h_t) = \begin{cases} \rho(v), & \text{if the last move } v \text{ in } h_t \text{ lies in } M \text{ and } \rho(v) \notin h_t, \\ \rho^{-1}(v), & \text{if the last move } v \text{ in } h_t \text{ lies in } N \text{ and } \rho^{-1}(v) \notin h_t, \\ \text{any } m \in V \setminus \{\text{entries of } h_t\}, & \text{otherwise (e.g., the paired cell is occupied or } t = 0\text{).} \end{cases}$$
This ensures that $\mathrm{PStr}(h_t)$ is always a legal move. Specific versions of pairing strategies fix how the arbitrary choices are made (e.g., random or following a fixed priority).
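The pairing reply rule above admits a short Python sketch; the pairing ρ used in the example below is an arbitrary illustrative bijection, not one proposed in the paper:

```python
def pairing_reply(history, pairing):
    """Reply move under a pairing strategy.

    history : tuple of cells (1-9) played so far, alternating players.
    pairing : dict implementing a bijection rho from M to N
              (M and N disjoint 4-cell subsets of the board).
    """
    inverse = {n: m for m, n in pairing.items()}
    occupied = set(history)
    legal = [c for c in range(1, 10) if c not in occupied]
    if history:                        # t > 0: inspect opponent's last move v
        v = history[-1]
        if v in pairing and pairing[v] not in occupied:
            return pairing[v]          # v in M: answer with rho(v)
        if v in inverse and inverse[v] not in occupied:
            return inverse[v]          # v in N: answer with rho^{-1}(v)
    return legal[0]                    # arbitrary fallback: lowest legal cell

rho = {1: 2, 3: 4, 7: 8, 9: 6}         # an arbitrary illustrative pairing
print(pairing_reply((5, 1), rho))      # opponent just took cell 1 -> reply 2
```

The fallback branch corresponds to the "otherwise" case of the formal definition; here it deterministically picks the lowest-indexed legal cell, which is one of the admissible fixed-priority conventions.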

3.1.2. Play Outcomes

A play of cumulative tic-tac-toe is a full sequence of nine distinct moves, alternating between Player 1 and Player 2, that fills the board. At terminal history h 9 , the outcome of a play is determined by comparing the numbers of fully occupied hyperedges for each player:
  • Player 1 wins if | T 1 | > | T 2 | .
  • Player 2 wins if | T 1 | < | T 2 | .
  • Otherwise ( | T 1 | = | T 2 | ), the game is a draw.
Because Player 1 occupies exactly five cells and Player 2 occupies exactly four cells in a complete play, and because any two triplets share at most one cell (so completing two triplets requires at least five cells and completing three would require at least six),
$$0 \le |T_1| \le 2, \qquad 0 \le |T_2| \le 1, \qquad 0 \le |T_1| + |T_2| \le 2.$$
Consequently, the pair ( | T 1 | , | T 2 | ) can take exactly five possible values over all plays (namely, ( 0 , 0 ) , ( 1 , 0 ) , ( 2 , 0 ) , ( 0 , 1 ) , ( 1 , 1 ) ), and the above rules partition these values into play outcomes.
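This five-value claim can be checked by brute force. Since the final score depends only on which five cells Player 1 ends up holding, and every 5/4 split of the board is realized by some alternating play, it suffices to enumerate all C(9,5) = 126 splits. A small Python sketch (our own check, not code from the paper):

```python
from itertools import combinations

# The eight triplets, cells indexed 0-8 row-major.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def reachable_score_pairs():
    """Enumerate (|T1|, |T2|) over every 5/4 split of the nine cells.

    Any such split is realized by some alternating play, so this
    covers all complete plays of cumulative tic-tac-toe.
    """
    pairs = set()
    for p1 in combinations(range(9), 5):
        p2 = set(range(9)) - set(p1)
        t1 = sum(all(c in p1 for c in line) for line in LINES)
        t2 = sum(all(c in p2 for c in line) for line in LINES)
        pairs.add((t1, t2))
    return pairs

print(sorted(reachable_score_pairs()))
# -> [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
```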

3.1.3. Game Determinacy

Before analyzing specific strategies, we first establish that cumulative tic-tac-toe is a determined game: under optimal play, exactly one of the three outcomes (Player 1 wins, Player 2 wins, or a draw) must occur [70]. In particular, we use a classical strategy-stealing argument [71,72,73] to eliminate the possibility of a forced win for Player 2. It follows that Player 1 can guarantee at least a draw. Theorem 1 formalizes this determinacy.
Theorem 1.
Consider cumulative tic-tac-toe on the finite hypergraph G = ( V , F ) . Then, the game is determined in the sense that either Player 1 has a winning strategy or both players have drawing strategies.
Proof. 
The proof appears in Appendix A.1.    □

3.1.4. Game Outcomes

We refine our understanding of determinacy by classifying the game according to stronger notions of forced outcomes and the existence or nonexistence of a single universal pairing strategy that guarantees a draw. Informally, even if a game is a draw under optimal play, one may ask whether there is a single pairing function ρ such that the pairing strategy ensures a draw for whichever player employs it, regardless of how the opponent plays (subject to legality). We introduce the following outcome classes for cumulative tic-tac-toe:
  • Player 1 trivial win: Regardless of how both players behave, every play ends with | T 1 | > | T 2 | . Equivalently, every play is a Player 1 win.
  • Player 1 forced win without draw possibility: Under optimal play by both, every play ends with | T 1 | > | T 2 | . There exist suboptimal plays where | T 1 | < | T 2 | , but no play ends with | T 1 | = | T 2 | . Equivalently, Player 1 has a forced win, and no play whatsoever can end in a draw.
  • Player 1 forced win but draws possible if suboptimal: Player 1 has a winning strategy, so under optimal play, we always have | T 1 | > | T 2 | , yet there exist suboptimal plays where | T 1 | ≤ | T 2 | .
  • Strong draw (no single universal pairing strategy draw): Under optimal play by both, the result is always a draw ( | T 1 | = | T 2 | ), but there is no single pairing function ρ such that whichever player uses the corresponding pairing-based reply strategy can force a draw against every possible opponent reply.
  • Pairing strategy draw: Under optimal play by both parties, the game is always a draw, and there exists a single pairing function ρ such that if Player 1 (Player 2) employs the pairing reply strategy based on ρ , then irrespective of how the opponent plays legally, the result is a draw. In other words, the same ρ yields a draw-forcing strategy for either role.
According to earlier analyses of play outcomes, cumulative tic-tac-toe is not in the first two classes. We now locate it in one of the latter three classes and show the following:
Theorem 2.
Cumulative tic-tac-toe is a strong draw.
Proof. 
The proof appears in Appendix A.2.    □
From the CGT perspective, the fact that no single pairing strategy suffices shows that deriving an explicit optimal policy by pure combinatorial arguments is highly nontrivial. These combinatorial observations identify specific structural features that correlate strongly with outcome: in particular, the number and distribution of open triplets determine the player’s options for future triplet completion. We transfer this CGT insight into the RL pipeline by designing the TCD evaluation function. Concretely, TCD quantifies the difference in open triplet counts favorable to the current player and is used to initialize the tabular value function. This CGT–RL mapping preserves the theoretical understanding of which positions are strong while enabling data-driven policy discovery in spaces where an exhaustive combinatorial solution is onerous.
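As an illustration of this CGT-to-RL transfer, one plausible reading of the TCD heuristic, counting the triplets still open to each player (i.e., containing no opponent mark) and taking the difference from Player 1's perspective, can be sketched as follows; the exact definition used in the paper's experiments may differ:

```python
# The eight triplets, cells indexed 0-8 row-major.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def tcd(board):
    """Triplet-coverage difference from Player 1's perspective (a sketch).

    board: tuple of 9 values, +1 for Player 1, -1 for Player 2, 0 empty.
    A triplet is 'open' for a player if it contains no opponent mark.
    """
    open_p1 = sum(all(board[c] != -1 for c in line) for line in LINES)
    open_p2 = sum(all(board[c] != +1 for c in line) for line in LINES)
    return open_p1 - open_p2

empty = (0,) * 9
print(tcd(empty))                          # 0: symmetric position
centre = (0, 0, 0, 0, 1, 0, 0, 0, 0)
print(tcd(centre))                         # 4: the centre lies on 4 triplets
```

Such a score could seed a tabular value function by mapping the heuristic range onto initial value estimates, giving TD learning an informed starting point rather than a random one.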

3.2. Reinforcement Learning

To analyze cumulative tic-tac-toe through both CGT and RL, we must express the game in representations familiar to each framework. In the RL setting, we represent the game as a finite MDP, allowing CGT concepts to map cleanly onto RL notation and enabling data-driven discovery of optimal play. With the game cast as an MDP, we introduce one-step TD learning as our core value estimation method. To accelerate convergence, we present domain-informed evaluation functions, particularly the TCD, and detail our TD algorithm, which seeds its value table with the heuristic. While one-step TD itself is a standard RL technique, our contribution lies in a CGT-guided initialization: we propose and evaluate the TCD evaluation function to seed the tabular value function, thereby leveraging structural information from CGT to accelerate TD learning in cumulative tic-tac-toe. Finally, we discuss several alternative evaluation functions and explain how they can further improve learning efficiency.

3.2.1. Markov Decision Process

An MDP formalizes sequential decision-making under uncertainty. An agent interacts with an environment over discrete time steps. At each step, the agent observes the current state, selects an action, receives a numerical reward (possibly contingent on the chosen action), and transitions to a new state.
These interactions generate a sequence, or trajectory, of states, actions, and rewards. If a trajectory ends in a designated terminal state, the sequence resets, and a new trajectory begins from the initial state. A complete sequence from the start state to the terminal state constitutes an episode.
Crucially, each action influences not only the immediate reward but also future states and, hence, future rewards. The agent, therefore, seeks to maximize the cumulative reward over the entire episode rather than the reward at the next step alone. To do so, it follows a policy, a mapping from states to probability distributions over available actions. Given a policy, the state-value function assigns to every state the expected return (cumulative reward) obtained by following that policy until termination.
Different policies induce different state-value functions. The objective is to approximate the optimal state-value function, which is defined by acting optimally at every decision point. Estimating these values allows the agent to attribute long-term consequences to individual decisions and converge on strategies that maximize the expected return.

3.2.2. Formal MDP Representation

We now cast cumulative tic-tac-toe as a two-agent zero-sum MDP in which the control alternates between Player 1 and Player 2. In practice, one may train a single self-play algorithm, but it is equivalent to viewing one player as the agent and the other as part of the environment.
States and Actions
Let the discrete decision times be $t = 0, 1, \ldots, 8$, where, at time $t$, exactly $t$ cells are occupied. We similarly define the special time $t = 9$ as the terminal time. Denote by
$$S_t = (s_t^1, s_t^2, \ldots, s_t^9) \in \mathcal{S} = \{-1, 0, 1\}^9$$
the board configuration at time $t$, using the encoding
$$s_t^i = \begin{cases} 1, & \text{cell } i \text{ occupied by Player 1}, \\ -1, & \text{cell } i \text{ occupied by Player 2}, \\ 0, & \text{cell } i \text{ unoccupied}. \end{cases}$$
The turn to move is implicit: if $\sum_{i=1}^{9} s_t^i = 0$, it is Player 1's turn; if $\sum_{i=1}^{9} s_t^i = 1$, it is Player 2's turn; terminal states satisfy $\sum_{i=1}^{9} |s_9^i| = 9$.
At each nonterminal $S_t$, the legal action set is
$$\mathcal{A}(S_t) = \{ i : s_t^i = 0 \} \subseteq \{1, 2, \ldots, 9\},$$
and choosing $A_t = i \in \mathcal{A}(S_t)$ yields
$$S_{t+1} = \mathrm{Update}(S_t, i),$$
where $\mathrm{Update}(S_t, i)$ sets $s_{t+1}^i = 1$ if Player 1 moves or $-1$ if Player 2 moves, and leaves the other cells unchanged. Since there is no chance, each transition is deterministic:
$$P(S_{t+1} \mid S_t, A_t) \in \{0, 1\},$$
where $P(\cdot \mid S_t, A_t)$ denotes the conditional probability measure over all possible successor states given the current state and action.
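The state encoding, turn convention, legal action set, and deterministic update admit a direct implementation; the following is a minimal sketch under these conventions (cells indexed 0–8, function names our own):

```python
def to_move(state):
    """+1 if it is Player 1's turn, -1 for Player 2 (cell-sum convention)."""
    return 1 if sum(state) == 0 else -1

def legal_actions(state):
    """Indices of empty cells."""
    return [i for i in range(9) if state[i] == 0]

def update(state, i):
    """Deterministic transition: place the mover's mark in empty cell i."""
    assert state[i] == 0, "illegal move"
    nxt = list(state)
    nxt[i] = to_move(state)
    return tuple(nxt)

s0 = (0,) * 9
s1 = update(s0, 4)             # Player 1 takes the centre
print(to_move(s1))             # -> -1: Player 2 to move
print(len(legal_actions(s1)))  # -> 8
```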
Rewards
All intermediate rewards satisfy $R_t = 0$ for $t < 9$. At the terminal step $t = 9$, we assign the zero-sum payoffs for Player 1 and Player 2:
$$R_9^1 = \begin{cases} +1, & |T_1| > |T_2|, \\ 0, & |T_1| = |T_2|, \\ -1, & |T_1| < |T_2|, \end{cases} \qquad \text{and} \qquad R_9^2 = -R_9^1.$$
Thus, a complete episode yields the trajectory
$$S_0, A_0, S_1, A_1, \ldots, S_8, A_8, (R_9^1, R_9^2), S_9.$$
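The terminal payoff reduces to counting completed three-in-a-rows for each mark and taking the sign of their difference. A minimal sketch, assuming the same 0-based board encoding; `LINES`, `count_triplets`, and `terminal_reward_p1` are our illustrative names, not the paper's code:

```python
# The eight triplets (winning lines) on a 0-based 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def count_triplets(state, mark):
    """Number of completed three-in-a-rows for the given mark (+1 or -1)."""
    return sum(1 for a, b, c in LINES if state[a] == state[b] == state[c] == mark)

def terminal_reward_p1(state):
    """Zero-sum terminal payoff R_9^1 for Player 1; Player 2 receives the negation."""
    t1, t2 = count_triplets(state, 1), count_triplets(state, -1)
    return (t1 > t2) - (t1 < t2)  # +1, 0, or -1
```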
Policies and Value Functions
A policy (possibly stochastic) for Player 1 is $\pi^1 : \mathcal{S} \to \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ denotes the set of all probability distributions over the legal action set $\mathcal{A}$, and similarly Player 2 has $\pi^2$. We combine them into a joint policy
$$\pi(A_t \mid S_t) = \begin{cases} \pi^1(A_t \mid S_t), & \text{if } \sum_{i=1}^{9} s_t^i = 0, \\ \pi^2(A_t \mid S_t), & \text{if } \sum_{i=1}^{9} s_t^i = 1. \end{cases}$$
Under $\pi$, the state value for Player 1 is
$$v_\pi^1(S_t) = \mathbb{E}_\pi\!\left( R_9^1 \mid S_t \right),$$
where the expectation $\mathbb{E}_\pi(\cdot \mid S_t)$ is taken over all trajectories generated by following $\pi$ from state $S_t$ until the terminal time $t = 9$. For Player 2, $v_\pi^2(S_t) = -v_\pi^1(S_t)$ because cumulative tic-tac-toe is a zero-sum game. Since all the rewards occur at $t = 9$, these values satisfy the deterministic Bellman recursion
$$v_\pi^1(S_t) = \begin{cases} \sum_{a \in \mathcal{A}(S_t)} \pi^1(a \mid S_t)\, v_\pi^1(\mathrm{Update}(S_t, a)), & \text{if } \sum_{i=1}^{9} s_t^i = 0, \\ \sum_{b \in \mathcal{A}(S_t)} \pi^2(b \mid S_t)\, v_\pi^1(\mathrm{Update}(S_t, b)), & \text{if } \sum_{i=1}^{9} s_t^i = 1. \end{cases}$$
Optimality
Because the game is zero-sum, a saddle-point minimax solution exists, consistent with the CGT analysis. Denote the optimal value for each player by
$$v_*^1(S_t) = \max_{\pi^1} \min_{\pi^2} v_\pi^1(S_t) \qquad \text{and} \qquad v_*^2(S_t) = -v_*^1(S_t).$$
The Bellman optimality equation for Player 1 is
$$v_*^1(S_t) = \max_{a \in \mathcal{A}(S_t)} \; \min_{b \in \mathcal{A}(\mathrm{Update}(S_t, a))} v_*^1\big(\mathrm{Update}(\mathrm{Update}(S_t, a), b)\big).$$
Solving this MDP by RL yields policies π * 1 and π * 2 that achieve the minimax value. This computational approach complements the combinatorial game-theoretic guarantee that the game is a draw under optimal play, while providing an explicit procedure to approximate those optimal policies.
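Because the state space is tiny, the Bellman optimality recursion can even be solved exactly by memoized backward induction. The following sketch (our own code and 0-based indexing, not the paper's implementation) computes $v_*^1$ directly; per the CGT draw result, it should evaluate the empty board to 0:

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

@lru_cache(maxsize=None)
def minimax_value(s):
    """Exact v_*^1(s) by backward induction; s is a 9-tuple over {-1, 0, 1}."""
    empty = [i for i, c in enumerate(s) if c == 0]
    if not empty:  # terminal: sign of |T1| - |T2|
        t1 = sum(all(s[i] == 1 for i in L) for L in LINES)
        t2 = sum(all(s[i] == -1 for i in L) for L in LINES)
        return (t1 > t2) - (t1 < t2)
    mark = 1 if sum(s) == 0 else -1  # whose turn, from the parity rule
    vals = []
    for a in empty:
        child = list(s)
        child[a] = mark
        vals.append(minimax_value(tuple(child)))
    return max(vals) if mark == 1 else min(vals)
```

Memoization caps the work at the number of distinct reachable boards, so the full solve runs in well under a second.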

3.2.3. Temporal-Difference Learning

Classical solutions to RL problems, most notably DP, can, in principle, compute exact optimal values by solving the Bellman equations for v π . In practice, this demands a complete model of the environment, including the opponent’s transition probabilities at every decision node, and incurs an exponential search cost that quickly becomes prohibitive. An alternative is to learn from experience. In MC methods, we estimate v π ( s ) by averaging the returns observed on every first visit to state s while following policy π and denote this empirical estimate by V ( s ) . Owing to the strong law of large numbers, V ( s ) converges almost surely to v π ( s ) as the number of first visits tends to infinity. However, MC updates occur only when each episode terminates.
TD learning combines the online updates of MC with the bootstrapping of DP. After every step, TD methods adjust the current state’s value toward the estimated value of its successor, thus learning directly from interaction without a full model. Because of its balance of efficiency and accuracy, we adopt a one-step TD method for cumulative tic-tac-toe. Function approximation methods in RL can also be readily applied when the state or action space is too large or continuous in complex, real-world scenarios.
Initialization
With a tabular representation, we store two separate estimated value functions, one per player:
$$V^1, V^2 : \mathcal{S} \to [-1, 1],$$
where $V^1(s)$ denotes the estimated terminal payoff for Player 1 starting from state s, and $V^2(s)$ denotes the estimated terminal payoff for Player 2. In the RL literature, every nonterminal state is typically initialized to an arbitrary value; we instead introduce a more informative heuristic that accelerates learning.
Action Selection
During training, both players repeatedly simulate play. At each move, they choose among the legal actions:
  • Exploitation: select a greedy action whose successor state currently has the highest estimated value, with ties broken arbitrarily.
  • Exploration: with probability ϵ , select a random action, thereby refining estimates for suboptimal moves that may conceal tactical traps.
Let $|\mathcal{A}(s)|$ denote the number of legal moves in state s. Then, under an $\epsilon$-greedy policy, the greedy move is chosen with probability
$$1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|},$$
and each non-greedy move with probability $\epsilon / |\mathcal{A}(s)|$.
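An $\epsilon$-greedy selector matching these probabilities can be written in a few lines; the function name and dict-based interface below are our own illustrative choices:

```python
import random

def epsilon_greedy_action(successor_values, epsilon, rng=random):
    """successor_values: dict mapping each legal action to the estimated value
    of the state it leads to. With probability epsilon, explore uniformly at
    random; otherwise exploit the argmax. The greedy action therefore has
    total selection probability 1 - epsilon + epsilon / |A(s)|."""
    actions = list(successor_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=successor_values.get)
```

Setting `epsilon=0` recovers pure exploitation, which is how the frozen policies of Experiment 2 are played.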
Value Update
Because all intermediate rewards vanish, the one-step TD update simplifies. When Player 1 acts in $S_t$ and reaches successor $S_{t+1}$, we update Player 1's value estimate of $S_t$ by
$$V^1(S_t) \leftarrow V^1(S_t) + \alpha \left[ V^1(S_{t+1}) - V^1(S_t) \right],$$
and symmetrically for Player 2,
$$V^2(S_t) \leftarrow V^2(S_t) + \alpha \left[ V^2(S_{t+1}) - V^2(S_t) \right],$$
where $\alpha \in (0, 1]$ is a step-size parameter. At terminal states, we set
$$V^1(s_{\mathrm{terminal}}) = R_9^1 \in \{+1, 0, -1\} \qquad \text{and} \qquad V^2(s_{\mathrm{terminal}}) = -V^1(s_{\mathrm{terminal}})$$
so that terminal estimates equal the true terminal returns.
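The backup amounts to a single line per move. A minimal dictionary-based sketch (the function name and table representation are ours):

```python
def td_update(V, s, s_next, alpha):
    """One-step TD backup with zero intermediate reward:
    V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[s] += alpha * (V[s_next] - V[s])
```

Repeated application geometrically shrinks the gap between $V(s)$ and the successor estimate, which is the bootstrapping behavior described above.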

3.2.4. Evaluation Functions

The game tree of cumulative tic-tac-toe is already very large, and future variants of the game and their practical applications would only enlarge it. Various compression techniques exist, most notably merging board configurations that are equivalent under symmetry, but different move orders can reach the same configuration, so perfect deduplication requires nontrivial bookkeeping, and the resulting machinery is seldom portable to other games.
A more general remedy is to seed learning with an evaluation function. Heuristics based on evaluation functions are common in chess engines and other supervised settings [74,75,76,77], but their systematic use in RL remains comparatively underexplored [49]. An evaluation function assigns an informative initial value, instead of an arbitrary one, to every nonterminal state, guiding exploration and accelerating convergence.
Because unsolved games lack closed-form value formulas, the shape of an evaluation function is chosen empirically. One inserts a candidate score, probabilistic or ordinal, into the RL loop and measures performance. For instance, a state might be initialized with the estimated win probability under random play or with a signed advantage score where positive values favor the current player and negative values favor the opponent. Note that the design of evaluation functions depends heavily on various factors such as the game-theoretical properties, MDP formulation, and appropriate RL solution methods.
Triplet-Coverage Difference
CGT analysis shows that configurations with many open triplets, from either player's perspective, are intrinsically strong. Assume we are at a particular board state s. Let X (Y) denote the set of cells occupied by Player 1 (Player 2). If it is Player 1's turn, define
$$d_{\mathrm{TCD}}(s) = \#\{\text{triplets containing no cell in } Y\} - \#\{\text{triplets containing no cell in } X\}.$$
We call $d_{\mathrm{TCD}}(s)$ the triplet-coverage difference (TCD). A positive $d_{\mathrm{TCD}}(s)$ means that Player 1 has more ways to complete a triplet in the given board configuration; a negative $d_{\mathrm{TCD}}(s)$ favors Player 2. From Player 2's perspective, the evaluation is simply $-d_{\mathrm{TCD}}(s)$.
For consistency with the MDP we constructed, whose value range is $[-1, 1]$, we normalize $d_{\mathrm{TCD}}(s)$ via its theoretical bounds $4$ (best) and $-2$ (worst):
$$\tilde{d}_{\mathrm{TCD}}(s) = \frac{d_{\mathrm{TCD}}(s) - 1}{3} \in [-1, 1].$$
We then initialize the two value tables for each nonterminal state s by setting
$$V^1(s) \leftarrow \begin{cases} \tilde{d}_{\mathrm{TCD}}(s), & \text{if Player 1 is to move in } s, \\ -\tilde{d}_{\mathrm{TCD}}(s), & \text{if Player 2 is to move in } s, \end{cases} \qquad V^2(s) \leftarrow -V^1(s).$$
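The TCD and its normalization follow directly from the eight winning lines. A small sketch under our 0-based indexing (the helper names are ours, not the paper's):

```python
# The eight triplets (winning lines), 0-based cell indexing.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def tcd(state):
    """Triplet-coverage difference: lines free of Player 2 marks minus
    lines free of Player 1 marks (cells hold +1 / -1 / 0)."""
    open_p1 = sum(1 for L in LINES if all(state[i] != -1 for i in L))
    open_p2 = sum(1 for L in LINES if all(state[i] != 1 for i in L))
    return open_p1 - open_p2

def tcd_normalized(state):
    """Map the theoretical range [-2, 4] affinely onto [-1, 1]."""
    return (tcd(state) - 1) / 3
```

For example, after Player 1 takes the center of an empty board, all eight lines stay open for Player 1 while only the four lines avoiding the center stay open for Player 2, so the TCD hits its maximum of 4 and normalizes to 1.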

3.2.5. Algorithm

We now describe a one-step TD algorithm that uses TCD as an informative prior for all nonterminal states while assigning each terminal state its exact payoff ($+1$, $0$, or $-1$, depending on $|T_1|$ and $|T_2|$). At each time step, the current player i selects moves according to an $\epsilon$-greedy policy derived from the estimated state-value function $V^i$, exploiting the highest-valued successor most of the time but occasionally exploring random actions, and then immediately bootstraps $V^i(S_t)$ toward the current estimate $V^i(S_{t+1})$ of the successor state. The complete algorithm pseudocode is provided as Algorithm 1, and the corresponding flowchart is shown in Figure A1.
Algorithm 1: One-Step TD with TCD Initialization
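The pseudocode itself appears as an image in the published version. As a rough stand-alone rendering of the same loop, here is a self-play sketch in Python; the helper names (`mover`, `seed`, `episode`) and 0-based indexing are ours, so treat this as an illustration rather than the authors' exact code:

```python
import random

# The eight winning lines on a 0-based 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def mover(s):
    """+1 if Player 1 is to move (board sum 0), -1 if Player 2 (board sum 1)."""
    return 1 if sum(s) == 0 else -1

def legal(s):
    return [i for i, c in enumerate(s) if c == 0]

def step(s, a):
    t = list(s)
    t[a] = mover(s)
    return tuple(t)

def payoff(s):
    """Terminal payoff for Player 1: sign of |T1| - |T2|."""
    t1 = sum(all(s[i] == 1 for i in L) for L in LINES)
    t2 = sum(all(s[i] == -1 for i in L) for L in LINES)
    return (t1 > t2) - (t1 < t2)

def seed(V1, V2, s):
    """Lazy initialization: exact payoff at terminal states, normalized TCD elsewhere."""
    if s in V1:
        return
    if not legal(s):
        V1[s] = payoff(s)
    else:
        o1 = sum(all(s[i] != -1 for i in L) for L in LINES)
        o2 = sum(all(s[i] != 1 for i in L) for L in LINES)
        d = (o1 - o2 - 1) / 3              # normalized TCD in [-1, 1]
        V1[s] = d if mover(s) == 1 else -d
    V2[s] = -V1[s]

def episode(V1, V2, eps=0.05, alpha=0.5, rng=random):
    """One self-play episode of one-step TD with epsilon-greedy moves."""
    s = (0,) * 9
    seed(V1, V2, s)
    while legal(s):
        V = V1 if mover(s) == 1 else V2    # the acting player's table
        acts = legal(s)
        for a in acts:                     # make sure successors are seeded
            seed(V1, V2, step(s, a))
        if rng.random() < eps:
            a = rng.choice(acts)
        else:
            a = max(acts, key=lambda a: V[step(s, a)])
        nxt = step(s, a)
        V[s] += alpha * (V[nxt] - V[s])    # bootstrap toward successor estimate
        s = nxt
    return payoff(s)
```

Running `episode` repeatedly with the paper's $(\epsilon^*, \alpha^*) = (0.05, 0.50)$ reproduces the training regimen in miniature; the convergence monitoring described in Section 4.2 would wrap around this loop.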

3.2.6. Alternative Evaluation Functions

Beyond the basic TCD evaluation function, several refinements can, in principle, further accelerate convergence.
Weighted Triplet-Coverage Difference
The weighted triplet-coverage difference (WTCD) evaluation function is a modified version of TCD. Because the center cell participates in four triplets, it is more valuable than a corner (three triplets) or an edge (two triplets). Let each triplet be weighted by the sum of the weights of its constituent cells (e.g., center $= 2$, edges $= 1.5$, corners $= 1$). Define X and Y as in TCD. If it is Player 1's turn at state s, the WTCD
$$d_{\mathrm{WTCD}}(s) = \sum \text{weights of triplets containing no cell in } Y \; - \; \sum \text{weights of triplets containing no cell in } X$$
is then normalized to $\tilde{d}_{\mathrm{WTCD}}(s) \in [-1, 1]$ before it is written to the value table, following the same procedure as for TCD. For Player 2, the weighted evaluation is again $-d_{\mathrm{WTCD}}(s)$. This bias toward central control guides learning more quickly toward strong play.
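A sketch of the unnormalized WTCD under the stated cell weights; the 0-based indexing and names (`CELL_W`, `wtcd`) are our illustrative conventions:

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

# Cell weights (0-based): center cell 4 -> 2, edge cells 1,3,5,7 -> 1.5, corners -> 1.
CELL_W = {i: 2.0 if i == 4 else 1.5 if i in (1, 3, 5, 7) else 1.0 for i in range(9)}

def line_weight(L):
    return sum(CELL_W[i] for i in L)

def wtcd(state):
    """Weighted TCD: total weight of lines still open for Player 1 minus
    total weight of lines still open for Player 2."""
    w1 = sum(line_weight(L) for L in LINES if all(state[i] != -1 for i in L))
    w2 = sum(line_weight(L) for L in LINES if all(state[i] != 1 for i in L))
    return w1 - w2
```

Compared with plain TCD, occupying the center now raises the score by more than occupying a corner, which is exactly the bias toward central control described above.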
Two-Step Look-Ahead
This heuristic evaluates both the current board and the tactical impact of each player's next move. For every empty cell, we hypothetically place a marker for the acting player and count the resulting triplets. Let $d_{\mathrm{current}}(s)$ be the present TCD (or WTCD) at state s, and let $d_{\mathrm{next}}(s)$ be the maximal TCD (or WTCD) achievable in the next state of s by either side. We define
$$d_{2\mathrm{SLA}}(s) = d_{\mathrm{current}}(s) + d_{\mathrm{next}}(s)$$
to be the two-step look-ahead (2SLA) heuristic and again rescale it to $[-1, 1]$. By incorporating short-term foresight, the agent prioritizes states that are most likely to yield additional triplets. The method generalizes readily to a k-step look-ahead (KSLA) for higher-dimensional variants, but such depth is unnecessary for this problem.
These heuristics are illustrative rather than exhaustive. Any cheaply computed signal that correlates with long-run returns can serve as an evaluation function, provided that the extra overhead does not outweigh the savings in training time. Thus, the use of evaluation functions in RL can be intrinsically powerful depending on the application.

4. Results

We now turn from theoretical analysis to an empirical evaluation of our framework. Our experiments are designed to answer three key questions:
1. Effect of evaluation functions. How much does seeding RL with domain-informed heuristics (in particular, TCD) accelerate convergence compared with a naive random initialization from the RL literature?
2. Consistency with CGT. Once converged, do two frozen policies produce draws when played head-to-head, matching the CGT determinacy result?
3. Evaluation against human opponents. How does the converged policy fare against human opponents, and does it confirm the theoretical first-player/second-player draw outcome in practice?
Table 2 summarizes the results of all the experiments.

4.1. Execution Environment and Reproducibility

All the experiments were run on Microsoft Windows 11 Home (10.0.26100 Build 26100). The host CPU was an Intel Core Ultra 7 155H @ 1.40 GHz (16 cores, 22 threads) with 32 GB of RAM. The graphics acceleration (when used) came from an Intel Arc integrated GPU with approximately 2 GB of dedicated memory. The code was developed in Python 3.13.1, with the key libraries NumPy 2.2.1 and SciPy 1.16.1. Random seeds (121–130) were fixed to ensure reproducibility. Reproduction instructions and the exact code/data are provided in the repository cited in the Data Availability Statement.

4.2. Experiment 1: Effect of the Evaluation Functions

In Experiment 1, we measured how quickly TD converges when seeded with arbitrary initialization and TCD. To avoid bias, we first tuned the core RL hyperparameters under a zero-initialized baseline ( V 1 ( s ) = V 2 ( s ) = 0 for all nonterminal s S ) in Stage 1. These fixed hyperparameters were then used to compare random initialization against the TCD heuristic in Stage 2.

Stage 1: hyperparameter tuning under zero initialization

All training runs were executed as independent episodes of self-play. Each seed was run until the convergence criterion was met or until a hard upper bound of 1,000,000 episodes was reached. Convergence was declared when the empirical sample variances of the cumulative win and draw rates over a sliding window of length $W = 20{,}000$ episodes fell below $\delta = 10^{-8}$. Mathematically speaking, define the estimated cumulative Player 1 win rate after n episodes by
$$\hat{p}_1(n) = \frac{\#\ \text{Player 1 wins up to episode } n}{n},$$
and similarly, the estimated cumulative draw rate by
$$\hat{p}_d(n) = \frac{\#\ \text{draws up to episode } n}{n}.$$
After every episode, we appended the tuple $(\hat{p}_1(n), \hat{p}_d(n))$ to a history buffer. When $n \ge W$ episodes had elapsed, we extracted the last W entries and formed the two sequences of length W corresponding to $\hat{p}_1$ and $\hat{p}_d$. We then computed the empirical sample variances of these two sequences, denoted $s_{1,W}^2$ and $s_{d,W}^2$. Convergence was declared at the first episode for which both sample variances were below the threshold $\delta$ (i.e., $s_{1,W}^2 < \delta$ and $s_{d,W}^2 < \delta$). Note that Player 2's cumulative win rate is implicit since $\hat{p}_2(n) = 1 - \hat{p}_1(n) - \hat{p}_d(n)$, so monitoring $\hat{p}_1$ and $\hat{p}_d$ suffices to detect stabilization of all three outcome proportions.
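The sliding-window check can be implemented in a few lines. A sketch with our own function name, using the sample variance from Python's standard library:

```python
from statistics import variance

def converged(history, W=20000, delta=1e-8):
    """history: per-episode tuples (p1_hat, pd_hat) of cumulative win/draw rates.
    Declare convergence once both sample variances over the last W entries
    fall below delta."""
    if len(history) < W:
        return False
    p1 = [h[0] for h in history[-W:]]
    pd = [h[1] for h in history[-W:]]
    return variance(p1) < delta and variance(pd) < delta
```

Calling this after every episode, as described above, terminates training at the first episode where both rate sequences have flattened.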
This procedure ensures that stabilization is required for both players before termination. Such a tuning setup ensures that any observed speed-ups in Stage 2 stem from initialization rather than hyperparameter differences. We performed a grid search over
$$\epsilon \in \{0.05, 0.10, 0.20, 0.30\}, \qquad \alpha \in \{0.05, 0.10, 0.20, 0.50\},$$
running five independent seeds per pair (seeds 121–125). For each pair, we recorded the per-seed episode count at which the convergence criterion was first met, then computed the sample mean and sample standard deviation across the five seeds. We selected ( ϵ * , α * ) = ( 0.05 , 0.50 ) because it attained one of the smallest mean convergence times while maintaining acceptably low variability. Table 3 provides these results, and we present the raw seed-by-seed ( ϵ , α ) tuple data in Table A1. All subsequent comparisons use ( ϵ * , α * ) .

Stage 2: initialization comparison (random vs. TCD)

With ϵ * = 0.05 and α * = 0.50 fixed, we compared two initialization schemes:
  • Random initialization (baseline): For all nonterminal $s \in \mathcal{S}$,
$$V^1(s), V^2(s) \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(-1, 1).$$
  • TCD heuristic: For each nonterminal state s, we computed the normalized TCD $\tilde{d}_{\mathrm{TCD}}(s) \in [-1, 1]$ for the player to move. We then initialized the tables by setting
$$V^1(s) \leftarrow \begin{cases} \tilde{d}_{\mathrm{TCD}}(s), & \text{if Player 1 is to move in } s, \\ -\tilde{d}_{\mathrm{TCD}}(s), & \text{if Player 2 is to move in } s, \end{cases} \qquad V^2(s) \leftarrow -V^1(s).$$
For each initialization scheme (random vs. TCD), we ran five new independent seed runs (seeds 126–130) and recorded the episodes required to reach the convergence criterion (again, sample variances below $\delta = 10^{-8}$ for both the cumulative win and draw rates over a sliding window of $W = 20{,}000$ episodes). The random initialization group produced a sample mean of 190,087 episodes to convergence with a sample standard deviation of 24,216, while the TCD group produced a sample mean of 146,262 with a sample standard deviation of 32,020. The results are recorded in Table 4, and raw seed-by-seed data are presented in Table A2.
Next, we verified key assumptions required and conducted a standard two-sample (Student’s) t test on the per-seed convergence times to establish statistical significance:
  • Independence check: By design, each seed run is an independent experiment, so the two groups (random vs. TCD) are independent.
  • Normality checks (Shapiro–Wilk test): The Shapiro–Wilk test [78] tests the null hypothesis that the data are from a normal distribution. We have p-value = 0.9382 for the random initialization sample and p-value = 0.6096 for the TCD sample. We fail to reject the normality assumption for both samples at the 5% level of significance.
  • Equal-variance check (Levene’s test): Levene’s test [79] tests the null hypothesis that all input samples are from populations with equal variances. We have p-value = 0.7088. We fail to reject the homoscedasticity assumption at the 5% level.
  • Two-sample t test: After confirming the above assumptions, we can proceed with a standard t test. A two-sample t test [80] is a test for the null hypothesis that two independent samples have identical means, assuming that the populations have identical variances. We have an upper one-sided p-value = 0.0203. We reject the null that the two population means are the same, in favor of a significant reduction at the level of 5%.
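The three checks map directly onto SciPy calls. The sketch below uses hypothetical placeholder numbers for the per-seed convergence times (the real seed-by-seed data are in Table A2 and the project repository), so the resulting p-values are illustrative only:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed episodes-to-convergence, five seeds per group.
rand_runs = np.array([165000., 178000., 191000., 201000., 215000.])
tcd_runs = np.array([112000., 131000., 148000., 160000., 180000.])

sw_rand = stats.shapiro(rand_runs).pvalue       # normality, random group
sw_tcd = stats.shapiro(tcd_runs).pvalue         # normality, TCD group
lev = stats.levene(rand_runs, tcd_runs).pvalue  # equal variances across groups
tt = stats.ttest_ind(rand_runs, tcd_runs,
                     equal_var=True,            # pooled-variance (Student's) t test
                     alternative='greater').pvalue  # one-sided: random > TCD
```

The `alternative='greater'` option gives the upper one-sided p-value directly, matching the hypothesis that random initialization needs more episodes than TCD.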
Practically, this corresponds to an average reduction of 23.1% in episodes to convergence, which translates to substantial computational savings in repeated training scenarios. We expect WTCD and 2SLA heuristics to improve efficiency further, although their computational cost must be weighed against this benefit.
To obtain explicit policies from the converged estimates, we froze the learned value tables V 1 and V 2 and derived deterministic greedy policies π * 1 and π * 2 . Concretely, for each reachable nonterminal state s, we first determined the player to move and then selected the greedy action according to that player’s table:
$$\pi_*^1(s) = \underset{a \in \mathcal{A}(s)}{\arg\max}\; V^1(\mathrm{Update}(s, a)) \qquad \text{and} \qquad \pi_*^2(s) = \underset{a \in \mathcal{A}(s)}{\arg\max}\; V^2(\mathrm{Update}(s, a)),$$
where only the relevant policy π * i is consulted when player i is to move in s. Ties among maximizers were resolved randomly by selecting any action leading to the highest value.

4.3. Experiment 2: Consistency with Combinatorial Game Theory

The second experiment tests whether the frozen deterministic greedy policies yield 100% draws in head-to-head play. For each random seed between 126 and 130 and each initialization scheme, after convergence we froze the learned policies with no exploration ($\epsilon = 0$) and played 100 head-to-head matches between the two frozen converged policies $\pi_*^1$ and $\pi_*^2$. All 100 plays ended in draws, in exact agreement with the CGT determinacy result.

4.4. Experiment 3: Evaluation Against Human Opponents

The third experiment evaluates the learned policy against human participants in both roles. We invited volunteers to play as Player 1 against the converged TCD policy. In every trial, the human participants failed to win. Repeating the experiment with participants assigned to Player 2 yielded the same outcome (0% win rate), further corroborating that the learned policy attains the game-theoretic draw.

5. Discussion

Building on the canonical 3 × 3 tic-tac-toe, we develop a novel cumulative tic-tac-toe framework and analyze it through the complementary lenses of CGT and RL. We first formalize the variant and outline six real-world scenarios, and then establish its key theoretical properties within the CGT framework. Next, we cast the game as an MDP and validate optimal play via one-step TD learning. To accelerate learning, we seeded the estimated state-value function with the domain-informed TCD heuristic, yielding a statistically significant 23.1% reduction in convergence episodes relative to a random initialization while preserving the draw outcome predicted by CGT in our head-to-head and human tests. Convergence was declared when the sample variances of cumulative win and draw rates fell below $\delta = 10^{-8}$ over a $W = 20{,}000$-episode sliding window, ensuring genuine stability rather than transient fluctuations.
We emphasize that our operational convergence criterion signals empirical stabilization of the training process across episodes and seeds. It does not constitute a formal mathematical proof of global optimality. In other words, the criterion indicates that the policy is stable under the chosen training regimen, and, in our tests, the stabilized policies produced the theorized draw outcome, but a mathematical guarantee of optimality would require additional analysis (e.g., proving monotone improvement and eventual convergence to the minimax value under the given learning rates and exploration schedule).
The present study employs a full tabular representation of the state-value function. Scaling to larger boards will require function-approximation techniques (e.g., linear architectures or deep neural networks) and more systematic hyperparameter optimization. All hyperparameter values in our prototype were chosen empirically since no formal guidelines exist at this stage. Our limited sweep over 16 pilot combinations (each averaged over five random seeds) yielded one representative setting ( ϵ * , α * ) . A more comprehensive grid search or Bayesian optimization on all hyperparameters, evaluated over five to ten seeds per setting, is planned to substantiate and stabilize these choices. In future work, we will investigate WTCD and 2SLA to determine if their computational overhead is outweighed by further speed-ups. We believe that incorporating domain-informed evaluation functions is a promising direction in RL, with potential for substantial improvements in training efficiency and stability.
Our combined CGT–RL approach could also be extended to multiplayer, stochastic, or imperfect-information games. These generalizations introduce new challenges (e.g., uncertainty and hidden information), but provide richer, more realistic testbeds for future study. Future work will also tailor the cumulative tic-tac-toe framework to specific application domains by adapting the board, action set, and scoring rules. For example, we plan to apply the model to strategic resource allocation and multi-agent planning to assess how well cumulative tic-tac-toe captures domain-specific features and to establish benchmark problems for CGT and RL research.
Last but not least, we used tabular one-step TD as a proof-of-concept in this study. Alternative RL approaches, such as Q-learning, MC control, actor–critic methods, and deep neural-network-based RL, have different empirical trade-offs in convergence speed, sample efficiency, and generalization. A systematic benchmark should compare these algorithms on cumulative tic-tac-toe, with and without CGT-informed evaluation initializations, using metrics such as episodes-to-convergence, stability across seeds, and computational cost (wall-clock time and per-episode complexity). Related evaluations in other decision problems provide guidance and suggest that algorithm choice interacts nontrivially with problem structure and initialization.

Author Contributions

Conceptualization, K.L. and W.Z.; methodology, K.L.; software, K.L.; validation, K.L.; formal analysis, K.L.; investigation, K.L. and W.Z.; resources, K.L.; data curation, K.L.; writing—original draft preparation, K.L.; writing—review and editing, K.L. and W.Z.; visualization, K.L.; supervision, K.L.; project administration, K.L.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

W.Z.’s work is partly supported by the National Science Foundation under Grant NRT-HDR 2125295. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Data Availability Statement

All code, generated data, and supplementary outputs required to reproduce the experiments and results in this paper are openly available at the project repository: https://github.com/Garylikai/cumulative-tictactoe (commit d17c7ee; accessed on 9 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
2SLA	Two-step look-ahead
CGT	Combinatorial game theory
DP	Dynamic programming
i.i.d.	Independent and identically distributed
KSLA	k-step look-ahead
MC	Monte Carlo
MDP	Markov decision process
RL	Reinforcement learning
TCD	Triplet-coverage difference
TD	Temporal-difference
WTCD	Weighted triplet-coverage difference

Appendix A. Proofs of Theorems

Appendix A.1. Proof of Theorem 1

Proof. 
We first argue by standard alternation of quantifiers that exactly one of three cases holds: Player 1 wins, Player 2 wins, or both draw. We then eliminate the second case via a strategy-stealing argument.
1. General determinacy.
Suppose that Player 1 has a winning strategy. Mathematically, in cumulative tic-tac-toe, this means that
$$\exists x_1\, \forall y_1\, \exists x_2\, \forall y_2\, \exists x_3\, \forall y_3\, \exists x_4\, \forall y_4\, \exists x_5 : |T_1| > |T_2|.$$
Since Player 1's winning strategy exists, it can be characterized generally by
$$\mathrm{Str}(h_{2k}) = x_{k+1} \in V \setminus \{\text{entries of } h_{2k}\}$$
for all $k = 0, 1, 2, 3, 4$. By similar arguments, Player 2 having a winning strategy means that
$$\forall x_1\, \exists y_1\, \forall x_2\, \exists y_2\, \forall x_3\, \exists y_3\, \forall x_4\, \exists y_4\, \forall x_5 : |T_1| < |T_2|,$$
and Player 2's winning strategy is characterized generally by
$$\mathrm{Str}(h_{2\ell - 1}) = y_\ell \in V \setminus \{\text{entries of } h_{2\ell - 1}\}$$
for all $\ell = 1, 2, 3, 4$. By De Morgan's law, if neither Player 1 nor Player 2 has a winning strategy, then it must be the case that
$$\neg\big[ (\exists x_1\, \forall y_1\, \cdots\, \exists x_5 : |T_1| > |T_2|) \lor (\forall x_1\, \exists y_1\, \cdots\, \forall x_5 : |T_1| < |T_2|) \big]$$
$$\iff (\forall x_1\, \exists y_1\, \cdots\, \forall x_5 : |T_1| \le |T_2|) \land (\exists x_1\, \forall y_1\, \cdots\, \exists x_5 : |T_1| \ge |T_2|),$$
such that (i) Player 1 loses or it is a draw, and (ii) Player 2 loses or it is a draw. This implies that both players have a drawing strategy that forces a draw in cumulative tic-tac-toe, since we assume that neither player prefers losing. Such drawing strategies for both players can be guaranteed in a general sense by
$$x_1 = \mathrm{Str}(h_0),\; y_1,\; x_2 = \mathrm{Str}(h_2),\; y_2,\; \ldots,\; y_4,\; x_5 = \mathrm{Str}(h_8),$$
and
$$x_1,\; y_1 = \mathrm{Str}(h_1),\; x_2,\; y_2 = \mathrm{Str}(h_3),\; \ldots,\; y_4 = \mathrm{Str}(h_7),\; x_5.$$
2. Eliminate Player 2 wins.
We now show that Player 2 cannot have a winning strategy in cumulative tic-tac-toe. Suppose, for contradiction, that Player 2 has a strategy $\mathrm{Str}_2$ guaranteeing $|T_1| < |T_2|$ no matter how Player 1 plays. We construct from $\mathrm{Str}_2$ a strategy $\mathrm{Str}_1$ for Player 1 that guarantees $|T_1| > |T_2|$, contradicting the assumption that Player 2's strategy always wins.
On Player 1's first move, $\mathrm{Str}_1$ plays an arbitrary placeholder cell $m_1$. We remark that the extra mark $m_1$ cannot worsen Player 1's final triplet comparison: adding cells to Player 1's occupied set cannot decrease the number of triplets completed by Player 1 and cannot create new triplets for Player 2 (Player 2's occupied set is unchanged). Hence, any placeholder marks can only increase Player 1's possibilities of forming new triplets or reduce Player 2's, never the reverse.
Concretely, after the initial move $x_1 = m_1$, let Player 2 reply with $y_1$ according to $\mathrm{Str}_2(m_1)$. Now Player 1 ignores $m_1$ when consulting $\mathrm{Str}_2$ and regards $y_1$ as if it were Player 2's first move in a reduced play where Player 1 moves next. Thus, Player 1 chooses $\mathrm{Str}_2(y_1)$. Thereafter, whenever it is Player 1's turn at time 2k, they consult $\mathrm{Str}_2$ as if occupying Player 2's role in the original strategy: given the history (excluding the ignored $m_1$ where needed), they pick $\mathrm{Str}_2(h_{2k} \setminus m_1)$. If $\mathrm{Str}_2(h_{2k} \setminus m_1)$ coincides with an already-occupied cell, Player 1 plays an arbitrary unused placeholder cell $m_2$ and continues the simulation. By the monotonicity remark above, such placeholders cannot harm Player 1's final comparison; they act only as non-harmful markers that preserve the intended simulation of $\mathrm{Str}_2$. Thus, the invariant holds: Player 1 effectively follows $\mathrm{Str}_2$ with roles reversed.
Because $\mathrm{Str}_2$ was assumed to guarantee $|T_1| < |T_2|$ when used by the second player in the reduced play, the player simulating $\mathrm{Str}_2$ in that reduced play obtains strictly more triplets than their opponent in the reduced setting. Mapping this back to the real game, in which the simulator is Player 1 and roles are reversed relative to the reduced play, yields $|T_1| > |T_2|$ in the real game. Hence, $\mathrm{Str}_1$ is a winning strategy for Player 1, contradicting the assumption that Player 2 had a guaranteed win. Therefore, Player 2 cannot have a winning strategy. □

Appendix A.2. Proof of Theorem 2

Proof. 
We must show two things: Under optimal play by both, the outcome is always a draw ( | T 1 | = | T 2 | ), and there is no single pairing function ρ that serves as a draw-forcing reply strategy for both players in all circumstances. To reduce the number of cases in our study, we exploit board symmetries (rotations and reflections) and adopt the following three assumptions, each of which aligns with a player’s best interest:
  • Each player completes a triplet if possible.
  • Each player intercepts the opponent from completing a triplet in their next move.
  • If both completing a triplet and intercepting the opponent’s threat are possible in their next move, the player may choose either action.
1. Existence of draw-forcing strategies for both players.
Here, we exhibit explicit draw-forcing reply strategies for each player (possibly using different pairings or opening moves). These guarantee | T 1 | = | T 2 | under best replies.
Player 1's draw-forcing strategy. Opening move: mark the center cell ($x_1 = 5$). This is a natural choice since the center participates in four triplets and blocks many threats.
Pairing reply thereafter: Partition the remaining eight cells into four disjoint pairs so that whenever Player 2 plays in one cell of a pair, Player 1 replies in the paired cell. One convenient choice (up to symmetry) is as follows:
$$\rho : \{1, 3, 4, 8\} \to \{2, 6, 7, 9\}, \qquad \rho(1) = 2,\; \rho(3) = 6,\; \rho(4) = 7,\; \rho(8) = 9.$$
Concretely: after the initial move $x_1 = 5$, whenever Player 2 plays y, Player 1 replies at the next decision time with $\rho(y)$ if $y \in \{1, 3, 4, 8\}$, or $\rho^{-1}(y)$ if $y \in \{2, 6, 7, 9\}$.
To see that this scheme prevents either player from completing more triplets than the other, observe the following invariant, which is preserved after every Player 1 reply: immediately after Player 1's reply move, each paired cell set $\{y, \rho(y)\}$ or $\{y, \rho^{-1}(y)\}$ contains either (i) zero marks of Player 1 and zero marks of Player 2, or (ii) one mark of Player 1 and one mark of Player 2 (if Player 2 just played in y and Player 1 replied in $\rho(y)$ or $\rho^{-1}(y)$). In particular, Player 2 never occupies both cells of a paired set without Player 1 occupying its partner on the immediately following move. Because every possible triplet (hyperedge) of the 3 × 3 board for Player 2 intersects at least one such pair, completion of a triplet by Player 2 would require Player 2 to occupy every cell of some triplet before Player 1's corrective reply, which the invariant rules out. Similarly, Player 1 cannot complete a triplet uncontested, given that Player 2's earlier or intervening plays, together with the pairing replies, ensure that at least one required non-center cell is either already occupied by Player 2 or will be occupied as part of the mirroring response.
Consequently, under the center opening and the above pairing reply, no triplet ever becomes a unilateral completion; the final outcome satisfies $|T_1| = |T_2| = 0$ under perfect mirroring. Although the outcome $|T_1| = |T_2| = 1$ also constitutes a draw, the above pairing strategy cannot produce that case, given the assumption of intercepting the opponent from forming a triplet whenever possible. Thus, Player 1 can force a draw by this single pairing strategy (given the center opening).
Player 2's draw-forcing strategies. Player 2 cannot control the opening cell of Player 1, so Player 2 must adapt their reply strategy based on Player 1's first move. By symmetry of the board, there are three nonequivalent cases for Player 1's opening: $x_1$ is the center (5); $x_1$ is a corner (e.g., 1); $x_1$ is a side (e.g., 2).
In each case, Player 2 chooses a best response opening (to block maximally) and then uses a suitable pairing on the remaining cells to mirror Player 1’s subsequent moves, preventing Player 1 from completing strictly more triplets than Player 2.
Case 1: x₁ = 5 (center). A natural best response is y₁ = 1 (corner), blocking three triplets. Then partition the remaining cells {2, 3, 4, 6, 7, 8, 9} (excluding 5 and 1) into three pairs plus one extra cell, and mirror Player 1's moves within those pairs. One explicit pairing (up to symmetry) is:
ρ : {2, 3, 4} → {8, 7, 6}, ρ(2) = 8, ρ(3) = 7, ρ(4) = 6.
After y₁ = 1, whenever Player 1 plays x ∈ {2, 3, 4, 6, 7, 8}, reply with ρ(x) or ρ⁻¹(x). If Player 1 ever plays the extra cell (9 in this case), Player 2 picks an arbitrary legal cell (which, by the pairing structure, cannot allow Player 1 to complete more triplets) and continues pairing on subsequent moves.
To justify this claim, note that Player 1 can never occupy both members of a pair {x, ρ(x)} or {x, ρ⁻¹(x)}: after Player 1's first move in a pair, Player 2 immediately occupies the partner on the next turn. Hence, Player 1 cannot accumulate two of the three required cells of any triplet that is covered by such a pair. The single extra cell is chosen so that every triplet containing it draws its remaining cells from at least one pair. If those remaining cells come from a single pair, any attempt by Player 1 to use the extra cell to complete a triplet is neutralized by the invariant that Player 1 cannot capture a paired set outright; if they come from two different pairs (one cell from each), the pairing results in both players completing exactly one triplet each. Thus, |T₁| = |T₂| at termination in every case.
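The parity claim at the end of Case 1 can be checked mechanically. The point reflection σ(c) = 10 − c, which extends the pairing ρ(2) = 8, ρ(3) = 7, ρ(4) = 6 by 1 ↔ 9, is an automorphism of the triplet set, and every triplet avoiding both opening cells (5 for Player 1, 1 for Player 2) reflects onto a triplet through Player 2's cell 1; this is why a triplet completed by Player 1 via the extra cell is matched by one completed by Player 2. A short check (cells numbered 1–9 row by row):

```python
# The eight scoring lines of the board, as frozensets for easy comparison.
TRIPLETS = [(1, 2, 3), (4, 5, 6), (7, 8, 9),
            (1, 4, 7), (2, 5, 8), (3, 6, 9),
            (1, 5, 9), (3, 5, 7)]
LINES = {frozenset(t) for t in TRIPLETS}

def reflect(line):
    # 180-degree point reflection of a cell set: c -> 10 - c.
    return frozenset(10 - c for c in line)

# (1) The reflection maps every scoring line to a scoring line.
is_automorphism = all(reflect(t) in LINES for t in LINES)

# (2) Every line avoiding both opening cells 5 and 1 reflects onto a
#     line that passes through Player 2's cell 1.
open_lines = {t for t in LINES if not {1, 5} & set(t)}
matched = all(1 in reflect(t) for t in open_lines)
```

Only two lines avoid both opening cells, (7, 8, 9) and (3, 6, 9), and they reflect onto (1, 2, 3) and (1, 4, 7), respectively.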
Case 2: x₁ = 1 (corner). Best response y₁ = 5 (center). Then pair among {2, 3, 4, 6, 7, 8, 9}. For example:
ρ : {2, 4, 6} → {3, 7, 8}, ρ(2) = 3, ρ(4) = 7, ρ(6) = 8,
with extra cell 9. Mirror Player 1's subsequent moves via this ρ, respond arbitrarily if Player 1 plays the extra cell, and continue pairing.
An invariant argument similar to Case 1 applies: mirroring ensures that Player 1 never occupies both members of any paired set, since Player 2 occupies the partner immediately afterward, and the single extra cell is chosen so that Player 2's reply to it, whether arbitrary or an intercept, triggers a series of intercepts by both players. Consequently, under the assumption that each player intercepts the opponent's imminent triplet whenever possible, the pairing prevents Player 1 from obtaining strictly more triplets than Player 2.
Case 3: x₁ = 2 (side). Best response y₁ = 5 (center). Pair among {1, 3, 4, 6, 7, 8, 9}. For instance:
ρ : {1, 4, 6} → {3, 7, 9}, ρ(1) = 3, ρ(4) = 7, ρ(6) = 9,
with extra cell 8. Applying the same mirroring strategy, and replying to any extra-cell move with an arbitrary safe response, ensures that |T₁| = |T₂|.
In each of these three cases, Player 2 has a pairing-based draw-forcing reply strategy. Thus, Player 2 can force a draw, although the pairing ρ depends on Player 1’s opening. This completes the demonstration that both players have draw-forcing strategies.
2. Nonexistence of a single universal pairing ρ for both roles.
For the game to be a pairing-strategy draw in our sense, there would need to exist one pairing function ρ on the nine cells such that (i) Player 1, after a suitable (or even arbitrary) opening, forces a draw against any Player 2 replies by ρ-mirroring, and (ii) Player 2, whichever of the nine cells Player 1 opens in, also forces a draw against any Player 1 continuation using the same ρ-mirroring reply scheme.
However, the case analyses above show that Player 1's draw-forcing pairing is tailored to the center-first opening, mirroring among four specific pairs, and that this same pairing does not yield a draw-forcing strategy for Player 2 when Player 1 opens elsewhere. Assuming a single universal bijection ρ therefore contradicts the distinct explicit pairings required for draw-forcing in the different opening scenarios, so no one bijection ρ can serve as a draw-forcing reply for both players in all circumstances.
Combining the above two parts, cumulative tic-tac-toe is a forced-draw game under optimal play, and yet it lacks a single universal pairing strategy usable by both players. Therefore, it is a strong draw but not a pairing-strategy draw. □
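The forced-draw conclusion is also small enough to verify by brute force, independently of the pairing argument: the game tree has at most 9! play sequences, and a memoized minimax over board states (with terminal value +1/0/−1 by the sign of |T₁| − |T₂| on the full board) confirms that the optimal value is a draw. A sketch:

```python
from functools import lru_cache

TRIPLETS = [(1, 2, 3), (4, 5, 6), (7, 8, 9),
            (1, 4, 7), (2, 5, 8), (3, 6, 9),
            (1, 5, 9), (3, 5, 7)]

def score(board):
    """Sign of |T1| - |T2| for a full board (dict cell -> 1 or 2)."""
    t = [0, 0, 0]
    for a, b, c in TRIPLETS:
        if board[a] == board[b] == board[c]:
            t[board[a]] += 1
    return (t[1] > t[2]) - (t[1] < t[2])   # +1 P1 wins, 0 draw, -1 P2 wins

@lru_cache(maxsize=None)
def value(state):
    """Minimax value of a position; state is a 9-tuple of 0/1/2 (cell 1..9)."""
    if 0 not in state:  # play continues until the board is completely filled
        return score({i + 1: state[i] for i in range(9)})
    player = 1 if state.count(1) == state.count(2) else 2
    children = [value(state[:i] + (player,) + state[i + 1:])
                for i in range(9) if state[i] == 0]
    return max(children) if player == 1 else min(children)
```

Evaluating `value((0,) * 9)` from the empty board returns 0, matching the theorem.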

Appendix B. Algorithm Flowchart

Figure A1. Flowchart of the one-step TD algorithm with TCD initialization. Inputs are the step size α ∈ (0, 1], the exploration rate ϵ ∈ [0, 1), and two initial value functions V₁, V₂ : S → [−1, 1], initialized by TCD for nonterminal states and by +1/0/−1 according to |T₁| vs. |T₂| for terminal states. The outer loop repeats for each training episode; the inner loop repeats for each step of the episode until a terminal state is reached. At each step, the player i to move is determined and a sample u ∼ Uniform(0, 1) is drawn: if u < 1 − ϵ, the agent exploits by selecting A = arg max over a ∈ A(S) of Vᵢ(Update(S, a)), with ties broken arbitrarily; otherwise, the agent explores by choosing A uniformly at random. After executing A and observing the successor state S′ = Update(S, A), the value function of Player i is updated by the one-step TD rule Vᵢ(S) ← Vᵢ(S) + α[Vᵢ(S′) − Vᵢ(S)], and the current state S is set to the successor state S′. The episode terminates when S is terminal. See Algorithm 1 for the full pseudocode.
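The update and action-selection steps in the flowchart can be written compactly as follows. This is a minimal sketch rather than the paper's implementation: the names are illustrative, and `update` stands in for the successor function Update(S, a).

```python
import random

def td_update(V, s, s_next, alpha):
    # One-step TD rule from Figure A1: V(S) <- V(S) + alpha * [V(S') - V(S)].
    # Unvisited states default to 0.0 here (the paper instead initializes
    # nonterminal values by TCD).
    v, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
    V[s] = v + alpha * (v_next - v)

def epsilon_greedy(V, state, actions, update, eps, rng):
    # With probability 1 - eps, exploit: choose the action whose successor
    # state has the highest estimated value (ties broken by max's ordering);
    # otherwise explore uniformly at random.
    if rng.random() < 1.0 - eps:
        return max(actions, key=lambda a: V.get(update(state, a), 0.0))
    return rng.choice(actions)
```

In self-play, each player i applies `td_update` to its own table Vᵢ after every transition, and `epsilon_greedy` selects the move for the player whose turn it is.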

Appendix C. Raw Convergence Data for Hyperparameter Tuning and Heuristic Comparison

Table A1. Raw episodes-to-convergence for each (ϵ, α) pair and each random seed (Seeds 121–125) using the zero-initialization baseline. Convergence is defined by the sample variances of the cumulative win and draw rates both being less than δ = 10⁻⁸ over a sliding window of W = 20,000 episodes.
ϵ | α | Seed 121 | Seed 122 | Seed 123 | Seed 124 | Seed 125
0.05 | 0.05 | 310,033 | 257,279 | 145,606 | 258,198 | 324,347
0.05 | 0.10 | 387,648 | 154,423 | 300,944 | 167,846 | 213,967
0.05 | 0.20 | 205,384 | 128,614 | 214,257 | 193,277 | 232,569
0.05 | 0.50 | 178,748 | 160,856 | 210,678 | 169,766 | 169,192
0.10 | 0.05 | 219,702 | 310,615 | 371,382 | 212,151 | 632,958
0.10 | 0.10 | 481,303 | 234,146 | 388,509 | 213,843 | 369,202
0.10 | 0.20 | 245,052 | 220,414 | 262,556 | 167,146 | 204,613
0.10 | 0.50 | 196,265 | 205,237 | 177,817 | 206,053 | 156,525
0.20 | 0.05 | 782,386 | 579,277 | 476,076 | 585,719 | 762,779
0.20 | 0.10 | 351,709 | 377,120 | 334,538 | 439,090 | 292,565
0.20 | 0.20 | 180,459 | 204,137 | 206,206 | 160,858 | 191,421
0.20 | 0.50 | 170,843 | 212,443 | 171,355 | 233,151 | 237,589
0.30 | 0.05 | 731,808 | 562,911 | 711,081 | 767,505 | 826,691
0.30 | 0.10 | 337,431 | 319,809 | 295,909 | 362,175 | 401,717
0.30 | 0.20 | 237,782 | 186,920 | 190,416 | 182,962 | 216,065
0.30 | 0.50 | 261,807 | 216,197 | 211,378 | 285,348 | 275,726
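The convergence criterion used throughout these tables (both sliding-window sample variances below δ) can be sketched as follows; the window and series lengths in the usage assertions are shrunk only to keep the example fast.

```python
from statistics import variance  # sample variance (n - 1 denominator)

def episodes_to_convergence(win_rates, draw_rates, window=20_000, delta=1e-8):
    """First episode t at which the sample variances of the cumulative win
    rate and the cumulative draw rate over the trailing `window` episodes
    are both below `delta`; None if the criterion is never met."""
    for t in range(window, len(win_rates) + 1):
        if (variance(win_rates[t - window:t]) < delta and
                variance(draw_rates[t - window:t]) < delta):
            return t
    return None
```

A series that is exactly constant over the window converges as soon as the window fills; a persistently oscillating series never does.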
Table A2. Raw episodes-to-convergence for the comparison between the random baseline and TCD under the selected hyperparameters ϵ* = 0.05 and α* = 0.50 for each random seed (Seeds 126–130). Convergence is defined by the sample variances of the cumulative win and draw rates both being less than δ = 10⁻⁸ over a sliding window of W = 20,000 episodes.
Initialization | Seed 126 | Seed 127 | Seed 128 | Seed 129 | Seed 130
Random Baseline | 224,547 | 200,676 | 162,123 | 189,229 | 173,859
TCD Heuristic | 195,185 | 149,118 | 112,231 | 123,535 | 151,241
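The reported significance of the baseline-vs.-TCD comparison can be reproduced from the raw seeds in Table A2 with a pooled two-sample Student's t-test; this sketch assumes equal variances, so with n₁ = n₂ = 5 the statistic has 8 degrees of freedom and the two-sided 5% critical value is about 2.306.

```python
from math import sqrt
from statistics import mean, variance

# Raw episodes-to-convergence from Table A2 (Seeds 126-130).
baseline = [224547, 200676, 162123, 189229, 173859]
tcd = [195185, 149118, 112231, 123535, 151241]

def pooled_t(x, y):
    """Two-sample Student's t statistic under the equal-variance assumption."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

t_stat = pooled_t(baseline, tcd)
# t_stat exceeds 2.306, consistent with the reported significance at the
# 5% level; the sample means reproduce the 190,087 and 146,262 of Table 4.
```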

Figure 1. Six real-world analogues of the cumulative tic-tac-toe framework, spanning competitive resource allocation, multi-agent planning, coalition-building in voting, adversarial spatial reinforcement learning (RL), auction mechanisms, and market positioning. The central 3 × 3 grid is a schematic cumulative tic-tac-toe board (X and O denote the two players’ marks); surrounding icons are illustrative.
Table 1. A detailed table summarizing the similarities and differences between the classic tic-tac-toe and cumulative tic-tac-toe.
Aspect | Classic Tic-Tac-Toe | Cumulative Tic-Tac-Toe
Termination | Play ends immediately upon a completed triplet or a full board. | Play continues until all cells are occupied.
Win Condition | First to complete a triplet wins; if the board fills with no triplet, it is a draw. | The player with the higher total number of triplets wins; an equal total yields a draw.
Play Length | At most nine moves; may end earlier when someone wins. | Exactly nine moves (all cells filled).
Outcome Possibilities | Win, draw, or lose. | Win, draw, or lose.
Strategy Complexity | Relatively simple; the game is solved, and optimal play leads to a draw. | Higher complexity, as each placement can contribute to multiple triplets; potentially richer mid- and endgame decisions.
Game Solved Status | Solved as a draw under optimal play. | Not previously solved; an explicit optimal strategy had remained unknown.
Table 2. A summary table highlighting how the triplet-coverage difference (TCD) evaluation function accelerates training without performance degradation, and how RL and human–machine matches confirm the theoretical draw outcome.
Experiment | Setup | Metric | Outcome
1. Effect of evaluation functions | One-step temporal-difference (TD) learning, ϵ* = 0.05, α* = 0.50; win and draw rates tracked over a sliding window of W = 20,000 episodes | Episodes to convergence (sample variances of cumulative win and draw rates less than δ = 10⁻⁸) | Random-baseline average: 190,087 episodes; TCD-heuristic average: 146,262 episodes (−23.1%)
2. Consistency with combinatorial game theory (CGT) | Frozen policies (no exploration); 100 head-to-head games | Fraction of draws | 100% draws (matches the CGT prediction of a draw under optimal play)
3. Evaluation against human opponents | Human participants vs. the converged TCD policy (tested as Player 1 and Player 2) | Human win rate | 0% human wins across all trials (consistent with the optimal RL draw policy and CGT theory)
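The TCD initialization in Experiment 1 can be illustrated with a sketch of a triplet-coverage difference. The formula below is an assumed reading for illustration (lines still open to Player 1, i.e., containing no Player 2 mark, minus lines still open to Player 2, scaled by the 8 lines so the value lies in [−1, 1]); the paper's exact definition is given in the main text.

```python
TRIPLETS = [(1, 2, 3), (4, 5, 6), (7, 8, 9),
            (1, 4, 7), (2, 5, 8), (3, 6, 9),
            (1, 5, 9), (3, 5, 7)]

def tcd(board):
    """Hypothetical triplet-coverage difference for a nonterminal state.

    board: dict cell -> 0 (empty), 1 (Player 1), or 2 (Player 2).
    Returns a value in [-1, 1], positive when more lines remain open
    to Player 1 than to Player 2.
    """
    open1 = sum(all(board[c] != 2 for c in t) for t in TRIPLETS)
    open2 = sum(all(board[c] != 1 for c in t) for t in TRIPLETS)
    return (open1 - open2) / len(TRIPLETS)
```

On the empty board the value is 0; after Player 1 takes the center, 8 lines remain open to Player 1 but only 4 to Player 2, giving 0.5.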
Table 3. Sample mean (± sample standard deviation) number of episodes to convergence under zero initialization for each (ϵ, α) pair (five independent seeds). Convergence is declared when the sample variances of cumulative win and draw rates over a 20,000-episode sliding window fall below 10⁻⁸. All values are rounded to the nearest integer.
ϵ \ α | 0.05 | 0.10 | 0.20 | 0.50
0.05 | 259,093 (±70,243) | 244,966 (±98,232) | 194,820 (±39,686) | 177,848 (±19,414)
0.10 | 349,362 (±171,793) | 337,401 (±112,093) | 219,956 (±36,977) | 188,379 (±21,125)
0.20 | 637,247 (±131,163) | 359,004 (±54,332) | 188,616 (±18,676) | 205,076 (±32,437)
0.30 | 719,999 (±98,152) | 343,408 (±40,623) | 202,829 (±23,446) | 250,091 (±34,223)
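The selection of ϵ* = 0.05 and α* = 0.50 follows directly from averaging the raw seeds of Table A1: that pair attains the smallest mean episodes-to-convergence. A reproduction sketch:

```python
from statistics import mean

# Raw episodes-to-convergence from Table A1, keyed by (epsilon, alpha).
RAW = {
    (0.05, 0.05): [310033, 257279, 145606, 258198, 324347],
    (0.05, 0.10): [387648, 154423, 300944, 167846, 213967],
    (0.05, 0.20): [205384, 128614, 214257, 193277, 232569],
    (0.05, 0.50): [178748, 160856, 210678, 169766, 169192],
    (0.10, 0.05): [219702, 310615, 371382, 212151, 632958],
    (0.10, 0.10): [481303, 234146, 388509, 213843, 369202],
    (0.10, 0.20): [245052, 220414, 262556, 167146, 204613],
    (0.10, 0.50): [196265, 205237, 177817, 206053, 156525],
    (0.20, 0.05): [782386, 579277, 476076, 585719, 762779],
    (0.20, 0.10): [351709, 377120, 334538, 439090, 292565],
    (0.20, 0.20): [180459, 204137, 206206, 160858, 191421],
    (0.20, 0.50): [170843, 212443, 171355, 233151, 237589],
    (0.30, 0.05): [731808, 562911, 711081, 767505, 826691],
    (0.30, 0.10): [337431, 319809, 295909, 362175, 401717],
    (0.30, 0.20): [237782, 186920, 190416, 182962, 216065],
    (0.30, 0.50): [261807, 216197, 211378, 285348, 275726],
}

means = {pair: mean(v) for pair, v in RAW.items()}
best = min(means, key=means.get)  # fastest average convergence
```

The minimum mean, 177,848 episodes, occurs at (ϵ, α) = (0.05, 0.50), matching Table 3.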
Table 4. Sample mean and sample standard deviation of the number of episodes to convergence under ϵ* = 0.05, α* = 0.50, comparing random initialization vs. the TCD heuristic (five new independent seeds). Convergence is declared when the sample variances of cumulative win and draw rates over a 20,000-episode window fall below 10⁻⁸. All values are rounded to the nearest integer except the percentage reduction.
Initialization | Mean Episodes | Standard Deviation | Reduction vs. Random
Random initialization (baseline) | 190,087 | 24,216 |
TCD heuristic | 146,262 | 32,020 | −23.1%

Li, K.; Zhu, W. Combinatorial Game Theory and Reinforcement Learning in Cumulative Tic-Tac-Toe via Evaluation Functions. Stats 2026, 9, 28. https://doi.org/10.3390/stats9020028

