Combinatorial Game Theory and Reinforcement Learning in Cumulative Tic-Tac-Toe via Evaluation Functions
Abstract
1. Introduction
2. Cumulative Tic-Tac-Toe
2.1. Rules of the Game
2.2. Real-World Scenarios
3. Empirical Methodology
3.1. Combinatorial Game Theory
3.1.1. Formal Hypergraph Representation
3.1.2. Play Outcomes
- Player 1 wins if $t_1 > t_2$, where $t_i$ denotes the total number of triplets completed by Player $i$ at the end of play.
- Player 2 wins if $t_2 > t_1$.
- Otherwise, $t_1 = t_2$ and the game is a draw.
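To make the cumulative scoring rule concrete, the following minimal sketch counts completed triplets over the eight winning lines and classifies a full board. The board encoding (a length-9 tuple with 0 = empty, 1 = Player 1, 2 = Player 2) is illustrative rather than taken from the paper.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def triplet_totals(board):
    """Return (t1, t2): completed triplets for Players 1 and 2."""
    t1 = sum(1 for line in LINES if all(board[i] == 1 for i in line))
    t2 = sum(1 for line in LINES if all(board[i] == 2 for i in line))
    return t1, t2

def outcome(board):
    """Classify a fully occupied board under the cumulative rule."""
    t1, t2 = triplet_totals(board)
    if t1 > t2:
        return "Player 1 wins"
    if t2 > t1:
        return "Player 2 wins"
    return "Draw"

# Example: Player 1 holds the top row and the main diagonal (t1 = 2, t2 = 0).
full_board = (1, 1, 1,
              2, 1, 2,
              2, 2, 1)
print(triplet_totals(full_board), outcome(full_board))  # (2, 0) Player 1 wins
```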
3.1.3. Game Determinacy
3.1.4. Game Outcomes
- Player 1 trivial win: Regardless of how both players behave, every play ends with $t_1 > t_2$. Equivalently, every play is a Player 1 win.
- Player 1 forced win without draw possibility: Under optimal play by both, every play ends with $t_1 > t_2$. However, there exist suboptimal plays where $t_2 > t_1$, but none where $t_1 = t_2$. Equivalently, Player 1 has a forced win, and no play can end in a draw.
- Player 1 forced win but draws possible if suboptimal: Player 1 has a winning strategy, so under optimal play we always have $t_1 > t_2$, yet there exist suboptimal plays where $t_1 = t_2$.
- Strong draw (no single universal pairing-strategy draw): Under optimal play by both, the result is always a draw ($t_1 = t_2$), but there is no single pairing function $\pi$ such that whichever player uses the corresponding pairing-based reply strategy can force a draw against every possible opponent reply.
- Pairing strategy draw: Under optimal play by both parties, the game is always a draw, and there exists a single pairing function $\pi$ such that if Player 1 (or Player 2) employs the pairing reply strategy based on $\pi$, then irrespective of how the opponent plays legally, the result is a draw. In other words, the same $\pi$ yields a draw-forcing strategy for either role.
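Since cumulative tic-tac-toe is a finite two-player game of perfect information, which of the above outcome classes actually obtains can be decided by exhaustive backward induction in the spirit of Zermelo [73]. The sketch below (an illustration, not the authors' code) memoizes the optimal-play value of the final margin $t_1 - t_2$ over all reachable positions; a value of 0 from the empty board corresponds to the draw under optimal play that Experiment 2 later confirms empirically.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def final_margin(board):
    """t1 - t2 on a full board: positive is a Player 1 win, 0 a draw."""
    t1 = sum(all(board[i] == 1 for i in line) for line in LINES)
    t2 = sum(all(board[i] == 2 for i in line) for line in LINES)
    return t1 - t2

@lru_cache(maxsize=None)
def game_value(board, player):
    """Optimal-play value of t1 - t2; Player 1 maximizes, Player 2 minimizes."""
    if 0 not in board:
        return final_margin(board)
    values = [game_value(board[:i] + (player,) + board[i + 1:], 3 - player)
              for i, cell in enumerate(board) if cell == 0]
    return max(values) if player == 1 else min(values)

print(game_value((0,) * 9, 1))  # 0 indicates a draw under optimal play
```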
3.2. Reinforcement Learning
3.2.1. Markov Decision Process
3.2.2. Formal MDP Representation
States and Actions
Rewards
Policies and Value Functions
Optimality
3.2.3. Temporal-Difference Learning
Initialization
Action Selection
- Exploitation: select a greedy action whose successor state currently has the highest estimated value, with ties broken arbitrarily.
- Exploration: with probability $\varepsilon$, select a random action, thereby refining estimates for suboptimal moves that may conceal tactical traps.
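A compact sketch of this $\varepsilon$-greedy rule over successor states; the value table `V` and the helper `successor(state, move)` are hypothetical stand-ins for whatever state representation the implementation uses:

```python
import random

def select_action(state, legal_moves, V, successor, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise pick the
    move whose successor state has the highest current estimate."""
    if random.random() < epsilon:
        return random.choice(legal_moves)          # exploration
    values = {m: V.get(successor(state, m), 0.0) for m in legal_moves}
    best = max(values.values())
    ties = [m for m, v in values.items() if v == best]
    return random.choice(ties)                     # exploitation, random tie-break
```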
Value Update
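The update itself is the standard tabular one-step TD(0) rule of Sutton and Barto [26], $V(s) \leftarrow V(s) + \alpha\,[r + \gamma V(s') - V(s)]$. A minimal sketch, assuming the paper's tabular setting (the discount $\gamma = 1$ is a natural default for a nine-move episodic game, but is an assumption here):

```python
def td_update(V, s, reward, s_next, alpha, gamma=1.0):
    """One-step TD(0): move V(s) toward the bootstrapped target r + gamma*V(s')."""
    v_s = V.get(s, 0.0)
    target = reward + gamma * V.get(s_next, 0.0)
    V[s] = v_s + alpha * (target - v_s)
```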
3.2.4. Evaluation Functions
Triplet-Coverage Difference
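The TCD formula itself is not reproduced in this extract. One plausible reading, consistent with the name and with the normalized TCD scores used for initialization in Section 4.2, counts the lines each side can still complete and takes the difference; the sketch below is therefore an assumption, not the authors' definition.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]  # the 8 winning lines

def tcd(board, player):
    """Hypothetical TCD: lines still completable by `player` minus lines still
    completable by the opponent, scaled by the 8 lines into [-1, 1]."""
    opp = 3 - player
    open_mine = sum(1 for line in LINES
                    if all(board[i] in (player, 0) for i in line))
    open_theirs = sum(1 for line in LINES
                      if all(board[i] in (opp, 0) for i in line))
    return (open_mine - open_theirs) / len(LINES)
```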
3.2.5. Algorithm
Algorithm 1: One-Step TD with TCD Initialization.
3.2.6. Alternative Evaluation Functions
Weighted Triplet-Coverage Difference
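Again, the extract gives only the name. One natural weighting, offered purely as an illustration (the weights and the form are invented, not the authors' WTCD), scores each still-open line by how close it is to completion:

```python
def wtcd(board, player, weights=(1.0, 2.0, 4.0)):
    """Hypothetical WTCD: each line still open for a side contributes
    weights[k], where k is that side's marks already on the line.
    Reuses LINES from the TCD sketch above."""
    opp = 3 - player
    score = 0.0
    for line in LINES:
        cells = [board[i] for i in line]
        if opp not in cells:              # still completable by player
            score += weights[cells.count(player)]
        if player not in cells:           # still completable by opponent
            score -= weights[cells.count(opp)]
    return score
```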
Two-Step Look-Ahead
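A straightforward reading of a two-step look-ahead is a depth-2 minimax over a heuristic: score each of our moves by the heuristic value after the opponent's best (for them) reply. This interpretation is an assumption; `evaluate` could be, for instance, the `tcd` sketch above.

```python
def two_step_lookahead(board, player, evaluate):
    """Pick the move maximizing `evaluate` after the opponent's best reply."""
    opp = 3 - player

    def place(b, i, p):
        return b[:i] + (p,) + b[i + 1:]

    def worst_case(move):
        child = place(board, move, player)
        replies = [i for i, c in enumerate(child) if c == 0]
        if not replies:                       # the move fills the board
            return evaluate(child, player)
        return min(evaluate(place(child, r, opp), player) for r in replies)

    legal = [i for i, c in enumerate(board) if c == 0]
    return max(legal, key=worst_case)
```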
4. Results
1. Effect of evaluation functions. How much does seeding RL with domain-informed heuristics (in particular, TCD) accelerate convergence compared with the naive random initialization commonly used in the RL literature?
2. Consistency with CGT. Once converged, do two frozen policies produce draws when played head-to-head, matching the CGT determinacy result?
3. Evaluation against human opponents. How does the converged policy fare against human opponents, and does it confirm the theoretical first-player/second-player draw outcome in practice?
4.1. Execution Environment and Reproducibility
4.2. Experiment 1: Effect of the Evaluation Functions
Stage 1: hyperparameter tuning under zero initialization
Stage 2: initialization comparison (random vs. TCD)
- Random initialization (baseline): For all nonterminal states $s$, the table entry $V(s)$ was drawn at random.
- TCD heuristic: For each nonterminal state $s$, we computed the normalized TCD for the player-to-move. We then initialized the tables by setting $V(s)$ to this normalized TCD score.
- Independence check: By design, each seed run is an independent experiment, so the two groups (random vs. TCD) are independent.
- Normality checks (Shapiro–Wilk test): The Shapiro–Wilk test [78] tests the null hypothesis that the data are from a normal distribution. We have p-value = 0.9382 for the random initialization sample and p-value = 0.6096 for the TCD sample. We fail to reject the normality assumption for both samples at the 5% level of significance.
- Equal-variance check (Levene’s test): Levene’s test [79] tests the null hypothesis that all input samples are from populations with equal variances. We have p-value = 0.7088. We fail to reject the homoscedasticity assumption at the 5% level.
- Two-sample t test: After confirming the above assumptions, we can proceed with a standard t test. A two-sample t test [80] tests the null hypothesis that two independent samples have identical means, assuming the populations have identical variances. We obtain an upper one-sided p-value = 0.0203, so we reject the null hypothesis of equal means in favor of a significant reduction under TCD initialization at the 5% level (a SciPy sketch reproducing all three checks follows this list).
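Assuming SciPy, the three checks can be reproduced directly from the per-seed convergence data in Appendix C; the printed p-values should match those reported above.

```python
from scipy import stats

# Episodes to convergence per seed (126-130), from Appendix C.
random_init = [224_547, 200_676, 162_123, 189_229, 173_859]
tcd_init = [195_185, 149_118, 112_231, 123_535, 151_241]

# Shapiro-Wilk normality checks (paper: p = 0.9382 and p = 0.6096).
print(stats.shapiro(random_init).pvalue)
print(stats.shapiro(tcd_init).pvalue)

# Levene equal-variance check (paper: p = 0.7088).
print(stats.levene(random_init, tcd_init).pvalue)

# Upper one-sided two-sample t test, equal variances (paper: p = 0.0203).
print(stats.ttest_ind(random_init, tcd_init,
                      equal_var=True, alternative="greater").pvalue)
```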
4.3. Experiment 2: Consistency with Combinatorial Game Theory
4.4. Experiment 3: Evaluation Against Human Opponents
5. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| 2SLA | Two-step look-ahead |
| CGT | Combinatorial game theory |
| DP | Dynamic programming |
| i.i.d. | Independent and identically distributed |
| KSLA | k-step look-ahead |
| MC | Monte Carlo |
| MDP | Markov decision process |
| RL | Reinforcement learning |
| TCD | Triplet-coverage difference |
| TD | Temporal-difference |
| WTCD | Weighted triplet-coverage difference |
Appendix A. Proofs of Theorems
Appendix A.1. Proof of Theorem 1
1. General determinacy.
2. Eliminate Player 2 wins.
Appendix A.2. Proof of Theorem 2
- Each player completes a triplet if possible.
- Each player intercepts the opponent from completing a triplet in their next move.
- If both completing a triplet and intercepting the opponent’s threat are possible in their next move, the player may choose either action (a code sketch of this rule set appears after the proof outline below).
1. Existence of draw-forcing strategies for both players.
2. Nonexistence of a single universal pairing for both roles.
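The completing/intercepting rules above translate directly into code. The sketch below is an illustration of that rule set; the fallback when neither rule applies is an arbitrary choice here, not part of the proof, and preferring completion when both rules apply is consistent with the proof's "may choose either".

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]  # the 8 winning lines

def completing_cells(board, who):
    """Empty cells that would complete a triplet for `who` this move."""
    cells = set()
    for line in LINES:
        marks = [board[i] for i in line]
        if marks.count(who) == 2 and marks.count(0) == 1:
            cells.add(line[marks.index(0)])
    return cells

def rule_based_move(board, player):
    """Complete a triplet if possible; otherwise intercept the opponent."""
    complete = completing_cells(board, player)
    if complete:
        return min(complete)                 # rule 1: complete a triplet
    block = completing_cells(board, 3 - player)
    if block:
        return min(block)                    # rule 2: intercept the threat
    return next(i for i, c in enumerate(board) if c == 0)  # arbitrary fallback
```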
Appendix B. Algorithm Flowchart
[Figure: flowchart of the one-step TD training loop with TCD initialization (Algorithm 1).]
Appendix C. Raw Convergence Data for Hyperparameter Tuning and Heuristic Comparison
| ε | α | Seed 121 | Seed 122 | Seed 123 | Seed 124 | Seed 125 |
|---|---|---|---|---|---|---|
| 0.05 | 0.05 | 310,033 | 257,279 | 145,606 | 258,198 | 324,347 |
| 0.05 | 0.10 | 387,648 | 154,423 | 300,944 | 167,846 | 213,967 |
| 0.05 | 0.20 | 205,384 | 128,614 | 214,257 | 193,277 | 232,569 |
| 0.05 | 0.50 | 178,748 | 160,856 | 210,678 | 169,766 | 169,192 |
| 0.10 | 0.05 | 219,702 | 310,615 | 371,382 | 212,151 | 632,958 |
| 0.10 | 0.10 | 481,303 | 234,146 | 388,509 | 213,843 | 369,202 |
| 0.10 | 0.20 | 245,052 | 220,414 | 262,556 | 167,146 | 204,613 |
| 0.10 | 0.50 | 196,265 | 205,237 | 177,817 | 206,053 | 156,525 |
| 0.20 | 0.05 | 782,386 | 579,277 | 476,076 | 585,719 | 762,779 |
| 0.20 | 0.10 | 351,709 | 377,120 | 334,538 | 439,090 | 292,565 |
| 0.20 | 0.20 | 180,459 | 204,137 | 206,206 | 160,858 | 191,421 |
| 0.20 | 0.50 | 170,843 | 212,443 | 171,355 | 233,151 | 237,589 |
| 0.30 | 0.05 | 731,808 | 562,911 | 711,081 | 767,505 | 826,691 |
| 0.30 | 0.10 | 337,431 | 319,809 | 295,909 | 362,175 | 401,717 |
| 0.30 | 0.20 | 237,782 | 186,920 | 190,416 | 182,962 | 216,065 |
| 0.30 | 0.50 | 261,807 | 216,197 | 211,378 | 285,348 | 275,726 |
| Initialization | Seed 126 | Seed 127 | Seed 128 | Seed 129 | Seed 130 |
|---|---|---|---|---|---|
| Random Baseline | 224,547 | 200,676 | 162,123 | 189,229 | 173,859 |
| TCD Heuristic | 195,185 | 149,118 | 112,231 | 123,535 | 151,241 |
References
- Mayra, F. An Introduction to Game Studies: Games in Culture; Sage Publications Ltd.: Thousand Oaks, CA, USA, 2008.
- Berlekamp, E.R.; Conway, J.H.; Guy, R.K. Winning Ways for Your Mathematical Plays, Volume 1, 2nd ed.; CRC Recreational Mathematics Series; A K Peters/CRC Press: Boca Raton, FL, USA, 2001.
- Berlekamp, E.R.; Conway, J.H.; Guy, R.K. Winning Ways for Your Mathematical Plays, Volume 2, 2nd ed.; CRC Recreational Mathematics Series; A K Peters/CRC Press: Boca Raton, FL, USA, 2003.
- Berlekamp, E.R.; Conway, J.H.; Guy, R.K. Winning Ways for Your Mathematical Plays, Volume 3, 2nd ed.; CRC Recreational Mathematics Series; A K Peters/CRC Press: Boca Raton, FL, USA, 2003.
- Berlekamp, E.R.; Conway, J.H.; Guy, R.K. Winning Ways for Your Mathematical Plays, Volume 4, 2nd ed.; CRC Recreational Mathematics Series; A K Peters/CRC Press: Boca Raton, FL, USA, 2004.
- Aalberg, T.; Strömbäck, J.; de Vreese, C.H. The framing of politics as strategy and game: A review of concepts, operationalizations and key findings. Journalism 2012, 13, 162–178.
- Boyle, E.; Connolly, T.; Hainey, T. The Role of Psychology in Understanding the Impact of Computer Games. Entertain. Comput. 2011, 2, 69–74.
- DiCicco-Bloom, B.; Gibson, D. More than a Game: Sociological Theory from the Theories of Games. Sociol. Theory 2010, 28, 247–271.
- Jaeger, G. Applications of Game Theory in Linguistics. Lang. Linguist. Compass 2008, 2, 406–421.
- Moreno-Ger, P.; Burgos, D.; Martínez-Ortiz, I.; Sierra, J.; Fernández-Manjón, B. Educational Game Design for Online Education. Comput. Hum. Behav. 2008, 24, 2530–2540.
- Zagal, J.P.; Tomuro, N.; Shepitsen, A. Natural Language Processing in Game Studies Research: An Overview. Simul. Gaming 2012, 43, 356–373.
- Lankoski, P.; Björk, S. Game Research Methods: An Overview; ETC Press: Pittsburgh, PA, USA, 2015.
- Lightner, J. A Brief Look at the History of Probability and Statistics. Math. Teach. 1991, 84, 623–630.
- Biggs, N.L.; Lloyd, E.; Wilson, R.J. Graph Theory 1736–1936, 2nd ed.; Clarendon Press: Oxford, UK, 1986.
- Bishop, J.; Nasuto, S.; Tanay, T.; Roesch, E.; Spencer, M. HeX and the Single Anthill: Playing Games with Aunt Hillary. In Fundamental Issues of Artificial Intelligence; Springer International Publishing: Cham, Switzerland, 2016; pp. 369–390.
- Ewerhart, C. Backward Induction and the Game-Theoretic Analysis of Chess. Games Econ. Behav. 2002, 39, 206–214.
- Schaeffer, J.; Burch, N.; Björnsson, Y.; Kishimoto, A.; Müller, M.; Lake, R.; Lu, P.; Sutphen, S. Checkers Is Solved. Science 2007, 317, 1518–1522.
- Burguillo, J. Using Game Theory and Competition-Based Learning to Stimulate Student Motivation and Performance. Comput. Educ. 2010, 55, 566–575.
- Camerer, C.F. Behavioral Game Theory: Experiments in Strategic Interaction; Princeton University Press: Princeton, NJ, USA, 2003.
- Fudenberg, D.; Tirole, J. Game Theory; MIT Press: Cambridge, MA, USA, 1991.
- Snidal, D. The Game Theory of International Politics. World Politics 1985, 38, 25–57.
- Vincent, T.L.; Brown, J.S. Evolutionary Game Theory, Natural Selection, and Darwinian Dynamics; Cambridge University Press: Cambridge, UK, 2005.
- Bosansky, B.; Kiekintveld, C.; Lisy, V.; Pechoucek, M. An Exact Double-Oracle Algorithm for Zero-Sum Extensive-Form Games with Imperfect Information. J. Artif. Intell. Res. 2014, 51, 829–866.
- van den Herik, H.; Uiterwijk, J.W.; van Rijswijck, J. Games Solved: Now and in the Future. Artif. Intell. 2002, 134, 277–311.
- Fraenkel, A.S. Combinatorial Games: Selected Bibliography with a Succinct Gourmet Introduction. Electron. J. Comb. 2012, Dynamic Survey DS2.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
- Aradi, S. Survey of Deep Reinforcement Learning for Motion Planning of Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 740–759.
- Guo, P.; Xiao, K.; Ye, Z.; Zhu, H.; Zhu, W. Intelligent Career Planning via Stochastic Subsampling Reinforcement Learning. Sci. Rep. 2022, 12, 8332.
- Kober, J.; Bagnell, J.; Peters, J. Reinforcement Learning in Robotics: A Survey. Int. J. Robot. Res. 2013, 32, 1238–1274.
- Lee, D.; Seo, H.; Jung, M. Neural Basis of Reinforcement Learning and Decision Making. Annu. Rev. Neurosci. 2012, 35, 287–308.
- Sahu, S.K.; Mokhade, A.; Bokde, N.D. An Overview of Machine Learning, Deep Learning, and Reinforcement Learning–Based Techniques in Quantitative Finance: Recent Progress and Challenges. Appl. Sci. 2023, 13, 1956.
- Xu, R.; Li, K.; Wang, H.; Kementzidis, G.; Zhu, W.; Deng, Y. RL-QESA: Reinforcement-Learning Quasi-Equilibrium Simulated Annealing. In Proceedings of the 2nd AI for Math Workshop @ ICML 2025, Vancouver, BC, Canada, 18 July 2025.
- Li, Y. Deep Reinforcement Learning: An Overview. arXiv 2018, arXiv:1701.07274.
- François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An Introduction to Deep Reinforcement Learning. Found. Trends Mach. Learn. 2018, 11, 219–354.
- Crowley, K.; Siegler, R.S. Flexible Strategy Use in Young Children’s Tic-Tac-Toe. Cogn. Sci. 1993, 17, 531–561.
- Watson, J. Strategy: An Introduction to Game Theory, 3rd ed.; W. W. Norton & Company: New York, NY, USA, 2013.
- Beck, J. Combinatorial Games: Tic-Tac-Toe Theory, 1st ed.; Encyclopedia of Mathematics and its Applications, Volume 114; Cambridge University Press: Cambridge, UK, 2008.
- Ho, J.; Huang, J.; Chang, B.; Liu, A.; Liu, Z. Reinforcement Learning: Playing Tic-Tac-Toe. J. Stud. Res. 2023, 11, 1–6.
- Kalra, B. Generalised Agent for Solving Higher Board States of Tic Tac Toe Using Reinforcement Learning. In 2022 Seventh International Conference on Parallel, Distributed and Grid Computing (PDGC); IEEE: Piscataway, NJ, USA, 2022; pp. 715–720.
- Peng, T.; Mao, J. Reinforcement Learning in Tic-Tac-Toe Game and Its Similar Variations. Dartmouth CS134 Final Report; Thayer School of Engineering at Dartmouth College: Hanover, NH, USA, 2009.
- Veness, J.; Ng, K.S.; Hutter, M.; Uther, W.; Silver, D. A Monte-Carlo AIXI Approximation. J. Artif. Intell. Res. 2009, 40, 95–142.
- Vodopivec, T.; Samothrakis, S.; Ster, B. On Monte Carlo Tree Search and Reinforcement Learning. J. Artif. Intell. Res. 2017, 60, 881–936.
- Beaumont, K.; Collier, R. Do You Want to Play a Game? Learning to Play Tic-Tac-Toe in Hypermedia Environments. arXiv 2024, arXiv:2411.06398.
- C-Lara-Instance, C.; Rayner, M. Reinforcement Learning for Chain of Thought Reasoning: A Case Study Using Tic-Tac-Toe. Res. Prepr. 2024.
- Karmanova, E.; Serpiva, V.; Perminov, S.; Fedoseev, A.; Tsetserukou, D. SwarmPlay: Interactive Tic-tac-toe Board Game with Swarm of Nano-UAVs driven by Reinforcement Learning. arXiv 2021, arXiv:2108.01593.
- Niveaditha, V.; Sulthana, P.; Gurusamy, B.; Doss, S. Word Based Tic-Tac-Toe Using Reinforcement Learning. In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT); IEEE: Piscataway, NJ, USA, 2024; pp. 1–7.
- Zhu, T.; Ma, M.H. Deriving the Optimal Strategy for the Two Dice Pig Game via Reinforcement Learning. Stats 2022, 5, 805–818.
- Zhu, T.; Ma, M.H.; Chen, L.; Liu, Z. Optimal Strategy of the Simultaneous Dice Game Pig for Multiplayers: When Reinforcement Learning Meets Game Theory. Sci. Rep. 2023, 13, 8142.
- Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed.; Pearson Series in Artificial Intelligence; Pearson: London, UK, 2021.
- Golomb, S.; Hales, A. Hypercube Tic-Tac-Toe. In More Games of No Chance; Cambridge University Press: Cambridge, UK, 2002; pp. 167–182.
- Hales, A.; Jewett, R. Regularity and Positional Games. Trans. Amer. Math. Soc. 1963, 106, 222–229.
- Hefetz, D.; Krivelevich, M.; Stojaković, M.; Szabó, T. Positional Games, 1st ed.; Oberwolfach Seminars; Springer: Basel, Switzerland, 2014.
- Goff, A. Quantum Tic-Tac-Toe: A Teaching Metaphor for Superposition in Quantum Mechanics. Am. J. Phys. 2006, 74, 962–973.
- Akyildiz, I.F.; Lee, W.Y.; Vuran, M.C.; Mohanty, S. NeXt generation/dynamic spectrum access/cognitive radio wireless networks: A survey. Comput. Netw. 2006, 50, 2127–2159.
- Sherman, M.; Mody, A.N.; Martinez, R.; Rodriguez, C.; Reddy, R. IEEE Standards Supporting Cognitive Radio and Networks, Dynamic Spectrum Access, and Coexistence. IEEE Commun. Mag. 2008, 46, 72–79.
- Maza, I.; Ollero, A. Multiple UAV Cooperative Searching Operation Using Polygon Area Decomposition and Efficient Coverage Algorithms. In Distributed Autonomous Robotic Systems 6; Springer: Tokyo, Japan, 2007; pp. 221–230.
- Banzhaf, J.F., III. One Man, 3.312 Votes: A Mathematical Analysis of the Electoral College. Villanova Law Rev. 1968, 13, 304–332.
- Anthony, T.; Tian, Z.; Barber, D. Thinking Fast and Slow with Deep Learning and Tree Search. arXiv 2017, arXiv:1705.08439.
- Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the Game of Go without Human Knowledge. Nature 2017, 550, 354–359.
- Milgrom, P. Putting Auction Theory to Work: The Simultaneous Ascending Auction. J. Political Econ. 2000, 108, 245–272.
- Hotelling, H. Stability in Competition. Econ. J. 1929, 39, 41–57.
- Porter, M. Competitive Advantage: Creating and Sustaining Superior Performance; Free Press: New York, NY, USA, 1985.
- Demaine, E.D.; Hearn, R.A. Playing Games with Algorithms: Algorithmic Combinatorial Game Theory. arXiv 2001, arXiv:cs.CC/0106019.
- Wooldridge, M. Thinking Backward with Professor Zermelo. IEEE Intell. Syst. 2015, 30, 62–67.
- Bouton, C. Nim, a Game with a Complete Mathematical Theory. Ann. Math. 1901, 3, 35–39.
- Nash, J. Some Games and Machines for Playing Them; RAND Corporation: Santa Monica, CA, USA, 1952.
- Conway, J.H. On Numbers and Games, 2nd ed.; A K Peters/CRC Press: Boca Raton, FL, USA, 2001.
- Albert, M.H.; Nowakowski, R.J.; Wolfe, D. Lessons in Play: An Introduction to Combinatorial Game Theory, 2nd ed.; A K Peters/CRC Press: Boca Raton, FL, USA, 2019; p. 344.
- dos Santos, C.P. Embedding Processes in Combinatorial Game Theory. Discret. Appl. Math. 2011, 159, 675–682.
- Kechris, A.S. Classical Descriptive Set Theory; Graduate Texts in Mathematics, Volume 156; Springer: New York, NY, USA, 1995.
- Bodwin, G.; Grossman, O. Strategy-Stealing is Non-Constructive. arXiv 2019, arXiv:1911.06907.
- Schwalbe, U.; Walker, P. Zermelo and the Early History of Game Theory. Games Econ. Behav. 2001, 34, 123–137.
- Zermelo, E. Über eine Anwendung der Mengenlehre auf die Theorie des Schachspiels. In Proceedings of the Fifth International Congress of Mathematicians; Cambridge University Press: Cambridge, UK, 1913; pp. 501–504.
- Boyan, J.; Moore, A. Learning Evaluation Functions to Improve Optimization by Local Search. J. Mach. Learn. Res. 2000, 1, 77–112.
- Lee, K.; Mahajan, S. A Pattern Classification Approach to Evaluation Function Learning. Artif. Intell. 1988, 36, 1–25.
- Marsland, T. Evaluation-Function Factors. ICGA J. 1985, 8, 47–57.
- Shannon, C.E. Programming a Computer for Playing Chess. In Computer Chess Compendium; Springer: New York, NY, USA, 1988; pp. 2–13.
- Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611.
- Levene, H. Robust tests for equality of variances. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling; Stanford University Press: Palo Alto, CA, USA, 1960; Volume 2, pp. 278–292.
- Student. The Probable Error of a Mean. Biometrika 1908, 6, 1–25.

| Aspect | Classic Tic-Tac-Toe | Cumulative Tic-Tac-Toe |
|---|---|---|
| Termination | Play ends immediately upon triplet or full board. | Play continues until all cells are occupied. |
| Win Condition | First to make a completed triplet wins; if the board fills with no triplet, it is a draw. | Player with the highest total number of triplets wins; an equal total yields a draw. |
| Play Length | At most nine moves; may end earlier when someone wins. | Exactly nine moves (all cells filled). |
| Outcome Possibilities | Win, draw, or lose. | Win, draw, or lose. |
| Strategy Complexity | Relatively simple; the game is solved, and optimal play leads to a draw. | Higher complexity, as each placement can contribute to multiple triplets; potentially richer mid- and endgame decisions. |
| Game Solved Status | Solved as a draw under optimal play. | Not yet solved; an explicit optimal strategy remains unknown. |
| Experiment | Setup | Metric | Outcome |
|---|---|---|---|
| 1. Effect of evaluation functions | One-step temporal-difference (TD) learning with tuned step size and exploration rate; win and draw rates tracked over a sliding window of episodes | Episodes to convergence (sample variances of cumulative win and draw rates below a preset threshold) | Average random baseline: 190,087 episodes; average TCD heuristic: 146,262 episodes (one-sided p = 0.0203) |
| 2. Consistency with combinatorial game theory (CGT) | Frozen policies (no exploration); 100 head-to-head games | Fraction of draws | 100% draws (matches CGT prediction of a draw under optimal play) |
| 3. Evaluation against human opponents | Human participants vs. converged TCD policy (tested as Player 1 and Player 2) | Human win rate | 0% human wins across all trials (aligns with RL optimal draw policy and CGT theory) |
| ε \ α | 0.05 | 0.10 | 0.20 | 0.50 |
|---|---|---|---|---|
| 0.05 | 259,093 (±70,243) | 244,966 (±98,232) | 194,820 (±39,686) | 177,848 (±19,414) |
| 0.10 | 349,362 (±171,793) | 337,401 (±112,093) | 219,956 (±36,977) | 188,379 (±21,125) |
| 0.20 | 637,247 (±131,163) | 359,004 (±54,332) | 188,616 (±18,676) | 205,076 (±32,437) |
| 0.30 | 719,999 (±98,152) | 343,408 (±40,623) | 202,829 (±23,446) | 250,091 (±34,223) |
| Initialization | Mean Episodes | Standard Deviation | Reduction vs. Random |
|---|---|---|---|
| Random initialization (baseline) | 190,087 | 24,216 | — |
| TCD heuristic | 146,262 | 32,020 | ≈23% |
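For reference, the reduction in the last column follows directly from the two means:

$$\frac{190{,}087 - 146{,}262}{190{,}087} = \frac{43{,}825}{190{,}087} \approx 0.2306,$$

i.e., TCD initialization cuts the mean number of episodes to convergence by roughly 23%.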