A Hybrid Human-Centric Framework for Discriminating Engine-like from Human-like Chess Play: A Proof-of-Concept Study
Abstract
1. Introduction
2. State of the Art
2.1. Centipawn Loss and the Boundaries of Static Metrics
2.2. Benchmark Datasets and Their Constraints
2.3. Stylometric Analysis and Human-Centric Modeling
2.4. Prior Research with Maia-Derived Features
2.5. The Niemann Case and the Fragility of CPL
2.6. Platform-Level Discrimination Frameworks: Transparency vs. Rigidity
2.7. Toward Greater Statistical Rigor: Curvature-Based Metrics
3. Problem Statement and Hypotheses
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
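For concreteness, the short Python sketch below shows how these four counts translate into the accuracy, precision, recall, and F1-score values used to evaluate the models in Section 5; the counts in the example are hypothetical and serve only to illustrate the arithmetic.

```python
# Minimal sketch: how confusion-matrix counts (TP, TN, FP, FN) map to the
# evaluation metrics reported in this paper. The example counts are hypothetical.

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper's experiments.
    print(classification_metrics(tp=95, tn=91, fp=9, fn=5))
```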
Comparative Synthesis and Novelty of the Proposed Framework
4. Methodology
4.1. Data Sources and Preprocessing
4.2. Human-Centric Modeling with Maia
4.3. Feature Extraction and Model Training
Algorithm 1. Hybrid Feature Extraction and Training Framework
1. Parse PGN games into structured position sequences.
2. For each position:
   a. Evaluate with Stockfish → record CPL and top-k move scores.
   b. Evaluate with Maia → record MMMP and the move-probability distribution.
3. Derive stability measures across search depths.
4. Construct feature tensors whose dimensions correspond to the number of positions, the search depth, and the number of candidate moves.
5. Train CNN classifiers on:
   - Stockfish-only features,
   - Maia-only features,
   - combined hybrid features.
6. Evaluate using accuracy, precision, recall, F1-score, and McNemar's test.
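To make steps (a)–(b) of Algorithm 1 concrete, the sketch below queries both engines through the python-chess UCI interface. The engine paths, depth grid, and MultiPV width are illustrative assumptions; Maia is assumed to be exposed as a UCI engine via an lc0 binary loading the maia-1900 weights; and Maia's preferred move is used here as a simple proxy for the MMMP signal rather than its full probability distribution.

```python
# Minimal sketch of the per-position dual-engine evaluation (Algorithm 1, steps a-b).
# Assumptions: Stockfish and a Maia network (served through an lc0 UCI binary) are
# available locally; paths, depth grid, and MultiPV width are illustrative.
import chess
import chess.engine
import chess.pgn

STOCKFISH_PATH = "/usr/local/bin/stockfish"  # assumed path, not from the paper
MAIA_PATH = "/usr/local/bin/lc0"             # assumed: lc0 loading maia-1900 weights
DEPTHS = range(6, 16, 2)                     # illustrative depth grid
TOP_K = 5                                    # candidate moves recorded per position


def evaluate_position(board, stockfish, maia):
    """Collect Stockfish top-k scores across depths and Maia's preferred move."""
    stockfish_scores = {}
    for depth in DEPTHS:
        infos = stockfish.analyse(board, chess.engine.Limit(depth=depth), multipv=TOP_K)
        stockfish_scores[depth] = [
            (info["pv"][0].uci(), info["score"].relative.score(mate_score=10000))
            for info in infos
        ]
    # Maia queried at a single node so that its policy (human move preference)
    # dominates the choice; used here as a simple proxy for the MMMP signal.
    maia_move = maia.play(board, chess.engine.Limit(nodes=1)).move
    return {"stockfish": stockfish_scores, "maia_move": maia_move.uci()}


def extract_features(pgn_path):
    """Run steps (a)-(b) for every position of every game in a PGN file."""
    all_games = []
    with open(pgn_path) as pgn, \
         chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as stockfish, \
         chess.engine.SimpleEngine.popen_uci(MAIA_PATH) as maia:
        while (game := chess.pgn.read_game(pgn)) is not None:
            board = game.board()
            positions = []
            for move in game.mainline_moves():
                features = evaluate_position(board, stockfish, maia)
                features["played"] = move.uci()
                features["matches_maia"] = features["played"] == features["maia_move"]
                positions.append(features)
                board.push(move)
            all_games.append(positions)
    return all_games
```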
4.4. Design of Evaluation Metrics: CPL and MMMP
4.5. CPL and MMMP Analysis
4.6. Curvature-Based Stability (∆S)
The stability metric is defined over the engine's move predictions, indexed by:
- d: engine search depth
- m: move number within a game (m = 1, ..., M, the number of moves per game)
- k: ranking of the candidate move

Let $p_{m,d,k}$ denote the predicted probability of the $k$-th ranked candidate move for the $m$-th move of the game when analyzed at depth $d$. Three quantities are derived:
1. First derivative with respect to depth, capturing how prediction probabilities change across increasing depths:
   $\dfrac{\partial p_{m,d,k}}{\partial d} \approx p_{m,d+1,k} - p_{m,d,k}$
2. Second derivative (curvature), indicating the acceleration or volatility of those changes:
   $\dfrac{\partial^{2} p_{m,d,k}}{\partial d^{2}} \approx p_{m,d+1,k} - 2\,p_{m,d,k} + p_{m,d-1,k}$
3. Aggregated scalar $\Delta S$, computed by averaging the squared second derivatives over all valid indices:
   $\Delta S = \dfrac{1}{|V|} \sum_{(m,d,k) \in V} \left( p_{m,d+1,k} - 2\,p_{m,d,k} + p_{m,d-1,k} \right)^{2}$, where $V$ is the set of indices for which the second difference is defined.
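Under this finite-difference reading of the definitions, ∆S for a single position can be sketched in a few lines of NumPy; the probability matrices in the example are synthetic, and a game-level value would additionally average over all positions of the game.

```python
# Minimal sketch of the curvature-based stability metric Delta-S. p[d, k] is the
# predicted probability of the k-th ranked candidate move at the d-th analysis
# depth for one position; the example matrices are synthetic, not the paper's data.
import numpy as np


def delta_s(p: np.ndarray) -> float:
    """Mean squared second difference of move probabilities along the depth axis."""
    curvature = np.diff(p, n=2, axis=0)   # second derivative w.r.t. depth, shape (D-2, K)
    return float(np.mean(curvature ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stable = 0.5 + 0.01 * rng.standard_normal((10, 5))    # human-like: flat across depth
    volatile = 0.5 + 0.20 * rng.standard_normal((10, 5))  # engine-like: jumps across depth
    print(f"stable  delta-S = {delta_s(stable):.5f}")
    print(f"volatile delta-S = {delta_s(volatile):.5f}")
```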
4.7. CNN Modeling
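The exact network architecture is not reproduced here; the following PyTorch sketch only illustrates the kind of compact CNN classifier that Algorithm 1 trains on the hybrid feature tensors. The channel layout (depths as input channels), layer sizes, and tensor dimensions are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of a CNN classifier over per-game hybrid feature tensors of shape
# (depths, positions, candidate moves). Layer sizes and dimensions are illustrative.
import torch
import torch.nn as nn


class HybridChessCNN(nn.Module):
    """Binary classifier: human-like (0) vs. engine-optimized (1) play."""

    def __init__(self, n_depths: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_depths, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # pool over (positions, moves) to a fixed grid
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 1),               # single logit for the binary decision
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


if __name__ == "__main__":
    model = HybridChessCNN(n_depths=5)
    dummy = torch.randn(8, 5, 40, 5)  # batch of 8 games, 5 depths, 40 positions, 5 moves
    print(model(dummy).shape)         # torch.Size([8, 1])
```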
4.8. Statistical Validation Framework
4.9. Predictive Inference and Control Setup
4.10. Comparison with Established Baseline Methods
5. Results
5.1. Post-Processed Data Pattern Analysis
5.1.1. CPL Patterns in Human Games
5.1.2. CPL Patterns in Engine-Optimized Play
5.1.3. Comparative ΔS Analysis: Human vs. Engine-Optimized Patterns
5.1.4. Empirical Validation of ΔS Metric
5.2. Model Evaluation
5.2.1. Experimental Model Performance
- The hybrid model that included Maia’s behavioral features demonstrated strong predictive capability.
- White-side task: During early epochs, the network showed signs of underfitting, but by the end of training (epoch 50) it reached a validation accuracy of 99.37% with a loss of 0.0127. Training accuracy remained consistently high, confirming that the model achieved a stable fit without severe overfitting. Figure 5 presents the accuracy and loss curves, illustrating rapid convergence and strong learning stability.
- Black-side task: Training progressed more gradually, with accuracy stabilizing at about 93% after 120 epochs. Although the validation loss fluctuated during the initial phase, it settled into a more stable pattern over time, indicating robust generalization even in the more complex detection setting (Figure 6).
- Test evaluation:
- White-side classification produced perfect scores (accuracy = 1.00, F1-score = 1.00 for class 0). However, this result reflects the absence of positive cases in the White-side test data rather than genuine discriminative power.
- Black-side classification achieved accuracy = 0.93 with precision of 0.95 (class 0) and 0.91 (class 1), recall of 0.91 (class 0) and 0.95 (class 1), and a macro-averaged F1-score = 0.93.
- These findings confirm that the expanded feature space, which integrates Maia’s human-aligned modeling, improves performance and helps capture subtle behavioral differences not visible with Stockfish alone.
5.2.2. Control Model Performance
5.2.3. Comparative Summary and Hypothesis Testing
- 5-fold cross-validation mean F1-scores: Experimental Model = 0.925 ± 0.018, Control Model = 0.862 ± 0.022
- The hybrid model showed consistent improvement across all folds (minimum ΔF1 = 0.051, maximum ΔF1 = 0.067)
- F1-score improvement: 0.060 [95% CI: 0.032, 0.088]
- Accuracy improvement: 0.055 [95% CI: 0.025, 0.085]
- Recall improvement: 0.050 [95% CI: 0.018, 0.082]
- Bootstrapped p-value: 0.0168 [95% CI: 0.012, 0.024]
- Exact binomial test: p = 0.0133
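The sketch below illustrates, under stated assumptions, how such figures can be produced: a percentile bootstrap over resampled games for the confidence interval on the macro-F1 improvement, and an exact binomial (sign) test on the games the two models classify differently, the exact analogue of McNemar's test. The prediction arrays are placeholders and the resampling count is arbitrary, not the paper's configuration.

```python
# Minimal sketch of the paired statistical comparison: bootstrap CI on the macro-F1
# improvement and an exact binomial test on discordant games. Inputs are illustrative
# numpy arrays of per-game labels/predictions, not the paper's data.
import numpy as np
from scipy.stats import binomtest
from sklearn.metrics import f1_score


def bootstrap_f1_improvement(y_true, pred_exp, pred_ctrl, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for macro-F1(experimental) - macro-F1(control)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample games with replacement
        diffs[b] = (f1_score(y_true[idx], pred_exp[idx], average="macro")
                    - f1_score(y_true[idx], pred_ctrl[idx], average="macro"))
    return diffs.mean(), np.percentile(diffs, [2.5, 97.5])


def exact_disagreement_test(y_true, pred_exp, pred_ctrl):
    """Exact binomial test on discordant pairs (exact analogue of McNemar's test)."""
    exp_only = int(np.sum((pred_exp == y_true) & (pred_ctrl != y_true)))
    ctrl_only = int(np.sum((pred_ctrl == y_true) & (pred_exp != y_true)))
    return binomtest(exp_only, exp_only + ctrl_only, p=0.5)
```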
5.2.4. Comprehensive Model Validation
6. Discussion
6.1. Key Findings
6.2. Interpretation of Feature Patterns
6.3. Model Performance and Statistical Validation
6.4. Limitations and Constraints
- Controlled Experimental Design vs. Real-World Detection: The most significant boundary of this work is its reliance on algorithmically generated ‘suspicious’ labels rather than confirmed cases of cheating. Our dataset creates a clean separation between Maia-modeled human-like patterns and Stockfish-optimized patterns for controlled experimentation. Consequently, the performance metrics (F1 = 0.93) reflect pattern discrimination capability between behavioral archetypes in a synthetic environment, not validated cheating detection accuracy in competitive play. This distinction is fundamental: we have demonstrated that hybrid features can distinguish between two behavioral extremes, but not that they can identify actual misconduct involving sophisticated, hybrid human-engine interaction.
- Structural Artifacts in Dataset Construction: All ‘suspicious’ instances were algorithmically associated with Black pieces, creating a trivial classification task for White-side games. While we mitigated this through color-specific modeling, this artificial imbalance limits conclusions about general discrimination capability and introduces the risk of model artifacts. A balanced benchmark with equal distribution across colors and game phases would provide a more robust evaluation.
- Scale and Composition Limitations: The dataset size (1000 games) and composition impose constraints on statistical confidence. While our validation framework employed multiple robustness checks (cross-validation, bootstrapping, McNemar’s test), the confidence intervals remain relatively wide, reflecting uncertainty inherent in smaller samples. Additionally, the rating band (1900–2100 Lichess) may not generalize to other skill levels where behavioral patterns differ substantially.
- Computational Scalability Considerations: The dual-engine evaluation pipeline (4.2 GPU-hours for 1000 games) presents practical challenges for real-time, large-scale deployment. While this is primarily an implementation concern rather than a methodological flaw, it highlights a trade-off between analytical depth and operational feasibility that future implementations must address.
- Engine Version Specificity: Fixing Stockfish v15.0 and Maia-1900 ensures experimental consistency but limits immediate generalization to other engine versions or Maia models trained on different rating bands. Engine evaluations evolve with updates, and human behavioral models vary across skill levels.
- Absence of Contextual Features: The framework currently analyzes moves in isolation without incorporating game context (time controls, tournament settings, player history) or temporal features (move timing patterns). Real-world discrimination frameworks typically integrate such contextual signals to reduce false positives and model authentic decision-making more holistically.
- Statistical Power and Effect Size: While our statistical validation demonstrates significant improvement (p = 0.0153), the effect size and practical significance in real-world scenarios remain unvalidated. The controlled nature of our experiment likely amplifies discriminative differences compared to the noisier, more subtle patterns of actual cheating.
6.5. Relation to Previous Research
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Majhi, S.G. ‘A Brave New World’: Exploring the Implications of Online Chess for the Sport Post the Pandemic. In Sports Management in an Uncertain Environment; Springer Nature: Singapore, 2023; pp. 255–270. [Google Scholar]
- Wen, Y.-C. Secondary Analysis of Interviews About the Factors Driving the Membership Growth of Chess.com. J. Bus. Adm. Lang. 2024, 12, 75–95. [Google Scholar]
- Laarhoven, T.; Ponukumati, A. Towards transparent cheat detection in online chess: An application of human and computer decision-making preferences. In Proceedings of the International Conference on Computers and Games, Virtually, 22–24 November 2022; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
- Gajjar, V. The Biggest Cheating Scandal in the History of Chess: Carlsen v. Niemann-A Case Note. Glob. Sports Pol’y Rev. 2023, 3, 66. [Google Scholar]
- Obradović, P.; Mišić, M. Network Dynamics of the Online Chess Platform Lichess: A Social Network Analysis Case Study; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
- Bruvschessmedia. The Use of Engines, Average Centipawn Loss and Online Cheating. 2018. Available online: https://bruvschessmedia.com/the-use-of-engines-average-centipawn-loss-and-online-cheating/ (accessed on 14 August 2025).
- Dandoy, B. Chess Cheating Dataset. Kaggle. 2021. Available online: https://www.kaggle.com/datasets/brieucdandoy/chess-cheating-dataset (accessed on 14 August 2025).
- Zaiden, F.F. A Descorporização do Ator: Do Capitalismo à Inteligência Artificial. Master’s Thesis, Universidade NOVA de Lisboa, Lisbon, Portugal, 2024. [Google Scholar]
- GLTR Team. Harvard NLP + MIT-IBM Watson AI Lab. Giant Language Model Test Room. 2019. Available online: https://gltr.io (accessed on 14 August 2025).
- Iavich, M.; Kevanishvili, Z. Detecting Fair Play Violations in Chess Using Neural Networks. In IVUS2024: Information Society and University Studies, Proceedings of the 29th International Conference Information Society and University Studies, Kaunas, Lithuania, 17 May 2024; CEUR Workshop Proceedings; CEUR: Aachen, Germany, 2024; Volume 3885. [Google Scholar]
- Carta, A. The Hans Niemann Case: Numbers—What They Reveal and What They Do Not Reveal. ChessBase. Available online: https://en.chessbase.com/post/the-hans-niemann-case-numbers-what-they-reveal-and-what-they-do-not-reveal (accessed on 14 October 2022).
- Lichess Forum. Analysis of Lichess Pattern Discrimination with Machine Learning (ML): A Misuse of ML. 2023. Available online: https://lichess.org/forum/general-chess-discussion/analysis-of-lichess-cheating-detection-with-machine-learning-ml-a-mis-use-of-ml--doesnt-work (accessed on 14 August 2025).
- Lichess Feedback. Is an Average Centipawn Loss an Indicator of a Player Cheating? 2024. Available online: https://lichess.org/forum/lichess-feedback/is-an-average-centipawn-loss-an-indicator-of-a-player-cheating (accessed on 7 July 2025).
- Regan, K. Pattern Discrimination and Cognitive Modeling at Chess. University at Buffalo. 2023. Available online: https://cse.buffalo.edu/~regan/chess/ (accessed on 15 August 2025).
- Leite, R.V.; de Oliveira, A.V.C. Expected human performance behavior in chess using Centipawn loss analysis. In Proceedings of the International Conference on Human-Computer Interaction, Copenhagen, Denmark, 23–28 July 2023; Springer Nature: Cham, Switzerland, 2023. [Google Scholar]
- Thoresen, T. The Irwin Pattern Discrimination System. Lichess GitHub Wiki. 2021. Available online: https://github.com/clarkerubber/irwin (accessed on 14 August 2025).
- Iavich, M.; Kevanishvili, Z. A Neural Network Approach to Chess Cheat Detection. In International Conference on Information and Software Technologies; Springer Nature: Cham, Switzerland, 2024. [Google Scholar]
- Lipton, R.J. Should These Quantities Be Linear? Gödel’s Lost Letter and P = NP. Available online: https://rjlipton.com/2023/08/04/should-these-quantities-be-linear/ (accessed on 4 August 2023).
- Quaranta, L.; Calefato, F.; Lanubile, F. KGTorrent: A dataset of python jupyter notebooks from kaggle. In Proceedings of the 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), Madrid, Spain, 17–19 May 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
- Sultan, A.B.A.; Abu-Naser, S.S. Predictive Modeling of Breast Cancer Diagnosis Using Neural Networks: A Kaggle Dataset Analysis. Int. J. Acad. Eng. Res. 2023, 7, 1–9. [Google Scholar]
- Ghahfarokhi, M.M.; Asgari, A.; Abolnejadian, M.; Heydarnoori, A. DistilKaggle: A distilled dataset of Kaggle Jupyter notebooks. In Proceedings of the 21st International Conference on Mining Software Repositories, Lisbon, Portugal, 15–16 April 2024. [Google Scholar]
- Marquez-Neila, P.; Baumela, L.; Alvarez, L. A morphological approach to curvature-based evolution of curves and surfaces. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 2–17. [Google Scholar] [CrossRef] [PubMed]
- Arefi, A.; Nahvi, H. Stability analysis of an embedded single-walled carbon nanotube with small initial curvature based on nonlocal theory. Mech. Adv. Mater. Struct. 2017, 24, 962–970. [Google Scholar] [CrossRef]
- Alexandre, D.-J.; Bertails-Descoubes, F.; Thollot, J. Stable inverse dynamic curves. ACM Trans. Graph. 2010, 29, 1–10. [Google Scholar] [CrossRef]
- Nguyen, D.V.; Nguyen, A.T.; Nguyen, M.H.; Nguyen, L.Q.; Jiang, S.; Fetaya, E.; Tran, L.D.; Chechik, G.; Nguyen, T.M. Expert Merging in Sparse Mixture of Experts with Nash Bargaining. arXiv 2025, arXiv:2510.16138. [Google Scholar] [CrossRef]








| System/Approach | Core Mechanism | Key Metrics | Strengths | Weaknesses | Novelty of Proposed Framework |
|---|---|---|---|---|---|
| Proprietary Systems (e.g., chess.com) | Multi-layered Heuristics and Manual Review | CPL, Behavioral Patterns (opaque) | High accuracy; integrates non-move data (e.g., mouse movements) | Lack of transparency (“black box”) violates Kerckhoffs’s principle | Provides a transparent, explainable, and reproducible methodology. |
| Open-Source Systems (e.g., Lichess Irwin) | Neural Network (CNN + LSTM) on Engine Metrics | CPL, Move Match Probability (MMP) | Transparent, community-vetted, strong baseline | Models optimality but lacks an explicit model of human-likeness | Explicitly integrates a human behavioral model (Maia) via MMMP, adding a stylometric dimension. |
| Statistical Frameworks (e.g., Regan’s) | Contextual Statistical Modeling | Player-specific CPL variance, Game Phase | Highly interpretable, models individual baselines | Less suited for real-time, automated platform-level detection | Introduces a data-driven, deep learning-based classifier for pattern recognition at scale, augmented by a novel stability metric (∆S). |
| Proposed Hybrid Framework | Deterministic Evaluation + Behavioral Stylometry | CPL, MMMP, ∆S | Interpretable, robust, reduces false positives by discerning “strong” from “unnaturally strong” play | Computationally intensive due to dual-engine evaluation | N/A |
| Game Segment Type | Avg. ∆S | CPL Variance (σ²) | Interpretation |
|---|---|---|---|
| Human games | 0.972 | 184.3 | Stable, human-like consistency |
| Engine-optimized games | 1.004 | 312.7 | Volatile, engine-driven deviations |
| Model Type | Black Accuracy | Macro F1 | Validation Loss | Notes |
|---|---|---|---|---|
| Control Model | 87.5% | 0.87 | Fluctuating | Stockfish only |
| Experimental Model | 93.0% | 0.93 | Lower, stable | Stockfish + Maia features |
| Feature Set | Accuracy | Precision | Recall | Macro F1 | Notes |
|---|---|---|---|---|---|
| Stockfish only (CPL) | 87.5% | 0.89 | 0.86 | 0.87 | Strong baseline, lacks behavioral modeling |
| Maia only (CPL + MMMP) | 90.2% | 0.91 | 0.90 | 0.90 | Captures human-like deviations |
| ∆S only | 81.7% | 0.83 | 0.79 | 0.81 | Weak standalone, signals stability patterns |
| Stockfish + Maia | 92.6% | 0.94 | 0.91 | 0.92 | Complementary interaction |
| Stockfish + Maia + ∆S | 93.0% | 0.95 | 0.91 | 0.93 | Best performance, full hybrid |
| Model | Accuracy | Precision | Recall | F1-Score | Cross-Val F1 |
|---|---|---|---|---|---|
| Control | 0.875 [0.842, 0.908] | 0.880 [0.843, 0.917] | 0.875 [0.838, 0.912] | 0.870 [0.832, 0.908] | 0.862 ± 0.022 |
| Experimental | 0.930 [0.905, 0.955] | 0.930 [0.902, 0.958] | 0.930 [0.902, 0.958] | 0.930 [0.902, 0.958] | 0.925 ± 0.018 |
| Aspect | This Proof-of-Concept Study | Required for Applied Detection |
|---|---|---|
| Dataset | Synthetic benchmark with algorithmic labels for controlled pattern discrimination | Adjudicated real-world cheating cases with platform confirmation |
| Validation | Feature discriminability between behavioral archetypes (Maia-human vs. Stockfish-optimal) | Generalization to actual cheating in competitive play |
| Performance Claims | Pattern discrimination accuracy on controlled benchmark (F1 = 0.93) | Cheating detection accuracy on real-world data |
| Comparative Evidence | Improvement over Stockfish-only baseline in controlled setting | Superiority over established baselines (Irwin, Regan) on shared real data |
| Contribution | Methodological framework demonstrating hybrid feature value | Deployable system with proven operational utility |
| Next Step | Establishes foundation requiring the validation pathway in Section 7 | Implementation of said validation pathway |

