Applied System Innovation (ASI)
Article · Open Access

26 December 2025

A Hybrid Human-Centric Framework for Discriminating Engine-like from Human-like Chess Play: A Proof-of-Concept Study

Department of Computer Science, Caucasus University, 1 Paata Saakadze St, Tbilisi 0102, Georgia
* Author to whom correspondence should be addressed.

Abstract

The rapid growth of online chess has intensified the challenge of distinguishing engine-assisted from authentic human play, exposing the limitations of existing approaches that rely solely on deterministic evaluation metrics. This study introduces a proof-of-concept hybrid framework for discriminating between engine-like and human-like chess play patterns, integrating Stockfish’s deterministic evaluations with stylometric behavioral features derived from the Maia engine. Key metrics include Centipawn Loss (CPL), Maia Move Match Probability (MMMP), and a novel Curvature-Based Stability (ΔS) indicator. These features were incorporated into a convolutional neural network (CNN) classifier and evaluated on a controlled benchmark dataset of 1000 games, where ‘suspicious’ gameplay was algorithmically generated to simulate engine-optimal patterns, while ‘clean’ play was modeled using Maia’s human-like predictions. Results demonstrate the framework’s ability to discriminate between these behavioral archetypes, with the hybrid model achieving a macro F1-score of 0.93, significantly outperforming the Stockfish-only baseline (F1 = 0.87), as validated by McNemar’s test (p = 0.0153). Feature ablation confirmed that Maia-derived features reduced false negatives and improved recall, while ΔS enhanced robustness. This work establishes a methodological foundation for behavioral pattern discrimination in chess, demonstrating the value of combining deterministic and human-centric modeling. Beyond chess, the approach offers a template for behavioral anomaly analysis in cybersecurity, education, and other decision-based domains, with real-world validation on adjudicated misconduct cases identified as the essential next step.

1. Introduction

The COVID-19 pandemic of 2020 triggered a profound transformation in digital practices, including a rapid growth of interest in chess as an online activity [1,2]. With individuals confined to their homes, platforms such as Twitch and YouTube became key vehicles for broadcasting games and attracting new audiences. Events like PogChamps, hosted by popular content creators, further broadened chess’s appeal, particularly among younger and digitally engaged players. This revival amplified the challenges of distinguishing authentic human play from engine-assisted play in online competition. As traditional over-the-board tournaments moved to virtual environments, professional players were compelled to adjust to digital formats [3].
Simultaneously, the widespread accessibility of advanced chess engines revealed critical gaps in existing approaches for analyzing gameplay integrity. The 2022 match between Magnus Carlsen and Hans Niemann, where Niemann’s unexpected win ignited intense debate, highlighted these limitations [4]. Although no definitive evidence of misconduct emerged, the episode underscored both the fragility of trust in online chess and the need for more sophisticated analytical frameworks capable of distinguishing between human-like and engine-like play patterns.
To explore new pathways for strengthening fair-play analysis, this study proposes a proof-of-concept framework designed for discriminating between engine-like and human-like chess play. Current approaches are either proprietary and opaque (e.g., https://www.chess.com/ (accessed on 14 August 2025)) or open frameworks such as Lichess’ Irwin, which primarily rely on centipawn loss (CPL) or move-match probability (MMP) heuristics [5,6,7]. While useful, these heuristics often fail to capture subtler behavioral patterns that distinguish authentic human decisions from algorithmically optimized play. Our methodological exploration integrates deterministic evaluation from Stockfish with behavioral modeling from Maia, a neural network trained on human decision-making. The objective is not to function as a finished discrimination framework, but to operate as an analytical framework that tests the discriminative value of hybrid feature sets in distinguishing play patterns.
By combining technical accuracy metrics with human behavioral tendencies, this approach offers a potential pathway for platforms to better contextualize play, minimize false alarms, and reinforce confidence in online competition. The rest of this article is structured as follows: Section 2 reviews related system-based approaches; Section 3 states the problem and hypotheses; Section 4 outlines the architecture and data resources; Section 5 reports evaluation results; Section 6 discusses findings and implications; and Section 7 presents limitations and directions for future development.
Scope and Contribution of This Study: It is essential to clarify the methodological nature of this investigation. The evaluation presented here uses a controlled dataset where ‘suspicious’ behavior is algorithmically generated to simulate engine-optimal patterns, while ‘clean’ play is modeled using Maia’s human-like predictions. This serves as a necessary proof-of-concept benchmark to isolate and measure the discriminative power of proposed hybrid features. Consequently, this work should be interpreted as a methodological investigation into feature engineering for distinguishing behavioral patterns in chess, rather than a validated cheat discrimination framework. Validation on adjudicated, real-world cases represents the critical next step, which we outline in the conclusion.

2. State of the Art

2.1. Centipawn Loss and the Boundaries of Static Metrics

Centipawn Loss (CPL) has long served as a common statistical measure for assessing gameplay integrity. It indicates the extent to which a chosen move diverges from the engine’s optimal recommendation. Although widely adopted, CPL as a stand-alone indicator presents several inherent constraints.
The rapid improvement of chess engines has diminished the separation between top-level human decisions and engine evaluations, reducing the discriminative strength of average CPL values [6]. Many elite players now rely heavily on engines during training, and their high accuracy may represent preparation and study rather than irregular behavior. In addition, purely static CPL values often undervalue stylistic or contextually justified human choices, such as positional sacrifices or long-term strategic imbalances, that diverge from tactical correctness yet remain fully consistent with authentic play.

2.2. Benchmark Datasets and Their Constraints

One of the most frequently employed resources for model training and evaluation in this domain is the Kaggle chess-cheating dataset, curated by Dandoy [7]. This collection includes around 50,000 games categorized as either “suspicious” or “clean,” together with metadata such as player Elo ratings, centipawn loss (CPL) per move, and agreement levels with engine suggestions. It is important to note that these Elo values reflect the Lichess platform’s internal rating system. They are not directly aligned with official FIDE ratings and can vary substantially due to factors such as rating pool differences, inflation, and shifting baselines (e.g., the Sonas adjustment). Consequently, a Lichess score of 2100 may correspond to a considerably lower FIDE rating. In this work, rating categories are therefore treated as relative measures valid only within the Lichess environment. The dataset itself was compiled through thresholds combining high Move Match Probability (MMP) and low CPL values across multiple chess engines.
The scale and openness of this dataset make it attractive for reproducibility and rapid experimentation; however, several drawbacks must be acknowledged. The labeling process is not fully transparent and may reflect confirmation bias. In addition, the absence of contextual features, such as time usage or tournament setting, reduces its value for systems that aim to model authentic play beyond move-level accuracy. Another limitation is that the Kaggle dataset approximates “human” baselines by relying on Maia evaluations at fixed rating bands (≤1900). Although Maia is trained on a large corpus of human games, it remains a model and cannot fully reproduce the subtleties of genuine human decision-making. For this reason, the outcomes of our analysis are best understood as comparisons between Maia-simulated play and Stockfish evaluations, rather than as direct detection of real-world cheating. Expanding validation with datasets of verified human play, such as those utilized in Regan’s work or in case studies of confirmed infractions, represents a critical avenue for future research.

2.3. Stylometric Analysis and Human-Centric Modeling

A major recent development has been the use of stylometric principles, adapted from natural language processing, to capture the stylistic fingerprints of player behavior. Instead of evaluating only tactical correctness, these methods model whether move selections align with patterns characteristic of human decision-making. The Maia engine exemplifies this shift. Unlike Stockfish or Lc0, which are designed to maximize strength, Maia predicts the move that a human of a given Elo level is most likely to choose [8]. Its dual-headed residual neural network, trained on millions of human games, learns probability distributions over plausible moves rather than converging solely on the optimal continuation.
This probabilistic framework enables deeper behavioral analysis. Analogies can be drawn with the Giant Language Model Test Room (GLTR), which highlights text that appears “too optimal” or statistically unlikely to be human-generated [9]. In the chess context, stylometric methods similarly evaluate whether move sequences fit expected human distributions, extending beyond traditional optimality-based metrics.

2.4. Prior Research with Maia-Derived Features

Our earlier work combined Maia-derived features, such as the Maia Move Match Probability (MMMP), with Centipawn Loss (CPL) in a convolutional neural network (CNN) framework for gameplay analysis [10]. By integrating these stylometric signals with Stockfish-based indicators, the hybrid model successfully distinguished between natural and engine-influenced play. Experiments yielded approximately 98% accuracy on labeled datasets, outperforming baselines trained exclusively on Stockfish-derived metrics.
These results reinforced the idea that modeling human-likeness provides complementary evidence. Maia’s prediction of behaviorally probable moves, rather than tactically perfect ones, enhances the ability to flag play that is technically accurate but stylistically anomalous for a given profile.

2.5. The Niemann Case and the Fragility of CPL

The 2022 controversy surrounding Grandmaster Hans Niemann brought global attention to the shortcomings of static CPL-based indicators. Niemann’s win over World Champion Magnus Carlsen sparked numerous analyses, such as those by Leite, reporting unusually low CPL averages relative to his rating [11].
While provocative, these results revealed the pitfalls of over-reliance on CPL. Strong theoretical preparation, tactical sharpness, or deep familiarity with specific lines can all produce engine-like statistics without implying misconduct [12,13]. The episode demonstrated that statistical anomalies may raise suspicion but are insufficient for definitive judgment. More interpretable frameworks that combine behavioral modeling with traditional metrics are required to restore trust.

2.6. Platform-Level Discrimination Frameworks: Transparency vs. Rigidity

Commercial platforms illustrate two contrasting approaches. Chess.com employs a proprietary, multi-layered discrimination framework that mixes statistical thresholds, heuristic rules, and manual review. However, its opacity has generated criticism, especially given Kerckhoffs’s principle that systems should remain secure even if their design is public [14].
By contrast, Lichess deploys an open-source model called Irwin, which integrates CPL and MMP signals through a CNN + LSTM neural architecture [15,16,17]. While commendable for transparency, Irwin has been critiqued for limited adaptability. Community reports on moderation inconsistencies [12] and broader calls for nuance [13] underline the trade-off between openness and robustness.

2.7. Toward Greater Statistical Rigor: Curvature-Based Metrics

Extending the statistical frameworks pioneered by Regan [14], recent studies have explored second-order derivatives of CPL to measure what has been termed “curvature-based volatility” [15,18]. These metrics capture the stability of engine evaluations across different search depths, providing another signal for whether a move reflects human reasoning.
Stable recommendations across depths often suggest strategic understanding, whereas sharp volatility may reveal tactical dependence or exploratory probing more typical of engines [17]. Incorporating such dynamic features represents a step toward richer, context-aware detection frameworks that can balance interpretability with statistical rigor.

3. Problem Statement and Hypotheses

This study investigates whether integrating human-oriented evaluation features, specifically the Maia Move Match Probability (MMMP) and Centipawn Loss (CPL), into a classification framework enhances the discrimination between engine-like and human-like chess gameplay patterns. Unlike conventional approaches that rely solely on engine-derived accuracy measures (e.g., Stockfish evaluations), this work examines whether augmenting with Maia’s human-likeness signals provides measurable gains in predictive performance.
The guiding research question is:
To what extent do Maia-based MMMP and CPL features, in conjunction with Stockfish evaluations, improve the effectiveness of a convolutional neural network (CNN) classifier in discriminating between human-like and engine-like play patterns?
From this, we derive two hypotheses:
Null Hypothesis (H0).
Incorporating Maia-derived MMMP and CPL features does not produce a statistically significant improvement in classifier performance compared to models based only on Stockfish metrics.
Alternative Hypothesis (H1).
Incorporating Maia-derived MMMP and CPL features yields a statistically significant improvement, indicating that Maia effectively captures deviations from expected human decision-making.
To test these hypotheses, two CNN models are developed:
Full model, incorporating both Stockfish evaluations and Maia-based MMMP and CPL features.
Control model, trained on identical datasets but excluding Maia-derived indicators.
Both models are assessed using standard binary classification metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP),  Recall = TP / (TP + FN)
F1-Score = 2 · (Precision · Recall) / (Precision + Recall)
where
  • TP: True Positives
  • TN: True Negatives
  • FP: False Positives
  • FN: False Negatives
To evaluate whether performance improvements are statistically reliable, McNemar’s test is applied to compare outputs of the two models. This test determines whether the observed differences reflect genuine gains rather than random variance.
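The metrics and the exact (binomial) form of McNemar's test can be sketched in a few lines of Python. The helper names and the example counts in the usage below are ours, not from the study:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value from the discordant pairs:
    b = games only the full model classifies correctly,
    c = games only the control model classifies correctly.
    Concordant pairs carry no information for this test."""
    n, k = b + c, min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, with hypothetical discordant counts b = 35 and c = 15, the exact p-value falls below 0.01, which would reject H0 at the 5% level.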
The anticipated outcome is that the full model, enriched with Maia’s behavioral features, will surpass the control model across all key metrics. This would support the alternative hypothesis, demonstrating that human-centric modeling contributes significantly to the detection of non-natural gameplay. All experiments were conducted with Stockfish v15.0 fixed at depth 20 to ensure consistency, and Maia evaluations were drawn from Maia-1900 (release 2).

Comparative Synthesis and Novelty of the Proposed Framework

To clearly delineate the contribution of this work, Table 1 provides a systematic comparison of the proposed hybrid framework against mainstream discrimination frameworks across several dimensions: core mechanism, key evaluation metrics, inherent strengths, and primary weaknesses.
Table 1. Comparative Analysis of Cheat Detection Approaches in Online Chess.
As synthesized in Table 1, the fundamental distinction of our approach lies in its technical pathway: the hybrid integration of a deterministic evaluator (Stockfish) with a human-centric, behavioral model (Maia). While existing systems operate primarily on a single dimension of “correctness,” our framework introduces a second, orthogonal dimension of “human-likeness.” This dual perspective allows the system to evaluate not just how good a move is, but how human it is for a given context, thereby addressing the critical challenge of hybrid cheating and strong preparation that can confound single-dimension systems. The proposed ∆S metric further enriches this by quantifying the stability of a player’s decision-making process, a higher-order feature not captured by traditional metrics.

4. Methodology

4.1. Data Sources and Preprocessing

To ensure both scale and diversity, this study utilized gameplay data from two complementary sources. The first consisted of ~50,000 games retrieved via the official Lichess API, focusing on players within the 1900–2100 rating interval. This segment represents a competitive but varied level of play, allowing for the capture of high-quality decisions without overfitting to elite-only behavior. The games, obtained in PGN format, were parsed using custom scripts to extract position-level features.
In parallel, a publicly available Kaggle dataset of historical chess games was incorporated [19,20,21]. While this dataset provides pre-labeled categories (“suspicious” vs. “clean”), all positions were re-evaluated under a unified configuration with Stockfish v15.0 at depth 20. This step enabled the consistent computation of centipawn loss (CPL), Maia move match probability (MMMP), and curvature-based indicators.
To reduce noise from extreme imbalances, we applied a normalization strategy. The scaled centipawn loss (sCPL) for each move was calculated as:
sCPL = e / (1 + |v|)
where e is the raw CPL and v is the engine evaluation of the best move (in pawns). This scaling mitigates distortions in balanced positions (|v| ≈ 0) and avoids inflation in extreme cases. Following established practice, positions with |v| > 3.0 were excluded.
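The scaling and the exclusion rule can be expressed as a small helper. The function name and the convention of returning None for excluded positions are illustrative choices, not part of the paper:

```python
def scaled_cpl(raw_cpl, best_eval_pawns, cutoff=3.0):
    """Scaled centipawn loss sCPL = e / (1 + |v|).

    raw_cpl: raw centipawn loss e for the played move.
    best_eval_pawns: engine evaluation v of the best move, in pawns.
    Positions with |v| > cutoff are excluded (returns None), matching
    the |v| > 3.0 filter described in the text.
    """
    v = abs(best_eval_pawns)
    if v > cutoff:
        return None  # decided positions are dropped from the dataset
    return raw_cpl / (1.0 + v)
```

In a balanced position (v = 0) the raw loss passes through unchanged; at v = 1.0 pawns it is halved, damping noise in already-decided games.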
A known limitation of the dataset used for the final evaluation is its scale and composition. The core test set consisted of 1000 games, with all 500 instances labeled as ‘suspicious’ corresponding to gameplay by the Black player. This imbalance was mitigated by training and evaluating separate, color-specific models to prevent the classifier from learning this spurious correlation. Furthermore, the ‘suspicious’ labels were generated based on statistical thresholds (high MMP and low CPL) rather than confirmed real-world cheating incidents. Therefore, the models are evaluated on their ability to distinguish between different behavioral patterns (Maia-like human play vs. Stockfish-like optimal play) within a constrained benchmark.
It is important to clarify the nature of our dataset labels. ‘Suspicious’ indicates gameplay that statistically aligns with Stockfish-optimal patterns (low CPL, high move-match probability), while ‘clean’ indicates gameplay aligned with Maia’s human-modeled predictions. This creates a controlled benchmark for pattern discrimination rather than a dataset of confirmed adjudicated misconduct cases.

4.2. Human-Centric Modeling with Maia

The Maia neural network, a human-play prediction model, was employed as a second evaluation engine. The open-source repository was cloned and configured using Python 3.14-based tools, with PGN preprocessing facilitated by pgn-extract and conversion scripts. Model training pipelines were adapted from the original Maia implementation, with cleaned games translated into Maia-compatible training data and further refined for experimental purposes.
The environment was deployed on an Ubuntu workstation with CUDA and cuDNN support, supplemented by optimized math libraries (OpenBLAS, OpenCL) and standard build tools. Leela Chess Zero (Lc0 v0.23) served as the underlying host engine, ensuring compatibility between Maia’s prediction architecture and Stockfish’s evaluation outputs.

4.3. Feature Extraction and Model Training

A hybrid feature set was constructed by combining:
Engine-derived metrics: CPL, curvature-based stability.
Human-likeness metrics: Maia’s MMMP and derived behavioral signals.
The overall pipeline is summarized in Algorithm 1.
Algorithm 1. Hybrid Feature Extraction and Training Framework
Parse PGN games into structured position sequences.
For each position:
 a. Evaluate with Stockfish → record CPL, top-k move scores.
 b. Evaluate with Maia → record MMMP and probability distribution.
Derive stability measures across search depths.
Construct feature tensors of size (n × d × k), where n = positions, d = depth, k = candidate moves.
Train CNN classifiers:
 • Stockfish-only features.
 • Maia-only features.
 • Combined hybrid features.
Evaluate using accuracy, precision, recall, F1-score, and McNemar’s test.
The extraction complexity is O(n · d · k), with typical parameters n ≈ 40 moves/game, d ≈ 20, k ≈ 4. In practice, processing 1000 games required ~4.2 GPU-hours on a single RTX 3080, with model convergence achieved within ~35 min.
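Assuming an intermediate format in which each position is a (depth × k) grid of candidate-move evaluations (the paper does not specify its internal layout), step 4 of Algorithm 1 might look like the following sketch:

```python
import numpy as np

def build_feature_tensor(position_grids, depth=20, k=4):
    """Stack per-position (depth x k) evaluation grids into the
    (n x d x k) tensor of Algorithm 1.

    position_grids: iterable of array-likes, one per analyzed position,
    each holding evaluations of the top-k candidate moves at every
    search depth. The input layout is an assumption for illustration.
    """
    tensor = np.stack([np.asarray(g, dtype=np.float32) for g in position_grids])
    if tensor.shape[1:] != (depth, k):
        raise ValueError(f"expected ({depth}, {k}) grids, got {tensor.shape[1:]}")
    return tensor  # shape (n, depth, k)
```

Stacking n grids of size d × k touches every entry once, which matches the stated O(n · d · k) extraction complexity.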

4.4. Design of Evaluation Metrics: CPL and MMMP

The evaluation phase combined outputs from Stockfish and Maia to derive two main indicators: Centipawn Loss (CPL) and Maia Move Match Probability (MMMP). To ensure consistency and avoid calibration drift, all runs were fixed to Stockfish v15.0 with a maximum depth of 20 plies. This decision eliminated the version-to-version fluctuations that can arise from later releases (e.g., v15.1) and ensured that performance comparisons between Stockfish-only and hybrid models reflected the feature set rather than changes in engine scaling. Such bounded-depth configurations are also in line with established practice in prior studies [14].
Games were parsed sequentially in PGN format, and CPL was recorded by measuring the deviation between the played move and the engine’s strongest recommendation. To better account for context, CPL values were normalized using the absolute score of the best move and positions with |v| > 3.0 were excluded. This adjustment reduces noise from extreme win/loss scenarios and yields error distributions more representative of realistic human play.
The MMMP measure was defined by whether a player’s move appeared among the engine’s top three suggestions. Moves matching the second or third choice were only credited if they were within 0.50 pawns of the optimal evaluation, preventing inflated alignment from low-quality alternatives. As a result:
MMMP = 1 for exact top-move matches,
MMMP = 2–3 for near-optimal second/third matches, and
MMMP = 4 when the move did not align with any of the top three.
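The categorization above can be sketched as a small function. The argument layout is an assumption, since the paper does not give an implementation:

```python
def mmmp_category(played_move, top_moves, top_evals, tol_pawns=0.50):
    """Assign the MMMP category described in the text.

    top_moves: the engine's top-3 moves, best first.
    top_evals: their evaluations in pawns, same order.
    Returns 1 for an exact top-move match; 2 or 3 for a second/third
    choice within 0.50 pawns of the best evaluation; 4 otherwise.
    """
    if played_move == top_moves[0]:
        return 1
    for rank in (1, 2):  # second and third engine choices
        if (played_move == top_moves[rank]
                and abs(top_evals[0] - top_evals[rank]) <= tol_pawns):
            return rank + 1
    return 4
```

Note that a move matching the third engine choice still receives category 4 if that choice is more than 0.50 pawns worse than the best move, which is exactly the anti-inflation refinement described above.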
These refinements (value-aware normalization for CPL and proximity-based thresholds for MMMP) produced metrics that more faithfully capture human-like decision patterns. When aggregated across players, the adjusted measures showed stronger correlation with rating strength and provided a more stable foundation for subsequent modeling.

4.5. CPL and MMMP Analysis

The analysis of evaluation metrics was conducted using pandas for data processing and matplotlib for visualization. For each player, average CPL values were computed separately for games played with White and Black pieces, allowing us to assess whether color-specific dynamics influenced error distributions. These averages were then aggregated to produce comparative statistics across both “clean” and “suspicious” categories of gameplay.
To better understand variability, the distribution of CPL values was examined through histograms and density plots. This enabled the exploration of consistency, variance, and skewness across the two groups. In particular, suspicious games were expected to exhibit unusually low dispersion, reflecting unnaturally stable play patterns. By contrast, human play, especially at intermediate rating levels, typically shows broader spread and occasional outliers due to tactical oversights or strategic experimentation.
The mean CPL was employed as a primary proxy for overall gameplay quality, consistent with prior research. However, this value alone can be misleading if not contextualized by position type. For that reason, our analysis was conducted on normalized CPL values (as described earlier), ensuring that trivial positions with extreme |v| values did not distort results.
Parallel to CPL, the MMMP distributions were analyzed at the player level. For each game, the proportion of moves falling into categories 1–4 (top choice, near-optimal second/third, or outside top three) was computed and visualized. This breakdown provided a clearer behavioral profile: legitimate players often fluctuate between categories, while suspicious profiles tend to cluster disproportionately in category 1. The refinement of requiring near-optimality (≤0.50 pawns) for second/third matches ensured that inflated matches were not counted, making MMMP more discriminative.
By combining CPL and MMMP, we obtained a two-dimensional behavioral space. Players with consistently low CPL and high proportions of MMMP = 1 moves fell outside the expected range of natural variance. Scatter plots of CPL against MMMP revealed clear separation between typical player populations and those flagged as suspicious.
Statistical testing was applied to validate these observations. The Kolmogorov–Smirnov test was used to compare distributional differences, while t-tests were employed to determine whether mean CPL values differed significantly between groups. This quantitative validation confirmed that suspicious games displayed statistically distinct patterns from the baseline dataset, reinforcing the value of CPL and MMMP as complementary indicators of gameplay integrity.
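Under the setup described, the two statistical tests might be run with SciPy as follows. This is a sketch with our own variable names; Welch's unequal-variance t-test is used as a conservative default, since the text does not state which t-test variant was applied:

```python
from scipy.stats import ks_2samp, ttest_ind

def compare_cpl_groups(clean_cpl, suspicious_cpl):
    """Compare per-game CPL samples between 'clean' and 'suspicious' groups:
    a Kolmogorov-Smirnov test for distributional differences and a
    Welch t-test for the difference in means."""
    ks_stat, ks_p = ks_2samp(clean_cpl, suspicious_cpl)
    t_stat, t_p = ttest_ind(clean_cpl, suspicious_cpl, equal_var=False)
    return {"ks_stat": ks_stat, "ks_p": ks_p, "t_stat": t_stat, "t_p": t_p}
```

Small p-values from both tests indicate that the groups differ in distributional shape as well as in mean error, which is the pattern the study reports.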

4.6. Curvature-Based Stability (∆S)

The Curvature-Based Stability (ΔS) metric is theoretically grounded in the cognitive differences between human and engine decision-making [22,23]. Human players typically employ progressive deepening, a cognitive process where evaluations stabilize as positions are considered more deeply, reflecting strategic understanding and pattern recognition. In contrast, chess engines may exhibit evaluation volatility, where scores fluctuate significantly across search depths as new tactical possibilities are discovered. This volatility often indicates brute-force calculation rather than conceptual understanding [24,25].
The ΔS metric quantifies this stability by measuring the second derivative of evaluation scores across increasing engine depths. Formally, it captures the acceleration of score changes: low ΔS values indicate smooth, convergent evaluation patterns characteristic of human-like reasoning, while high ΔS values suggest the volatile, exploratory patterns typical of engine calculation. This provides a mathematical basis for distinguishing between understanding-driven and calculation-driven decision processes.
Each MMMP dataset was structured as a 3D tensor:
M ∈ ℝ^(d × m × r),
where:
  • d = engine depth
  • m = number of moves per game
  • r ∈ {1, 2, 3, 4} = ranking of the move
To derive a scalar stability signature ∆S, the following steps were applied:
1. First derivative with respect to depth, capturing how prediction probabilities change across increasing depths:
∆M_{d,m,r} = M_{d+1,m,r} − M_{d,m,r}
2. Second derivative (curvature), indicating the acceleration or volatility of those changes:
∆²M_{d,m,r} = M_{d+1,m,r} − 2M_{d,m,r} + M_{d−1,m,r}
3. Aggregated scalar ∆S, computed by averaging the squared second derivatives over all valid indices:
∆S = (1 / ((D − 2) · M · R)) Σ_{d=2}^{D−1} Σ_{m=1}^{M} Σ_{r=1}^{R} (∆²M_{d,m,r})²
where D, M, and R denote the numbers of depths, moves, and tracked candidate moves, respectively.
The constructed tensor stores raw engine evaluations (in pawns) for the top four legal moves at each search depth. Raw values were retained instead of probabilities to allow consistent derivation of CPL, scaling by |v|, and curvature-based measures. To ensure stability, the set of candidate moves was fixed at the maximum search depth and tracked across shallower depths, even if their relative ranking shifted. This design prevents instability that would occur if the top-k set were redefined at each depth, and enables calculation of curvature (∆S) and stability metrics that capture the evolution of evaluations for the same moves across depths.
This structure enhances interpretability: smoother evaluation trajectories across depths suggest greater consistency and are more likely to align with human-like reasoning. As an initial validation, we compared ∆S values against CPL variance in matched game segments (Table 2). The results showed that positions with lower ∆S exhibited systematically lower CPL volatility, indicating that prediction stability correlates with more natural error distributions.
Table 2. Relationship between ∆S and CPL variance. Positions with lower ∆S exhibit reduced CPL volatility, suggesting greater stability in evaluation curves and stronger alignment with human-like decision patterns.
While these findings provide encouraging evidence, they remain preliminary. In the current study, ∆S is introduced as a conceptual auxiliary feature rather than a fully validated metric. Its theoretical motivation is strong, serving as a measure of prediction volatility where lower values reflect smoother transitions across depths, but empirical testing remains limited. Future research will extend this work by isolating ∆S’s contribution and evaluating its combined effect with CPL and MMMP across larger datasets and rating bins.
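Given the tensor layout described in this section, ∆S reduces to a few NumPy operations. This sketch assumes evaluations are already assembled into a (D × m × R) array; the function name is ours:

```python
import numpy as np

def stability_signature(M):
    """Curvature-based stability from an evaluation tensor.

    M: array of shape (D, m, R) holding raw evaluations (in pawns) of
    the top-R tracked candidate moves at each of D search depths.
    Returns the mean squared second difference along the depth axis,
    i.e. the average of (delta^2 M)^2 over all valid (d, m, r) indices.
    """
    # second-order finite difference in depth: M[d+1] - 2*M[d] + M[d-1]
    d2 = M[2:, :, :] - 2.0 * M[1:-1, :, :] + M[:-2, :, :]
    return float(np.mean(d2 ** 2))
```

A tensor whose evaluations change linearly with depth has zero curvature and hence ∆S = 0; volatile, oscillating evaluations inflate the score, matching the interpretation given above.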

4.7. CNN Modeling

The classification framework was implemented in TensorFlow/Keras and trained on Google Colab. Preprocessed data was uploaded as CSV files, cleaned of missing entries, and standardized with StandardScaler(). The resulting feature sets were reshaped into 3D tensors of the form (samples, features, 1).
The CNN architecture followed a hierarchical design optimized for sequential pattern recognition. It consisted of:
Three Conv1D layers (filters: 64, 128, 256)
BatchNormalization for stable convergence
MaxPooling1D to reduce dimensionality
Dropout layers for regularization
Flatten + Dense (128) for feature integration
Sigmoid output layer for binary classification
Models were trained for 50 epochs with a 20% validation split. To capture potential asymmetries in decision-making, two separate classifiers were developed for games played with White and Black pieces.
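A minimal Keras sketch of the described stack follows. Kernel sizes, dropout rates, pooling placement, and the optimizer are not specified in the text, so the values below are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(n_features):
    """Conv1D binary classifier following the layer list in Section 4.7.
    Hyperparameters not stated in the paper (kernel size 3, dropout 0.3,
    Adam optimizer) are illustrative choices."""
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.Conv1D(256, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),  # binary output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Two instances of this model would be trained separately on White-side and Black-side games, as described above.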

4.8. Statistical Validation Framework

To ensure the robustness of our findings given the dataset size, we employed multiple statistical validation techniques:
K-fold Cross-Validation: We performed 5-fold cross-validation on the entire dataset to assess model stability and reduce variance in performance estimates.
Confidence Intervals: We calculated 95% confidence intervals for all performance metrics using bootstrapping with 1000 resamples.
Multiple Hypothesis Testing: For the McNemar’s test, we report both the standard p-value and a bootstrapped p-value to verify the result’s stability.
These measures provide a more comprehensive understanding of model performance beyond single train-test split metrics.
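The bootstrap confidence intervals can be sketched with the standard library alone. The percentile method and the use of accuracy as the bootstrapped statistic are illustrative choices; the paper does not state the exact resampling scheme:

```python
import random

def bootstrap_ci(correct_flags, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for accuracy.

    correct_flags: list of 0/1 per-game correctness indicators.
    Resamples the games with replacement n_resamples times and returns
    the (alpha/2, 1 - alpha/2) percentiles of the resampled accuracies.
    """
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = sorted(
        sum(rng.choices(correct_flags, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

The same resampling loop, applied to the per-game disagreement between the two models, yields the bootstrapped McNemar p-value mentioned above.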

4.9. Predictive Inference and Control Setup

Inference was conducted by thresholding classifier outputs at 0.5, designating predictions above this value as “engine-assisted.” Beyond simple binary labels, the raw probability distributions were retained, enabling the calculation of confidence intervals and per-game reliability scores. By aggregating variance across individual moves, the framework reports not only a binary outcome but also a measure of prediction certainty—an important feature for reducing false positives, particularly in low-cheating environments such as over-the-board tournaments.
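A per-game aggregation consistent with this description might look like the following sketch. The specific reliability formula (one minus the move-level standard deviation) is our illustrative choice, not the authors' exact formulation:

```python
from statistics import mean, pstdev

def game_verdict(move_probs, threshold=0.5):
    """Aggregate per-move classifier probabilities into a game verdict.

    move_probs: per-move probabilities of engine assistance in [0, 1].
    Returns (label, mean probability, reliability), where reliability
    falls as move-level variance grows, flagging uncertain verdicts.
    """
    p = mean(move_probs)
    label = "engine-assisted" if p > threshold else "clean"
    reliability = 1.0 - pstdev(move_probs)  # low variance -> high certainty
    return label, p, reliability
```

A game with uniformly high per-move probabilities thus yields both a positive label and a high reliability score, whereas a game whose probabilities swing widely is flagged as less certain even if its mean crosses the threshold.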
A control model, trained without Maia-derived features, provided a comparative baseline. This allowed direct statistical evaluation of Maia’s contribution to overall performance. Evaluation metrics included accuracy, precision, recall, F1-score, and confusion matrices.
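The inference step above can be sketched as follows. The 0.5 threshold is from the text; the exact aggregation into a per-game certainty score is not specified, so the variance-based measure below is an illustrative assumption:

```python
# Hedged sketch of per-game inference: threshold the mean per-move
# probability at 0.5 and report a certainty score derived from the
# move-level variance (assumed aggregation, not the paper's exact rule).
import numpy as np

def classify_game(move_probs, threshold=0.5):
    p = np.asarray(move_probs, dtype=float)
    mean_p = float(p.mean())
    verdict = "engine-assisted" if mean_p > threshold else "clean"
    certainty = 1.0 - float(p.var())   # low move-to-move variance -> high certainty
    return verdict, mean_p, certainty

verdict, mean_p, certainty = classify_game([0.91, 0.88, 0.95, 0.90])
```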
For reproducibility, the main algorithmic pipeline can be summarized as:
Parsing PGN files.
Computing Stockfish and Maia evaluations across multiple depths.
Deriving CPL, MMMP, and ∆S features.
Structuring extracted features into tensor format.
Training CNN models with and without Maia inputs.
Evaluating performance using standard classification metrics.
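Step 3 of the pipeline can be illustrated for the CPL feature. The per-move loss and the |v| > 3.0 exclusion follow the text; the exact scaling constant is an assumption, and the MMMP and ∆S formulas are not reproduced here:

```python
# Illustrative derivation of per-move centipawn loss (CPL) and a scaled
# per-game average; the scale factor is an assumption for the sketch.
def centipawn_loss(best_eval_cp, played_eval_cp):
    """CPL = drop from the engine-best evaluation to the played move (>= 0)."""
    return max(0, best_eval_cp - played_eval_cp)

def scaled_cpl(losses_cp, evals_pawns, cutoff=3.0, scale=100.0):
    """Average CPL in pawn units over positions with |eval| <= cutoff."""
    kept = [loss / scale for loss, v in zip(losses_cp, evals_pawns)
            if abs(v) <= cutoff]
    return sum(kept) / len(kept) if kept else 0.0

# Position with |eval| = 4.0 pawns is excluded per the cutoff rule.
avg = scaled_cpl([30, 100, 20], [0.5, 4.0, -1.0])
```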
Runtime profiling on a single NVIDIA RTX 3080 GPU indicated that feature extraction for 1000 games required approximately 4.2 GPU-hours, while CNN training converged within ~35 min per model. While computationally heavier than CPL-only systems, the framework remains feasible for research-scale studies and demonstrates the scalability of incorporating human-centric features into cheat detection pipelines.

4.10. Comparison with Established Baseline Methods

To contextualize our results within the existing literature, we define a comparison framework against established baseline methods. While direct performance comparison is limited by dataset differences (our synthetic benchmark vs. real-world datasets used in other studies), we establish methodological comparisons:
Lichess Irwin Baseline: The open-source Irwin system [16] serves as a key benchmark, employing CNN + LSTM architectures on CPL and MMP features. For comparison, we would require: (1) access to identical real-world datasets with adjudicated labels, (2) re-implementation or API access to Irwin’s detection pipeline, and (3) evaluation using consistent metrics (F1-score, precision-recall curves).
Statistical Baselines (Regan’s Method): Regan’s contextual statistical modeling [14] provides an interpretable, player-specific baseline focusing on CPL variance and game phase analysis. Comparison would involve implementing Regan’s ELO-modeled expected move distributions and evaluating them on the same position sequences.
Proprietary Systems (Chess.com, https://www.chess.com/, accessed on 14 August 2025): While direct comparison is not feasible due to opacity, we acknowledge that commercial systems integrate additional signals (mouse movements, timing data) not captured in our move-only analysis.
Current Limitation and Future Comparison Plan: Our study’s use of a synthetic benchmark prevents direct numerical comparison with these systems at present. In future work, we will (1) obtain adjudicated datasets used in prior studies where possible, (2) implement the Irwin architecture on our feature set for controlled comparison, and (3) report relative improvement over these baselines on shared test data. This represents an essential step for establishing the practical utility of our hybrid approach beyond controlled pattern discrimination.

5. Results

5.1. Post-Processed Data Pattern Analysis

To evaluate the informativeness of Maia-derived features for classification, an exploratory analysis was carried out on both Centipawn Loss (CPL) and Mismatch Move Match Probability (MMMP) values. These indicators were extracted from evaluations produced by Stockfish and Maia across a set of online chess games.
The analysis compared engine-derived metrics across two categories of games: those played by regular human players and those suspected of involving external assistance. The objective was to determine whether the selected features exhibit class-separable patterns, thereby confirming that they contain signals relevant to gameplay authenticity prior to classifier training.

5.1.1. CPL Patterns in Human Games

In games considered to be free of external support, Maia consistently reported lower scaled CPL values than Stockfish. As shown in Figure 1, this outcome suggests that Maia, trained on large datasets of human games, evaluates positions in a way that is more consistent with human decision-making and regards common human moves as relatively optimal.
Figure 1. Average CPL values reported by Maia and Stockfish for White and Black moves in games assumed to be played without engine assistance.
Stockfish, in contrast, applies stricter criteria and penalizes the same moves more heavily due to its objective assessment of tactical accuracy. This difference highlights Maia’s potential value as a behavioral benchmark, providing insights into what can be regarded as natural or expected within human strategic reasoning.

5.1.2. CPL Patterns in Engine-Optimized Play

For games exhibiting engine-optimal patterns (Figure 2), the pattern shifts when using the scaled metric. In this case, Stockfish reports lower average sCPL values than Maia for both White and Black moves. This indicates that the played moves align more closely with Stockfish’s optimal recommendations and deviate from the human-style predictions generated by Maia. Although scaling and the exclusion of positions with ∣v∣ > 3.0 reduce extreme values, the separation between the two evaluators remains clear.
Figure 2. Average CPL values from Maia and Stockfish for White and Black moves in games exhibiting engine-optimal statistical patterns.
This reversal is highly suggestive. While the moves appear strong under Stockfish’s assessment, their divergence from Maia’s human-centered model points to behavior that may not be authentic. The result reinforces the value of Maia as a behavioral reference point, capable of highlighting unnatural gameplay patterns that remain undetected under purely technical evaluation criteria.

5.1.3. Comparative ΔS Analysis: Human vs. Engine-Optimized Patterns

A rigorous statistical analysis of the ΔS distributions provides strong quantitative evidence that human decisions align more closely with Maia’s modeling than with Stockfish’s optimal evaluations, while engine-optimized patterns exhibit fundamentally different characteristics (Figure 3 and Figure 4). The ΔS values for human games exhibit a distinct distribution with normalized mean ΔS of 0.969 ± 0.022 (mean ± standard deviation), interquartile range (IQR) of 0.014, coefficient of variation of 2.27%, and 95% confidence interval of [0.965, 0.973]. The values show a strongly positive skew, with 99% of observations falling above zero, indicating tight clustering around the mean. In contrast, engine-optimized patterns display mathematical perfection with ΔS ≡ 1.000 ± 0.000, zero variance, range = 0, IQR = 0, and 100% of values at maximum ΔS.
Figure 3. Difference in Black move accuracy between Stockfish and Maia across games with odd digit-sum indices. Higher values reflect greater divergence from Maia’s human-aligned predictions.
Figure 4. Accuracy difference for Black moves between Stockfish and Maia in suspected engine-assisted games. Positive values indicate stronger alignment with Stockfish, while negative values indicate divergence from Maia’s human-trained model.
The near-universal positivity of human ΔS values (99% > 0) quantitatively demonstrates that Maia provides more stable and consistent prediction landscapes across search depths compared to Stockfish. The exceptionally low standard deviation (±0.022) and narrow IQR (0.014) indicate that this pattern is remarkably consistent across human games, suggesting a fundamental property of human decision-making rather than random variation. This statistical profile directly supports the conclusion that human decisions align more closely with Maia’s modeling (Figure 3). The stability of Maia’s evaluations across depths (low ΔS volatility) reflects the characteristic pattern of human reasoning, where moves are selected based on strategic understanding rather than deep tactical calculation.
Conversely, the complete concentration at ΔS = 1.000 for engine-optimized patterns demonstrates that Stockfish produces perfectly stable and consistent predictions across all search depths for these games (Figure 4). This mathematical perfection is behaviorally anomalous—human decision-making inherently contains subtle variations and inconsistencies that manifest as natural variance in ΔS values. The absolute consistency observed here is statistically improbable for human cognition and strongly indicates algorithmically generated moves. While Stockfish evaluates these moves as stable and optimal across all depths, Maia—trained on human behavioral patterns—correctly identifies them as statistically atypical. The perfect ΔS stability (1.000 ± 0.000) represents a mathematical fingerprint of engine assistance, fundamentally different from the natural variance observed in human games (0.969 ± 0.022).
The binary separation between these distributions (human ΔS ≈ 0.97 with natural variance vs. engine-optimized ΔS ≡ 1.00 with zero variance) provides a powerful quantitative basis for pattern discrimination. This clear statistical dichotomy reinforces Maia’s role as a behavioral baseline capable of detecting deviations from human decision-making norms that remain invisible to traditional optimality-based metrics.
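The distributional summaries above (mean, standard deviation, IQR, coefficient of variation, share of positive values) can be reproduced for any vector of per-game ΔS scores as follows; the sample data here are synthetic stand-ins generated to match the reported profiles:

```python
# Summary statistics for a Delta-S sample, mirroring the metrics reported
# in the text; the input vectors are synthetic illustrations.
import numpy as np

def summarize(delta_s):
    x = np.asarray(delta_s, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "mean": float(x.mean()),
        "std": float(x.std(ddof=1)),                      # sample SD
        "iqr": float(q3 - q1),
        "cv_percent": float(100 * x.std(ddof=1) / x.mean()),
        "share_positive": float(np.mean(x > 0)),
    }

rng = np.random.default_rng(0)
human = summarize(rng.normal(0.969, 0.022, size=500))     # human-like profile
engine = summarize(np.ones(500))                          # engine-optimal profile
```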

5.1.4. Empirical Validation of ΔS Metric

To empirically validate the ΔS metric’s discriminative power, we conducted additional correlation analysis between ΔS values and established behavioral indicators. The results demonstrate ΔS’s utility as a complementary signal:
Strong negative correlation with human-likeness scores: ΔS vs. Maia alignment: r = −0.78, p < 0.001
Moderate positive correlation with engine optimality: ΔS vs. Stockfish CPL: r = 0.62, p < 0.001
Distinct distribution patterns: Human games showed ΔS clustering (μ = 0.969, σ = 0.022) while suspected engine games exhibited perfect stability (ΔS ≡ 1.000)
These empirical patterns confirm ΔS’s theoretical foundation: human-like play demonstrates natural evaluation stability, while engine-assisted play shows either perfect stability (following optimal lines) or abnormal volatility (exploratory calculation).
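The correlation analysis above is a standard Pearson computation; a minimal sketch follows, using synthetic stand-ins for per-game ΔS and Maia-alignment scores (constructed with a strong negative relationship, analogous in sign to the reported r = −0.78):

```python
# Pearson correlation sketch; arrays are synthetic illustrations, not the
# study's data.
import numpy as np

rng = np.random.default_rng(42)
delta_s = rng.normal(0.97, 0.02, size=200)                 # stand-in Delta-S
maia_alignment = -3.0 * delta_s + rng.normal(0.0, 0.02, size=200)

r = float(np.corrcoef(delta_s, maia_alignment)[0, 1])      # Pearson r
```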

5.2. Model Evaluation

To determine the impact of Maia-derived features on pattern discrimination, two convolutional neural network (CNN) models were developed and tested. The control model relied exclusively on Stockfish-based indicators, while the experimental model incorporated both Stockfish metrics and additional features derived from Maia. Each model was trained and evaluated separately for White and Black positions in order to avoid color-related bias and to ensure balanced assessment.

5.2.1. Experimental Model Performance

  • The hybrid model that included Maia’s behavioral features demonstrated strong predictive capability.
  • White-side task: During early epochs, the network showed signs of underfitting, but by the end of training (epoch 50) it reached a validation accuracy of 99.37% with a loss of 0.0127. Training accuracy remained consistently high, confirming that the model achieved a stable fit without severe overfitting. Figure 5 presents the accuracy and loss curves, illustrating rapid convergence and strong learning stability.
    Figure 5. Training and validation accuracy (left) and loss (right) for the White-side experimental model across 50 epochs.
  • Black-side task: Training progressed more gradually, with accuracy stabilizing at about 93% after 120 epochs. Although the validation loss fluctuated during the initial phase, it settled into a more stable pattern over time, indicating robust generalization even in the more complex detection setting (Figure 6).
    Figure 6. Training and validation accuracy (left) and loss (right) curves for the Black-side experimental model over 120 epochs.
  • Test evaluation:
  • White-side classification produced perfect scores (accuracy = 1.00, F1-score = 1.00 for class 0). However, this result reflects the absence of positive cases in the White-side test data.
  • Black-side classification achieved accuracy = 0.93 with precision of 0.95 (class 0) and 0.91 (class 1), recall of 0.91 (class 0) and 0.95 (class 1), and a macro-averaged F1-score = 0.93.
  • These findings confirm that the expanded feature space, which integrates Maia’s human-aligned modeling, improves performance and helps capture subtle behavioral differences not visible with Stockfish alone.
The confusion matrix and additional performance visualizations are included in Figure 5 and Figure 6, providing further insight into classification distribution and model reliability.

5.2.2. Control Model Performance

The baseline model, trained exclusively on Stockfish-derived features, converged more quickly but lacked the robustness of the experimental version.
White-side task: Accuracy reached 100% by epoch 30, as shown in Figure 7. However, this apparent success is misleading, since all White-side games in the dataset were labeled as non-cheating, creating a trivial classification task.
Figure 7. Training and validation accuracy (left) and loss (right) curves for the White-side control model over 50 epochs.
Black-side task: Convergence occurred earlier than in the experimental model, but validation loss fluctuated more strongly across epochs, suggesting weaker generalization and mild overfitting (Figure 8).
Figure 8. Training and validation accuracy (left) and loss (right) curves for the Black-side control model over 120 epochs.
The evaluation on the test set produced the following results:
Black-side task: accuracy = 0.875 with precision of 0.89 (class 0) and 0.87 (class 1), recall of 0.86 (class 0) and 0.89 (class 1), and macro-averaged F1-score = 0.87.
The control model underperformed compared to the experimental model, particularly in terms of precision, recall, and stability. This demonstrates that Maia-derived features enhance discrimination power and reduce false predictions.

5.2.3. Comparative Summary and Hypothesis Testing

Both models achieved perfect results on White-side detection (accuracy = 1.00, F1-score = 1.00). However, this outcome is a direct artifact of the dataset’s fundamental limitation: all labeled cheating instances occurred in Black-side games, making White-side classification trivial. All subsequent analysis and the primary conclusions of this study are therefore based exclusively on the Black-side detection task, which provides a meaningful, non-trivial evaluation in which the model must distinguish behavioral patterns from move features alone, without color-based shortcuts. The promising performance on this task demonstrates the potential of the hybrid feature set, though its generalizability to a balanced, real-world dataset requires future validation. A summary of dataset composition and experimental configurations is provided in Table 3.
Table 3. Model Comparison: Impact of Maia Features on Discriminating Engine-Optimized Play Patterns (Black-Side).
The integration of Maia-derived features produced clear benefits, including higher F1-scores, a reduction in false negatives, and faster as well as smoother convergence during training. To evaluate whether these improvements were statistically significant, a McNemar’s test was applied, comparing the predictions of the control and experimental models on the identical Black-side test set.
From the paired predictions, the following outcomes were observed:
Cases where the control model was correct and the experimental model was incorrect.
Cases where the control model was incorrect and the experimental model was correct.
Using McNemar’s test with continuity correction, the difference between the two models was tested for statistical significance.
χ² = (|b − c| − 1)² / (b + c) = (|3 − 14| − 1)² / (3 + 14) = 100/17 ≈ 5.88
The McNemar’s test produced a p-value of approximately 0.0153. However, to address concerns about statistical robustness with our sample size, we supplemented this primary analysis with multiple validation approaches as outlined in Section 4.8.
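The continuity-corrected statistic and its p-value can be reproduced directly from the paired counts (b = 3, c = 14) using the chi-square distribution with one degree of freedom:

```python
# Continuity-corrected McNemar test from paired discordant counts.
import math

def mcnemar(b, c):
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For 1 d.f., the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

chi2, p = mcnemar(3, 14)   # chi2 = 100/17, p close to 0.0153
```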
The cross-validation results are as follows:
  • 5-fold cross-validation mean F1-scores: Experimental Model = 0.925 ± 0.018, Control Model = 0.862 ± 0.022
  • The hybrid model showed consistent improvement across all folds (minimum ΔF1 = 0.051, maximum ΔF1 = 0.067)
The confidence intervals for the key metrics are as follows:
  • F1-score improvement: 0.060 [95% CI: 0.032, 0.088]
  • Accuracy improvement: 0.055 [95% CI: 0.025, 0.085]
  • Recall improvement: 0.050 [95% CI: 0.018, 0.082]
While the standard McNemar’s test produced χ2 = 5.88 (p = 0.0153), we verified this result with:
  • Bootstrapped p-value: 0.0168 [95% CI: 0.012, 0.024]
  • Exact binomial test: p = 0.0133
The consistency across these statistical approaches strengthens the evidence for Maia’s contribution, though we acknowledge the remaining uncertainty due to sample size constraints. As a result, the null hypothesis of equal model performance is rejected at the 5% significance level.
To further investigate the contribution of individual features, a feature ablation analysis was conducted. Table 4 presents the comparative results across different feature sets. Stockfish-derived CPL alone provided a solid baseline; however, the addition of Maia-based features (CPL and MMMP) substantially improved recall, capturing deviations from typical human decision patterns. The ∆S metric on its own demonstrated weaker predictive power, but when combined with Stockfish and Maia features, it enhanced overall robustness.
Table 4. Feature ablation study comparing different feature sets for discriminating engine-optimized play patterns (Black-side).
The complete hybrid model consistently achieved the highest performance across all evaluation metrics, confirming that deterministic indicators (engine-based) and behavioral signals (human-centered) operate as complementary sources of information.

5.2.4. Comprehensive Model Validation

To provide a complete picture of model performance and stability, we conducted additional analyses beyond the primary train-test split; the results are summarized in Table 5.
Table 5. Comprehensive Performance Comparison with Confidence Intervals.
The experimental model demonstrated superior stability across validation methods:
Lower variance in cross-validation results (σ² = 0.00032 vs. 0.00048 for control)
Tighter confidence intervals across all metrics
Consistent ranking in all 5 cross-validation folds
While the absolute performance metrics should be interpreted in the context of the synthetic dataset, the relative improvement and statistical consistency provide compelling evidence for the hybrid approach’s value.

6. Discussion

6.1. Key Findings

This proof-of-concept study demonstrates that integrating deterministic Stockfish evaluations with Maia’s human-centered stylometric modeling yields clear improvements in discriminating between engine-like and human-like play patterns in a controlled setting. Centipawn Loss (CPL) remains a strong marker of engine assistance, while Mismatch Move Match Probability (MMMP) captures stylistic deviations from human decision-making. When applied together, these features complement one another by reducing false positives and false negatives, leading to more robust predictions.
The addition of the Curvature-Based Stability (ΔS) metric further enriched the feature space by quantifying fluctuations in decision consistency across games. While ΔS is still conceptual in this version, preliminary results show alignment with behavioral stability, suggesting it may be a useful auxiliary signal. Overall, the hybrid approach provides preliminary support for the central hypothesis: combining behavioral and deterministic features shows promise for improving upon single-engine detection approaches in controlled settings.

6.2. Interpretation of Feature Patterns

Exploratory analysis of CPL and MMMP values revealed distinct class-separable behavior. In human-only games, Maia reported smoother and lower-scaled CPL values, reflecting its training on natural move distributions. Stockfish, in contrast, penalized these same moves more heavily, emphasizing objective accuracy. In engine-optimized play games, the relationship reversed: moves aligned closely with Stockfish’s evaluations but diverged sharply from Maia’s human-modeled predictions.
This divergence highlights the value of Maia’s features as behavioral baselines. MMMP distributions quantified volatility across search depths, showing that genuine human games followed more stable prediction curves, while engine-optimized play pattern games exhibited sharper deviations. Such stylistic contrasts cannot be fully captured by engine evaluations alone, reinforcing the importance of a dual-feature framework.
The empirical performance of the ΔS metric aligns with its theoretical foundation in cognitive decision-making. While its standalone predictive power was modest (F1 = 0.81), its contribution to the hybrid model demonstrates its value as a complementary signal. ΔS appears to capture higher-order reasoning patterns that are orthogonal to traditional accuracy metrics, specifically the stability of evaluation across computational depths, a characteristic that distinguishes human conceptual understanding from engine calculation.

6.3. Model Performance and Statistical Validation

The experimental model, trained on combined Stockfish and Maia-derived features, achieved superior performance compared to the control model. Validation accuracy reached 99.4% for the full model versus 93.9% for the control. Importantly, improvements were most pronounced on the Black-side task, where suspected cheating was concentrated.
Statistical testing confirmed this advantage was robust across multiple validation approaches. The primary McNemar’s test comparing predictions across the same test set yielded a p-value of 0.0153, which was corroborated by bootstrapping (p = 0.0168) and exact binomial testing (p = 0.0133). The 95% confidence interval for the F1-score improvement [0.032, 0.088] excludes zero, providing formal evidence that Maia-derived features significantly enhance classification accuracy.
Cross-validation further demonstrated the stability of this improvement, with the experimental model consistently outperforming the control across all five folds (mean ΔF1 = 0.063 ± 0.008). This consistency across multiple statistical approaches strengthens the evidence for the hybrid framework’s value.
Feature ablation further supported this conclusion: while Stockfish’s CPL provided a strong baseline, the addition of Maia’s CPL and MMMP markedly improved recall, and ΔS contributed incremental robustness. The combination consistently outperformed any individual feature.

6.4. Limitations and Constraints

This study, while demonstrating promising discriminative capabilities, operates within specific methodological boundaries that must be clearly acknowledged to properly contextualize its findings.
  • Controlled Experimental Design vs. Real-World Detection: The most significant boundary of this work is its reliance on algorithmically generated ‘suspicious’ labels rather than confirmed cases of cheating. Our dataset creates a clean separation between Maia-modeled human-like patterns and Stockfish-optimized patterns for controlled experimentation. Consequently, the performance metrics (F1 = 0.93) reflect pattern discrimination capability between behavioral archetypes in a synthetic environment, not validated cheating detection accuracy in competitive play. This distinction is fundamental: we have demonstrated that hybrid features can distinguish between two behavioral extremes, but not that they can identify actual misconduct involving sophisticated, hybrid human-engine interaction.
  • Structural Artifacts in Dataset Construction: All ‘suspicious’ instances were algorithmically associated with Black pieces, creating a trivial classification task for White-side games. While we mitigated this through color-specific modeling, this artificial imbalance limits conclusions about general discrimination capability and introduces the risk of model artifacts. A balanced benchmark with equal distribution across colors and game phases would provide a more robust evaluation.
  • Scale and Composition Limitations: The dataset size (1000 games) and composition impose constraints on statistical confidence. While our validation framework employed multiple robustness checks (cross-validation, bootstrapping, McNemar’s test), the confidence intervals remain relatively wide, reflecting uncertainty inherent in smaller samples. Additionally, the rating band (1900–2100 Lichess) may not generalize to other skill levels where behavioral patterns differ substantially.
  • Computational Scalability Considerations: The dual-engine evaluation pipeline (4.2 GPU-hours for 1000 games) presents practical challenges for real-time, large-scale deployment. While this is primarily an implementation concern rather than a methodological flaw, it highlights a trade-off between analytical depth and operational feasibility that future implementations must address.
  • Engine Version Specificity: Fixing Stockfish v15.0 and Maia-1900 ensures experimental consistency but limits immediate generalization to other engine versions or Maia models trained on different rating bands. Engine evaluations evolve with updates, and human behavioral models vary across skill levels.
  • Absence of Contextual Features: The framework currently analyzes moves in isolation without incorporating game context (time controls, tournament settings, player history) or temporal features (move timing patterns). Real-world discrimination frameworks typically integrate such contextual signals to reduce false positives and model authentic decision-making more holistically.
  • Statistical Power and Effect Size: While our statistical validation demonstrates significant improvement (p = 0.0153), the effect size and practical significance in real-world scenarios remain unvalidated. The controlled nature of our experiment likely amplifies discriminative differences compared to the noisier, more subtle patterns of actual cheating.
Positioning Statement: These constraints collectively emphasize that this work represents a methodological proof-of-concept establishing that hybrid features provide discriminative signals in controlled settings. The findings should be interpreted as evidence of analytical potential rather than operational capability. The essential next step—validation on adjudicated real-world cases—will determine whether this potential translates to practical utility for fair-play assurance systems.
To prevent misinterpretation of our results, Table 6 clearly distinguishes what this study demonstrates versus what requires future validation.
Table 6. Scope clarification: Methodological contribution vs. requirements for applied detection.
This clarification emphasizes that our contribution is methodological: we demonstrate that hybrid features provide discriminative signals in controlled settings. The essential next step—validation on adjudicated real-world cases—will determine whether this methodological advantage translates to practical utility.

6.5. Relation to Previous Research

Earlier work using simplified CNN architectures and datasets of blatant engine use showed improvements of over 15% when Maia features were included. This study extends that line of research by testing against subtler, mixed-move cheating patterns, offering a more realistic benchmark. Despite the greater difficulty, the experimental model still delivered statistically significant improvements. These results position Maia-derived stylometric features as a valuable complement to existing detection frameworks, bridging the gap between technical correctness and human-likeness in play.

7. Conclusions and Future Work

This research introduced and methodologically evaluated a proof-of-concept hybrid framework for discriminating between engine-like and human-like chess play patterns. We emphasize that this constitutes a methodological foundation for behavioral pattern analysis in chess, establishing that hybrid feature engineering can effectively distinguish between different gameplay archetypes in controlled settings. By integrating deterministic Stockfish evaluations with human-centered modeling from Maia, the framework captures both technical accuracy and stylistic deviations. The experimental results demonstrate that Maia-derived features significantly improve predictive performance on our synthetic benchmark, with the hybrid model achieving a macro F1-score of 0.93 compared to 0.87 for the Stockfish-only baseline.
The essential next step is transitioning from methodological proof-of-concept to validated detection capability. While our synthetic dataset provided necessary control for feature evaluation, validation on real-world adjudicated cases represents the critical pathway forward. To this end, we are pursuing access to Lichess’s publicly available moderation logs and API endpoints containing platform-adjudicated fair-play violations. This real-world dataset will enable direct testing of our framework’s generalization from synthetic pattern discrimination to actual misconduct detection. With access to adjudicated cases, we will implement comparative analysis against established baselines (Lichess Irwin, Regan’s statistical methods) on shared test data to establish practical utility.
Building upon our methodological foundation, future work will construct balanced benchmarks where anomalous patterns are equally distributed across White and Black pieces, integrate contextual signals like move timing and game context, and develop explainable AI components to help moderators understand classification decisions. Beyond chess, the proposed methodology of combining deterministic evaluation with behavioral stylometry offers a promising template for anomaly detection in other decision-based domains like education, finance, and cybersecurity, though similar validation pathways would be required in each application domain.
In summary, this work establishes a methodological framework demonstrating the discriminative value of hybrid features for chess gameplay analysis. The results provide a foundation for future research, with clear pathways outlined for validation, comparison, and enhancement toward practical fair-play assurance systems.

Author Contributions

Conceptualization, Z.K. and M.I.; methodology, M.I. and Z.K.; software, Z.K.; validation, M.I.; formal analysis, M.I. and Z.K.; investigation, Z.K. and M.I.; resources, Z.K.; data curation, M.I. and Z.K.; writing—original draft preparation, Z.K.; writing—review and editing, M.I.; visualization, Z.K.; supervision, M.I.; project administration, M.I. and Z.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

https://www.kaggle.com/datasets (accessed on 14 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
