1. Introduction
The strength of Large Language Models (LLMs) stems from their capacity to capture complex semantic relationships within massive datasets. While these models have achieved remarkable benchmarks across diverse domains [1], they remain compromised by persistent reliability issues, such as hallucinations [2]. This fragility is an expected byproduct of the nature of natural language: it is a dynamic, culturally dependent medium defined by polysemy and contextual nuance, where truth and syntactic validity are often decoupled. To isolate the mechanisms of model reasoning from these linguistic ambiguities, this study pivots from natural language to the structured domain of strategic decision-making. By analyzing LLMs within formal, rule-bound environments, we can more effectively interrogate the formation and editability of their internal world models.
The choice of chess as a medium for this exploration is deliberate. Chess, with its well-defined rules, total observability, and large (but finite) set of discrete states, provides a structured framework to probe the inner workings of an AI system. Applying language models to chess allows us to observe how these models navigate a domain characterized by strict logical rules and objective outcomes, despite being designed to process natural language. This juxtaposition of a language model’s inherent capabilities with the structured environment of chess offers a powerful lens through which we examine the model’s decision-making process, its approach to problem-solving, and its ability to track the world state. The goal is to move beyond treating the language model as a “black box” and towards a more detailed understanding that can explain why the model makes decisions based on the internal mechanics of its operations.
In this work, we construct a 12-layer chess-playing Generative Pre-trained Transformer (GPT) and train probing classifiers to classify piece and color for each square on a chess board based on the hidden state activations from the residual stream for each layer. Chess serves as a valuable testbed because it provides access to a fully observable, objective world state—an advantage that is difficult or sometimes impossible to obtain in natural language domains. This allows us to rigorously evaluate whether a representation edit is correct, not merely plausible. Although UCI transcripts constitute a highly structured micro-language with a constrained vocabulary and rigid syntax, the underlying model remains a sequence-to-sequence transformer, and many of the mechanisms we analyze—the accumulation of state across layers, decomposition of the residual stream, editing interventions—are not specific to chess.
This article makes several research contributions to the field of mechanistic interpretability. First, our work has generated several valuable tools for studying chess-playing transformers. We developed a chess-playing GPT that improves upon the legal move rate of previous models. Our statistical analysis shows that the model’s residual stream is path independent, confirming the model’s Markovian representation and enabling time-invariant editing for identical board positions. We also introduce a new metric, legal move probability mass (LMPM), to quantitatively assess the performance of edit interventions. The source code, GPT model, training data, probe classifiers, and SAEs are available online at https://github.com/austinleedavis/icmla-2024 (accessed on 20 February 2026).
Second, this article presents a comprehensive study of the editing performance of different probes when modifying the board state. Our results show that linear probes trained on the original residual stream decisively outperform probes based on Sparse Autoencoders. Our linear probes have the benefit of being compositionally interpretable; the weights of a probe trained to jointly classify piece type and color are approximated by the sum of separate probes for type and color.
This article extends research that was previously published in [3].
Section 2 provides an overview of related work in the area. Our methodology is described in Section 3, and we present our results in Section 4. Our article concludes with a summary of our findings (Section 5).
3. Method
This section describes the methods and processes used to conduct our representation editing experiments. In Section 3.1, we introduce the GPT we use throughout the remainder of the paper. Section 3.2 describes the Sparse Autoencoders trained on the hidden state vectors of our GPT. Section 3.3 describes our linear probe classifiers, and Section 3.4 describes the two SAE-based probes. Section 3.5 explains the technique for applying intervention vectors to the hidden state of the GPT during a forward pass. Section 3.6 outlines our experimental design for performing interventions on the GPT with our probe classifiers, including the positions and values that are written to the hidden state of the GPT. Section 3.6 also formally defines the metric used to grade the semantic validity of the outputs post-intervention. See Appendix B for additional background on our GPT, including tokenization and training parameters.
3.1. The Chess-Playing GPT
Since our focus was to study language models in a more controlled setting, we trained a 12-layer GPT-2 exclusively on tokenized move sequences rather than using a purpose-built model like AlphaZero [35] or Leela Chess Zero [36] or NLP models such as OpenAI’s original GPT-2 [37] or ChessGPT [38]. In particular, our model was given no a priori knowledge about the game of chess; it was trained only to perform next-token prediction on the input move sequences.
The training dataset was collected from the Lichess.org Open Database, specifically using the 94 million games played in June 2023. A holdout set of the 120,000 games played in 2013 was used for evaluation. These datasets include all rated bullet (1 min) and blitz (3 to 5 min) games and tournaments played by human and bot players with Elo ratings ranging from 700 to over 3000. The training and evaluation sets were deduplicated to ensure there was no cross-over, and the raw game records were converted to Universal Chess Interface (UCI) notation using pgn-extract [39].
3.2. Sparse Autoencoders
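Sparse Autoencoders (SAEs) seek to disentangle the polysemantic activations in the hidden state vectors of our GPT into monosemantic features, enabling more interpretable reading and more precise control of the model’s internal world model. The prototypical SAE consists of a parameterized single-layer autoencoder $(f, \hat{x})$ defined by (notation follows [23])

$$f(x) = \mathrm{ReLU}\left(W_{\mathrm{enc}}(x - b_{\mathrm{dec}}) + b_{\mathrm{enc}}\right), \qquad \hat{x}(f) = W_{\mathrm{dec}} f + b_{\mathrm{dec}}.$$

Training consists of reconstructing a large dataset of model activations and enforcing sparsity with a $\lambda$-scaled $L_1$ loss penalty as in

$$\mathcal{L}(x) = \lVert x - \hat{x}(f(x)) \rVert_2^2 + \lambda \lVert f(x) \rVert_1.$$

However, this loss function introduces a shrinkage bias toward reconstructions with smaller norms so that for a decoder with fixed weights, SAEs will sacrifice reconstruction accuracy in a trade-off to reduce the $L_1$ loss, even when perfect reconstruction is possible. Instead, we use a Gated SAE architecture [23], which solves the shrinkage issue by including a gated encoder:

$$\tilde{f}(x) = \mathbb{1}\left[\pi_{\mathrm{gate}}(x) > 0\right] \odot \mathrm{ReLU}\left(W_{\mathrm{mag}}(x - b_{\mathrm{dec}}) + b_{\mathrm{mag}}\right),$$

where

$$\pi_{\mathrm{gate}}(x) = W_{\mathrm{gate}}(x - b_{\mathrm{dec}}) + b_{\mathrm{gate}}.$$

Here, $\mathbb{1}[\cdot > 0]$ is the (pointwise) Heaviside step function and ⊙ denotes elementwise multiplication. During training, we update the loss function by restricting the sparsity loss to the gated activations ($\mathrm{ReLU}(\pi_{\mathrm{gate}}(x))$) and including an auxiliary term $\mathcal{L}_{\mathrm{aux}}$ to allow gradients to backpropagate to $W_{\mathrm{gate}}$ and $b_{\mathrm{gate}}$:

$$\mathcal{L}_{\mathrm{gated}}(x) = \lVert x - \hat{x}(\tilde{f}(x)) \rVert_2^2 + \lambda \lVert \mathrm{ReLU}(\pi_{\mathrm{gate}}(x)) \rVert_1 + \mathcal{L}_{\mathrm{aux}}(x), \qquad \mathcal{L}_{\mathrm{aux}}(x) = \lVert x - \hat{x}_{\mathrm{frozen}}\left(\mathrm{ReLU}(\pi_{\mathrm{gate}}(x))\right) \rVert_2^2,$$

where $\hat{x}_{\mathrm{frozen}}$ is a copy of the decoder through which no gradients flow.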
We trained one gated SAE for each layer of the GPT. Each SAE was trained independently on residual stream activations extracted from a fixed layer of the GPT model. The input to each SAE was the residual stream vector at that layer, with dimensionality 768, and the latent dimensionality was set to 12,288, a 16× expansion ratio.
This expansion factor was selected after comparing SAEs at four expansion ratios up to 16×. Lower ratios saturated their capacity by early layers and produced denser codes, higher L1 penalties, and noticeably poorer reconstruction performance. In contrast, the 16× model retained substantial unused capacity in shallow layers yet delivered consistently better reconstruction fidelity and reduced model degradation, particularly in deeper layers where additional dimensionality proved beneficial.
The training dataset was created following a similar procedure as used to create the training dataset for the GPT itself; we trained the SAEs on 1B tokens from the 100M chess games from the January 2023 shard of the Lichess.org Open Database. We adopted a streaming setup in which token sequences were batched and encoded into hidden states on-the-fly using the frozen GPT model. Residual stream activations at the target layer were extracted and used as inputs to the SAE.
All models were trained using the Adam optimizer with , , and a linear warmup followed by cosine decay schedule. Training was conducted for a fixed number of steps sufficient to ensure convergence as monitored using reconstruction loss and explained variance metrics.
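The gated encoder described in this section can be illustrated with a small forward pass. The following is a minimal pure-Python sketch, not the released implementation: the function names (`gated_sae_encode`, `sae_decode`) and the list-of-lists weight layout are assumptions made for illustration, and the gating rule follows the standard Gated SAE formulation.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_sae_encode(x, W_gate, b_gate, W_mag, b_mag, b_dec):
    """Gated encoder: a binary gate decides WHICH latents fire,
    a separate ReLU path decides HOW STRONGLY they fire."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    # Gate pre-activations; only their sign is used in the forward pass.
    pi_gate = [p + b for p, b in zip(matvec(W_gate, centered), b_gate)]
    # Magnitude path (not subject to the sparsity penalty).
    mag = [max(0.0, m + b) for m, b in zip(matvec(W_mag, centered), b_mag)]
    # Heaviside gate applied elementwise to the magnitudes.
    return [m if g > 0 else 0.0 for g, m in zip(pi_gate, mag)]


def sae_decode(f, W_dec, b_dec):
    """Linear decoder; W_dec has one row per residual-stream dimension."""
    return [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
```

Because the sparsity penalty acts only on the gate path, the magnitude path is free to take whatever value gives the best reconstruction, which is the property that removes the shrinkage bias.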
3.3. Linear Probe Classifiers
Probe classifiers [17,20] map the hidden state of the neural network to some relevant feature of the input and have become a common tool used by the interpretability community. Linear probes are favored because they have very low representational power and can only represent linear relationships. So, a linear probe can only predict a non-linear feature of the inputs if the model first transforms it into a linear representation within its activations [29].
We trained a linear model, our “Probe”, to classify board position from the GPT’s intermediate hidden states. Board position refers to the arrangement of all chess pieces on the board at a specific moment in the game. As is customary, we denote piece type using a single letter (p-pawn, n-knight, b-bishop, r-rook, q-queen, k-king) and piece color using capitalization (upper-case for white and lower-case for black). We use the Ø symbol when a square is unoccupied (empty). Thus, when token $t$ is being processed by our GPT, the board position is fully specified by the list $B_t = (s_1, \ldots, s_{64})$, where $s_j \in \{\text{P, N, B, R, Q, K, p, n, b, r, q, k, Ø}\}$ for $j = 1, \ldots, 64$.
The probe training data is a set $\mathcal{D}$ of hidden state vectors cached from forward passes of the GPT. We constructed this dataset by caching the hidden state $h_i^{\ell}$ at the final five token indices $i$ of each game trace and each layer ℓ. We chose five to ensure we could sample hidden states across each phase of a complete chess move (see Appendix A for details on move phases). Since game-trace lengths vary and are approximately normally distributed, this approach also allowed us to sample hidden states from varying depths in the game tree in proportion to the number of times that depth is reached. The probe was trained for fifteen epochs over the resulting cache of hidden states.
3.4. SAE-Based Probes
We considered two techniques to generate hidden state probes based on the SAEs. The first technique (Figure 1, top) trains a linear probe classifier on the SAE latents $\tilde{f}(x)$, i.e., the output of the SAE encoder, for $x$ drawn from an activation dataset $\mathcal{D}$. Once training is complete, the weights of this probe are decoded by the SAE so their shape matches the residual stream of the GPT. This SAE-latent linear probe serves as a direct test of whether SAE features can be mapped to board-state variables via supervised readout, i.e., whether the SAE latents preserve sufficient semantic coherence to support precise edits. Our results indicate that this assumption does not hold.
The second technique (Figure 1, bottom) uses the well-established contrastive difference approach [26], a common baseline for constructing SAE-based intervention vectors. Activations in $\mathcal{D}$ are first partitioned into two sets, $\mathcal{D}_p^{+}$ where property $p$ is active (true) and $\mathcal{D}_p^{-}$ where property $p$ is inactive (false). The contrastive difference probe at layer ℓ for property $p$ has weight matrix $W_p^{\ell}$ defined as the difference of means between the feature representations of the active and inactive partitions:

$$W_p^{\ell} = \frac{1}{|\mathcal{D}_p^{+}|} \sum_{x \in \mathcal{D}_p^{+}} \tilde{f}(x) \; - \; \frac{1}{|\mathcal{D}_p^{-}|} \sum_{x \in \mathcal{D}_p^{-}} \tilde{f}(x).$$
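The difference-of-means construction can be sketched in a few lines. This is an illustrative pure-Python version under stated assumptions: `latents` is a list of SAE feature vectors, `labels` marks whether the property of interest is active for each one, and the function names are hypothetical rather than taken from the released code.

```python
def mean_vector(vectors):
    """Elementwise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def contrastive_difference(latents, labels):
    """Difference of means between latents where the property is active
    and latents where it is inactive."""
    active = [f for f, y in zip(latents, labels) if y]
    inactive = [f for f, y in zip(latents, labels) if not y]
    if not active or not inactive:
        raise ValueError("both partitions must contain at least one sample")
    mu_active = mean_vector(active)
    mu_inactive = mean_vector(inactive)
    return [a - b for a, b in zip(mu_active, mu_inactive)]
```

No parameters are fitted here, which is why this construction is commonly used as a baseline: the resulting direction depends only on how the two partitions are chosen.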
3.5. Editing the Hidden State of the GPT
If one considers a probe to be a decoder of the hidden state, our interventions aim to reverse the process, writing state information back to the GPT latent space in a way that causally affects the GPT output. We accomplish this by adding an intervention vector to the hidden state during a forward pass using the NNsight [40] library, although several other libraries (e.g., [41,42]) support the necessary hooks into the forward pass.
Regardless of how the intervention vectors are chosen, the process for applying the intervention remains the same. First, we select an intervention vector $v_i^{\ell}$ for each position $i$ and each layer ℓ. (Our results discuss how the choice of $i$ and ℓ affects the intervention.) Then, we modify the GPT hidden state $h_i^{\ell}$ at each intervention position $i$ and layer ℓ to be as follows:

$$\tilde{h}_i^{\ell} = h_i^{\ell} + \alpha \, \frac{\lVert \mathrm{LN}(h_i^{\ell}) \rVert_2}{\lVert v_i^{\ell} \rVert_2} \, v_i^{\ell}, \qquad (1)$$

where $\alpha$ is a scalar that controls the strength of the intervention and LN denotes LayerNorm. The denominator of Equation (1) scales the intervention vector to unit length, and the numerator scales it back up to match the magnitude of the hidden state vector post-LayerNorm. We set $\alpha$ based on empirical tests (Figure 2), where smaller values caused the interventions to have little or no effect on the output, and substantially larger values caused the GPT to output nonsense.
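The rescaling step described above can be sketched independently of any hooking library. In the following pure-Python illustration, the additive form, the parameter name `alpha`, and the use of a LayerNorm without learned gain or bias are all assumptions; in practice the edit would be applied inside a forward-pass hook.

```python
import math


def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def layer_norm(h, eps=1e-5):
    """LayerNorm without learned gain/bias (an assumption of this sketch)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def apply_intervention(h, v, alpha):
    """Add v to h after rescaling v to alpha times the post-LayerNorm
    magnitude of h, so the edit size does not depend on the length of v."""
    v_norm = l2_norm(v)
    if v_norm == 0.0:
        raise ValueError("intervention vector must be non-zero")
    scale = alpha * l2_norm(layer_norm(h)) / v_norm
    return [hi + scale * vi for hi, vi in zip(h, v)]
```

Normalizing by the length of the intervention vector is what makes comparisons across intervention types fair: a random vector and a probe direction are written into the hidden state at exactly the same magnitude.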
3.6. Experimental Design
To measure the efficacy of our probe-based interventions, we perform two forward passes with the GPT for each game trace so that we can compare the outputs pre- and post-intervention. We call the first forward pass baseline since we only read values from the GPT during this pass. We call the second forward pass intervention because that is the pass where we perform the interventions.
The baseline forward pass allows us to construct many intervention vectors and generate data we use as a control for our experiments. During this pass, we cache the following values for each token position to capture data from the games’ opening and mid-game moves:
the GPT’s output for the sub-game up to that token,
the hidden state vectors six tokens beyond that token,
the board square selected by the model from the sub-game up to that token,
the piece type and piece color positioned on that square.
We consider five types of intervention vectors when intervening on a hidden state:
Random: a normally distributed random vector.
Patch: the hidden state vector from the baseline pass, offset by six tokens.
Probe: the negation of the column vector of the Probe weight matrix indexed by the piece on the selected square.
SAE-Probe: as for Probe, where the probe is trained on the hidden states encoded by the SAE.
CD-Probe: as for Probe, where the weights are taken from the SAE-based contrastive difference (CD) probe.
With the cache from the baseline pass, we create a test case for each subgame trace
for all
. In each test case, we consider three board positions. For notation purposes, we denote by
the original board position that comes directly from the sequence of moves in the input trace up to token
i. We denote by
the target board position that occurs when piece
is removed from square
on the original board. And we denote by
the future board position that occurs six tokens after
in the game trace. To ensure all board positions are valid, we discard test cases in the rare instances where either
is a non-square token (e.g.,
eos) or
or else when
. An example of these three boards for a single game trace is shown in
Figure 3.
During the second (intervention) forward pass, we perform the interventions and calculate the resulting legal move probability mass (LMPM), which measures the proportion of the GPT’s probability mass that is allotted to valid tokens.
Formally, let $V$ be the set of all $|V|$ words in the vocabulary. Given a board position $B$, $V$ can be partitioned into two sets, $V_B^{+}$ and $V_B^{-}$, such that $V_B^{+} \cup V_B^{-} = V$ and $V_B^{+} \cap V_B^{-} = \emptyset$. Let $V_B^{+}$ be the set of grammatically correct tokens that, if selected, form part of a move that is legal according to the board position $B$. Then

$$\mathrm{LMPM}(B) = \sum_{w \in V_B^{+}} P(w)$$

is the legal move probability mass for board position $B$, where $P(w)$ is the probability the GPT assigns to token $w$.
Notably, the legal move probability mass depends on the board $B$. During our experiments, we grade all interventions according to the original board. Furthermore, we evaluate the probe-based intervention against the target board and the patch intervention against the future board.
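As a concrete illustration of the metric, the sketch below computes the legal move probability mass from raw logits. It assumes the set of legal tokens for the board has already been determined (for example, with a chess library); the function and argument names are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def legal_move_probability_mass(logits, vocab, legal_tokens):
    """Sum the probability assigned to tokens that are legal on the board.

    logits       -- one logit per vocabulary entry
    vocab        -- token strings aligned with logits
    legal_tokens -- set of tokens that continue a legal move on the board
    """
    probs = softmax(logits)
    return sum(p for token, p in zip(vocab, probs) if token in legal_tokens)
```

The same logits can be graded against different boards simply by passing a different `legal_tokens` set, which is how one output distribution is scored against the original, target, and future positions.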
4. Results
Here we present the results from our analysis of the GPT and Probe interventions. Section 4.1 describes the performance of our GPT in making legal chess moves. Section 4.2 quantifies the reconstruction error of our Sparse Autoencoders. Section 4.3 and Section 4.4 present the results from our experiments on the classification performance of SAE-based and linear probes, respectively. In Section 4.5, we show that our GPT hidden state vectors encode a Markovian world model of the board position. Section 4.6 is a brief case study on the effects of a single intervention; we establish a causal link between the linear representations learned by our probe and the output of the model. Section 4.7 and Section 4.8 examine intervention success across nearly 500k samples as measured by legal move probability mass, with the aim of understanding how well the post-intervention output obeys the rules of chess. We compare the intervention performance of probes based on the original activations vs. probes trained on SAE latents.
4.1. Chess GPT Evaluation
The performance of our chess-playing GPT is compared to three other models in Table 2. Since LCB [30] uses a compatible tokenization scheme, a comparison of the next-token prediction metrics (e.g., perplexity) is valid. However, the two GPTs from [31] use a single-character tokenization scheme and a different notation style, so we report only the legal move rates for these models.
We found that our GPT recommended legal moves 99.9% of the time across the evaluation dataset. We investigated the instances where illegal moves were recommended. The preponderance of these errors occurred immediately after pawn promotions. We believe this is due to the rarity of pawn promotions. A pawn promotion occurs in only one third of all games, and when it does, it is represented by a single token; on average a promotion token occurs approximately every one-thousand tokens. This is similar to the challenge of handling rare words in the NLP context. Still, the overall performance of our GPT was suitable for our experiments.
4.2. Sparse Autoencoder Reconstruction Performance
We evaluated each trained SAE using several key metrics: the cross-entropy loss of the GPT model with and without replacing the residual stream with SAE outputs, two sparsity measures ($L_0$ and $L_1$ norms), and the reconstruction quality of the SAE according to the Mean Squared Error (MSE) and the cosine similarity versus the input. These metrics are summarized in Table 3.
4.3. Sparse Autoencoder Probe Classification Performance
SAE-based probe classification performance is summarized in Table 4 (all metrics use weighted averaging across the 13 classes). The SAE-based probes performed best on layer 10, slightly later than the linear probes, which were trained directly from the hidden state of the GPT. Unfortunately, the SAE-based probes performed much worse than the linear probes in classifying the board state. At best, they only matched the board reconstruction performance of the layer-three linear probe, but in most cases, they performed worse than the linear probe trained exclusively on the GPT token and positional embeddings.
4.4. Linear Probe Classification Performance
The layer-by-layer performance of the linear probe classifier is shown in Table 5, where metrics indicate probe performance on the task of classifying board position from the hidden state vectors.
Classification performance peaks between layers seven and eight; this will become relevant later. The high values indicate the model successfully transforms the non-linear inputs into a linear representation of the board state, and simple linear models can decode these representations reliably. Still, high classification performance does not establish a causal link between these representations and the output of the model.
4.5. Markovian World Model
We examined whether our GPT learned to represent the arrangement of pieces on the board in a way that respects the Markovian property of chess. Formally, let $B \in \mathcal{B}$ be a board position, $u \in U$ a UCI string, $h^{\ell}: U \to \mathbb{R}^{d}$ the hidden state from layer ℓ, and Board: $U \to \mathcal{B}$ the function that maps a UCI string to its resulting board state. We define an equivalence relation ∼ over $U$ by

$$u \sim u' \iff \mathrm{Board}(u) = \mathrm{Board}(u').$$

Then the quotient set $U/{\sim}$ partitions $U$ into equivalence classes, where each class $[u]$ corresponds to a unique board position.
We test the similarity of the GPT’s hidden states for differing move sequences from the same equivalence class. Formally, the hidden state for layer ℓ respects the Markovian property if, for all $u, u' \in U$:

$$u \sim u' \implies h^{\ell}(u) \approx h^{\ell}(u').$$
We computed the board FEN representation for the final tokens from 200k chess move sequences and selected the classes where
. We constructed a dataset
, where each
is the set of terminal hidden state vectors when the sequences
are passed through the GPT.
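Grouping move sequences by the board they reach can be sketched as follows. This is a deliberately simplified, pure-Python stand-in for a full FEN computation: it only relocates pieces from source to destination square and ignores castling, en passant, and promotion, so it is suitable for illustrating transpositions but not for real game records. All function names are hypothetical.

```python
from collections import defaultdict


def initial_board():
    """Standard starting position as a {square: piece} dictionary."""
    board = {}
    for file, piece in zip("abcdefgh", "rnbqkbnr"):
        board[file + "1"] = piece.upper()  # white back rank
        board[file + "2"] = "P"
        board[file + "7"] = "p"
        board[file + "8"] = piece          # black back rank
    return board


def board_key(uci_string):
    """Hashable summary of the position reached by a UCI move sequence.
    Simplification: pieces are just relocated (no castling, en passant,
    or promotion handling)."""
    board = initial_board()
    moves = uci_string.split()
    for move in moves:
        src, dst = move[:2], move[2:4]
        board[dst] = board.pop(src)
    side_to_move = "w" if len(moves) % 2 == 0 else "b"
    return (tuple(sorted(board.items())), side_to_move)


def equivalence_classes(uci_strings):
    """Group move sequences that arrive at the same position."""
    classes = defaultdict(list)
    for uci in uci_strings:
        classes[board_key(uci)].append(uci)
    return list(classes.values())
```

For example, `g1f3 g8f6 b1c3` and `b1c3 g8f6 g1f3` fall into the same class because the two knight moves are played in a different order but produce the same position.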
We then performed a permutation test with 1000 bootstrap samples to assess cosine similarity (alignment) within the sets
compared to the global population for each layer
ℓ. Specifically, assuming the cosine similarity across
follows a probability distribution
, and the cosine similarity from the general population of hidden state vectors follows a probability distribution
, we tested
For each layer, we computed the test statistic:
where there are
n equivalence classes and
is the expected cosine similarity of the wider population at layer
ℓ. The empirical
p-value
p is based on the proportion of bootstraps
. The results of our tests are summarized in
Table 6.
All layers yielded an empirical p-value of , indicating a statistically significant increase in internal alignment of the equivalence classes compared to what would be expected by chance. If we apply the Holm correction for multiple comparison test, that is, we sort the p-values and multiply them by their rank to get the new p-value, the largest new p value would be , meaning our results are robust to p-hacking. The difference between and the mean of the null distribution grows consistently with depth, and the signal becomes increasingly distinguishable as the variance of the null distribution shrinks. This supports the hypothesis that deeper layers in the model more strongly encode semantic consistency via directional alignment in representation space.
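For reference, the standard Holm step-down adjustment can be sketched as below. In its usual form, the k-th smallest of m p-values is multiplied by (m − k + 1), the result is capped at 1, and the adjusted values are made non-decreasing in rank order. The function name is illustrative, and this sketch is not taken from the released code.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # rank is 0-based, so the multiplier is m, m-1, ..., 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below
        # the adjusted value of a smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

When every test yields the same raw p-value, the largest adjusted value is simply m times that p-value, which gives a quick upper bound for a per-layer analysis of this kind.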
To quantify the strength of alignment in equivalence classes compared to the general population of representations, we computed Cohen’s $d$ at each layer of the model. Cohen’s $d$ is a standardized effect size defined as

$$d = \frac{\mu_{\mathrm{eq}} - \mu_{\mathrm{pop}}}{\sigma_{\mathrm{pop}}},$$

where $\mu_{\mathrm{eq}}$ is the mean cosine similarity within equivalence classes, $\mu_{\mathrm{pop}}$ is the average cosine similarity between random vector pairs from the general population, and $\sigma_{\mathrm{pop}}$ is the standard deviation of those population similarities. This metric measures how many standard deviations the semantic set similarities exceed the baseline alignment expected by chance.
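The effect size above can be computed directly from two lists of cosine similarities. The sketch below is illustrative: the function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cohens_d(within_class_sims, population_sims):
    """Standardized gap between within-class and population similarity,
    measured in units of the population standard deviation."""
    gap = mean(within_class_sims) - mean(population_sims)
    return gap / pstdev(population_sims)
```

Because the denominator is the spread of the population similarities, the effect size can grow with depth either because within-class alignment increases or because the population similarities become more tightly concentrated.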
As shown in the table of results, we observe a progressive increase in effect size across layers. In early layers, d values are modest (∼0.5 to ), suggesting weak to moderate differences. However, as we move deeper into the model, the effect size grows dramatically—surpassing in some of the highest layers. This indicates that representations of semantically grouped inputs become increasingly aligned in later layers.
4.6. Example Case Study
We present a simple demonstration that probe-based interventions can stimulate the GPT to produce legal moves from a target board even when multiple pieces and squares are intervened simultaneously (see Figure 4). From the move sequence e2e4 d7d5 (the Scandinavian Defense), our GPT recommends that white’s pawn on e4 capture black’s pawn on d5 with 75.8% of the probability mass. We construct an intervention vector
from the
pth columns of the probe weight matrix for several squares to swap the positions of black pawns, ensuring that white’s
e4 pawn has no legal moves:
Post-intervention, the model recommends moving the d2 pawn with 76% of the probability mass, and the e4 pawn receives a negligible amount of the probability mass. Heatmaps of the pre-intervention and post-intervention outputs are shown at the bottom of Figure 4.
4.7. Move Validity Post-Intervention
We sample the output logits from 498,058 interventions to generate four sets of output logits: the “Control” (aka the “None” intervention), “Probed”, “Randomized”, and “Patched”. We apply softmax to the logits in each sample to get $P$, a probability distribution over the model vocabulary, and then compute the legal move probability mass (LMPM) according to the original board position. Furthermore, for the samples from the probe-based interventions, we compute the LMPM according to the target board position, and for the patch-based interventions, we compute the LMPM according to the future board position. The distribution of the results is shown in Figure 5.
Observe the “None” column in Figure 5, which indicates our GPT assigns 0.9969 mass to legal moves ($\sigma$ = 0.016) across all intervention positions. One significant finding is how precise the Probed intervention technique is. Specifically, notice that adding random noise in the Randomized intervention barely changes the LMPM versus the None intervention. In fact, the argmax of the None logits matches that of the Randomized logits 98.75% of the time. This is significant because the Randomized intervention vector is scaled to be exactly the same length as the Probe intervention vector. In contrast, the argmax of the Probed intervention logits matches that of the None logits only 0.05% of the time. So, with the same magnitude of intervention, we see that the Probe intervention vector has an outsized impact on the model’s performance.
When we evaluate the Probe intervention logits against our target board position, i.e., the state of the board we are trying to force the model to consider, we see the LMPM is densely concentrated toward the top of the chart (
,
). In contrast, the Patch intervention has a detrimental effect on model performance, despite the intervention vector itself being a valid representation of the hidden state from only a few moves later in the game trace. When graded versus the original board position, the mean LMPM is 0.550 (
), and when graded versus the future board position, the mean LMPM is only slightly higher at 0.620 (
) (see the right-most panel in
Figure 5). This shows that at an equivalent scale (
), the patch intervention disrupts the ability of GPT to generate legal moves according to both the original board and the future board.
These results indicate that our linear probing technique is both effective and precise. Whereas the random interventions contain too little latent information and the patch interventions contain too much, the probing interventions are the Goldilocks of the three, steering the model toward the desired state without inhibiting the model’s ability to produce legal move sequences.
4.8. Move Validity Post-Intervention with SAEs
Following the procedure in Section 4.7, we sample output logits from an additional 403,848 interventions to evaluate our SAE-based probes. The output logits are partitioned based on the intervention applied: “None” is the control from the baseline forward pass of the GPT, “Probed_CD” uses the weights of the SAE-based contrastive difference probe, and “Probed_SAE” uses the weights of the linear probe trained on the SAE latent vectors. We apply softmax to the logits and report the LMPM according to the board position, either the original state or the target state. These results are summarized in Figure 6.
The GPT achieves a mean LMPM of 99.7% ($\sigma$ = 0.01) according to the original board state (Figure 6, left). As a baseline for comparison, we grade the control “None” intervention against the target board state; it achieved a mean LMPM of 41.0% ($\sigma$ = 0.22) in this setting (blue violin in Figure 6, right). This is to be expected. On average, the GPT allocates approximately 60% of the probability mass to its most-favored move. When the corresponding piece is removed, the entirety of that allocation is invalidated. The mass that remains valid does so only because there are multiple legal moves available in the overwhelming majority of board positions, and the GPT allocates some probability mass to those alternative moves.
Unfortunately, the SAE-based probes are only slightly better. A one-sided Welch’s t-test quantifies the standardized difference between the means of two groups. In this case, the test produces a p-value of for the SAE-based contrastive difference probe versus the control and a p-value of for the SAE-based linear probe versus the control. Although these small p-values allow us to conclude that the increase in LMPM by the SAE-based probes is statistically significant, in practice the improvement is too small to meaningfully alter the model’s behavior or justify the use of these SAE-based probes for residual stream editing.
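Welch’s t-test compares the means of two groups without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The one-sided p-value uses a normal approximation to the t distribution, which is an assumption of this sketch that is reasonable only for large samples such as the ones used here; the function names are illustrative.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for the hypothesis mean(sample_a) > mean(sample_b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of each group mean (sample variances).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


def one_sided_p_normal_approx(t):
    """Upper-tail p-value, approximating the t distribution by a
    standard normal (adequate only when df is large)."""
    return 1.0 - NormalDist().cdf(t)
```

With hundreds of thousands of samples per group, even a very small difference in means produces a tiny p-value, which is why the effect size matters more than significance when judging whether an intervention is practically useful.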
The poor performance of the SAE is likely explained not primarily by information loss from sparsity but by the fact that SAEs are trained in an unsupervised fashion and therefore lack any mechanism for enforcing an ontologically coherent decomposition of the model’s latent space. During training, the SAE optimizes reconstruction quality under a sparsity penalty, but this objective does not necessarily align its learned features with domain-specific structure, in this case the highly constrained ontology of chess piece identities, colors, and board locations. As a result, the SAE often produces features that combine semantically unrelated components whenever such combinations reduce reconstruction error. A useful analogy is that the SAE behaves less like a careful disassembly of the model’s internal representations and more like an indiscriminate fragmentation process: it breaks the residual stream into parts that are convenient for minimizing its training objective, not parts that correspond to meaningful chess concepts. Because the SAE is not guided toward preserving the modularity inherent in chess (e.g., piece-type distinctions, spatial structure, and turn order), we believe the resulting decomposition is poorly aligned with the ontology needed for effective representation editing.
5. Conclusions
This article advances the study of emergent representations in transformers through several key contributions. We release a chess-playing GPT model that improves upon the legal move rate of the LCB baseline and evaluate whether the hidden state representations honor the Markovian property of chess. We introduce a new evaluation metric—legal move probability mass (LMPM)—and use it to assess the intervention performance of probe classifiers trained on the original residual stream vs. ones trained on Sparse Autoencoder latents. The SAEs failed to reliably reconstruct board positions or support accurate causal interventions, suggesting that they did not capture the relevant emergent features of the board state. Although SAEs may still hold value in certain interpretability contexts, our findings add to a growing body of evidence suggesting that, in some settings, they are no better—and sometimes markedly worse—than simpler techniques.
This pattern is consistent with some of the prior work in the area. In [26], SAEs trained on chess and Othello models capture some board-state information but still lag behind linear probes on supervised board reconstruction metrics, and ref. [34] finds that SAE-based probes rarely outperform strong baselines trained directly on activations, winning on only a small subset of their 113 probing datasets. Moreover, ref. [34] argues that some reported SAE gains in the literature may stem from weaker or mismatched baselines, for example, max-pooling activations across tokens in ways that hurt baseline performance, rather than from an intrinsic advantage of SAEs themselves. Taken together, these results suggest that for practitioners choosing between complex SAE pipelines and simpler linear probes, the latter will often provide comparable or superior performance at lower implementation and computational cost, unless there is a clear task-specific justification for the added complexity of SAEs.
A significant constraint of probe classifiers is their reliance on a priori concept definition. While selecting target features is trivial in bounded domains like chess—where game rules dictate clear concepts—it becomes increasingly difficult in the open-ended landscape of natural language, where the space of plausible concepts is vast. Consequently, our future research will focus on developing unsupervised methods that can automatically discover relevant concepts using external data, removing the bottleneck of manual selection.