Article

On Memorization and Generalization in Compact Transformers

Department of Advanced Computing Sciences, Maastricht University, 6211 LK Maastricht, The Netherlands
* Author to whom correspondence should be addressed.
Electronics 2026, 15(9), 1847; https://doi.org/10.3390/electronics15091847
Submission received: 9 February 2026 / Revised: 13 April 2026 / Accepted: 21 April 2026 / Published: 27 April 2026
(This article belongs to the Special Issue The Future of LLM Architectures)

Abstract

Large language models (LLMs) seem to demonstrate human-like understanding and generalization of language content. These capabilities arise from the models' ability to memorize and generalize the training content. In this paper, we review the recent literature and theories on the mechanisms in self-attention neural networks. We also report three computational experiments showing that memorization capacity in compact transformers can be empirically linked to architectural parameters, that structured domain knowledge can be retained in small decoder-only models, and that in-context abstraction requires sufficient architectural depth. These findings suggest that the current models are larger than necessary for many specific applications, especially in on-edge use cases. A better understanding of application requirements and architecture details can be expected to help in building new LLM architectures that can be efficiently implemented on dedicated on-edge circuits.

1. Introduction

Transformer architectures have recently achieved remarkable results in a wide range of applications, including natural language processing [1,2], speech recognition [3], and image processing [4]. The characteristic feature of the transformer model [5] is the self-attention circuit [6]. Essentially, it learns to compute weighted sums of input vectors based on the input vectors themselves. Large language models typically have multiple layers of multi-head self-attention circuits, that is, parallel sets of self-attention heads, and connecting feedforward layers, which together may have billions of trainable parameters. The parameters are optimized using stochastic gradient backpropagation methods [7]. However, at inference time the coefficients of the model are frozen, and the model produces outputs in an autoregressive manner based on the input prompt or data. The self-attention circuits and the feedforward neural network models in LLMs can be seen as associative memory devices which have a remarkable theoretical capacity to store information and content and to generalize from the learned sequences of text to solve various types of question answering, classification, and generation tasks.

1.1. Research Gap

Insight into how transformer models store and recall structured knowledge, in particular, is still somewhat limited. A better understanding of these mechanisms can help optimize model performance in practical applications where very large general-purpose models are not practical or desired. For example, in healthcare applications, a transformer-based model could assist clinicians through information displays and wearable devices such as watches or smart glasses [8,9,10]. For privacy and reliability reasons, the preferred system would run on a local on-edge device and require only modest computing power, while retaining the ability to retrieve essential knowledge in the topic area. In an edge or wearable use case, the inference of a transformer model can be implemented as a static Application-Specific Integrated Circuit (ASIC) system [11], which is significantly more energy efficient than an implementation on a Field Programmable Gate Array (FPGA) or a programmable processor. For example, ref. [12] refers to results showing that an FPGA implementation of typical tensor computations is about 25 times more performance/power efficient than a V100 GPU, and an often-cited study [13] found an ASIC implementation to be 87 times more efficient than an FPGA. Currently, several chipmakers provide ASICs, such as the Google TPU, for deep learning applications. Taking into account the requirements of the target application, the transformer size, that is, the dimensions of the internal vectors and the number of heads and layers, can be further minimized. Stochastic rounding in floating-point arithmetic, included in current standards [14], will further support the static implementation of transformer-based generative models in ASIC systems [15]. This is because output randomization via a sampling temperature improves model performance in text generation and reasoning [16]. In summary, an ASIC implementation scaled to a target problem may improve power efficiency by several orders of magnitude over a programmable solution.
Recent theoretical and empirical studies have aimed to characterize and quantify the memorization capacity inherent to transformer-based architectures. Provable bounds on memorization capacity were established by Kim et al. [17], who showed that a transformer with $\tilde{O}(d + n + \sqrt{nN})$ parameters can memorize $N$ sequences, with $d$ and $n$ denoting the embedding dimension and sequence length, respectively. Subsequent work by Kajitsuka and Sato [18] demonstrated that $\tilde{O}(\sqrt{nN})$ parameters constitute not merely a sufficient but also a necessary condition for certain transformer variants. These theoretical results were further refined by studying how multi-head attention shapes memorization dynamics, uncovering tight dependencies between specific architectural choices and a model’s capacity to encode and retrieve learned content [19].

1.2. Contributions

In this paper, we review some of the recent work in characterizing the architectural requirements of transformers in specific edge applications. Some parts of the content have been summarized from three earlier papers by the current authors [20,21,22]. The transformer can be seen as an associative memory that has a large capacity to memorize the training content. In some cases, the model can retrieve or regurgitate the training data exactly. On the other hand, the model can also generalize or create abstractions of the content seen in training—specifically, we analyze how architectural choices, such as positional encodings and data formatting, determine whether a model can extrapolate rules to longer sequences.

1.3. Novelty

In order to better understand the capabilities of the architectures, the results of three computational experiments are reported in this contribution. The experiments, together with an overview of the literature and models to characterize the mechanisms, give a good understanding of some of the capabilities of the LLMs. The proposed Empirical Capacity Model (ECM), certain design laws for synthetic text learning in compact transformer architectures, and the abstraction head mechanism are novel contributions of this article. The main focus is on transformer properties, but we will also discuss the implications for the circuit design of on-edge intelligent systems and robotics.

2. Methods and Materials

In the first setting, we try to characterize the memory capability of transformer systems in the case of collections of random sequences. We also derive a design formula and an empirical capacity model, ECM, which link the architectural parameters and memorization capacity. This is followed by a memorization study with synthetic sentence material generated from knowledge graphs in the healthcare domain. Next, we demonstrate the properties of in-context learning in compact transformer models using synthetic sequences and sequence templates. We also provide an overview of the generalization performance of the models with the main focus on length generalization.
There are studies on learning explicit facts or relational patterns in sequences, for example, in [23,24], but it turns out to be difficult to define the goal precisely. In the experiments of this paper the goal is to memorize, i.e., to overlearn the sequences exactly as they are presented in the training material. This is also called shattering [25]. The self-attention circuit in the layers of a transformer model is essentially an associative memory. It can memorize data, and the capacity depends on the count of the model parameters. Transformer models resemble Hopfield networks [26] and other associative memory architectures in many ways [27,28,29]. Associative memory circuits have a significant capacity to store information that depends on the number of parameters and the choice of architecture [17,19]. However, it is challenging to convert those findings into a realistically achievable capacity. Allen-Zhu et al. have studied the capacity in transformers with synthetic data generated from knowledge bases [30,31,32]. The goal was to retrieve a piece of human knowledge [31], which may be hard to define using real text data. They provided a rule, based on experiments, that each parameter of a transformer can store two bits of such data [31].
The scaling laws derived for large transformer models are usually expressed in terms of training loss [33]. The behavior of the loss does not directly measure the capacity. Kim et al. demonstrated that the capacity of transformer models exhibits a behavior that is asymptotically in line with theoretical models [17]. Mahdavi et al. introduced a model that characterizes the capacity of a single self-attention layer and validated it against empirical performance [19]. However, comparatively few previous works have investigated the practically attainable storage capacity of full transformer models with the specific goal of deriving actionable design guidelines.

2.1. Memorization of Random Sequences

In this first experiment, the goal was to predict memorization by training models of varying sizes on generated sequence data. Using a range of experiments that modify the model size and architectural parameters, we can fit a function that predicts the expected capacity of a given model. The model’s prediction can be viewed as an empirical lower bound on capacity, making it a practical instrument for designing architectures that meet the performance demands of a specific application. The proposed tool can forecast the attainable capacity of new models with different architectural parameters. The use of the model in the design process can lead to cost savings, reductions in energy use and carbon emissions, and other advantages of on-edge hardware for efficient inference.

2.1.1. Capacity in Networks

A fundamental characteristic of a Hopfield network is the storage capacity, defined as the largest number of unique data patterns that the network can retain. The capacity of a Hopfield network with N nodes can be expressed as $C \cdot N/\log(N)$, in which $C < 1/2$ [34]. Typically, data items are Boolean vectors of length N. If a Hopfield network had a size similar to a GPT-3 language model [1], with 175B parameters, it could, in theory, memorize more than three billion 2048-token sequences. The size of the training data set used for this model is 500 billion tokens, which corresponds to roughly 7 percent of the theoretical capacity of this thought-experiment network. With Zhu’s rule of 2 bits per parameter [30], this would correspond to more than 40 GB of human knowledge.
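As a rough check of this thought experiment (our own back-of-the-envelope calculation, treating each of the 175 B parameters as a node and assuming a natural logarithm), the capacity bound evaluates to
$C \cdot N/\log(N) \approx 0.5 \times 1.75 \times 10^{11} / \ln(1.75 \times 10^{11}) \approx 8.75 \times 10^{10} / 25.9 \approx 3.4 \times 10^{9},$
i.e., on the order of three billion stored patterns, and the two-bits-per-parameter rule gives $1.75 \times 10^{11} \times 2~\mathrm{bits} \approx 44~\mathrm{GB}$.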
Storage can be measured explicitly, as above, but also indirectly. For example, the mean squared error (MSE) between the original and recalled patterns is obviously a related measure [35]. The recall rate [36] can also be evaluated, defined as the frequency with which patterns are accurately reconstructed when provided with a pattern with a known count of missing bits. The minimum initial cue, defined by Steinberg et al. [37], gives a minimal recall error estimate. The minimum initial cue is given by $l_c = L_0 / L$, where $L$ is the length of the pattern and $L_0$ is the length of the cue.
The basin of attraction [37,38] (BoA) is also often used to characterize memorization. The BoA can be interpreted as the region surrounding a pattern in which every state is drawn to that pattern within a specified period of time. There is no closed-form analytical method to investigate the BoA; instead, it is typically examined by plotting a graph that relates the number of successful outcomes from many trials to the number of input errors.
Krauth [39] introduced the concept of the degree of symmetry for Hopfield networks, which can be obtained by
$\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, w_{ji}}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}^{2}}$
Many researchers have argued that removing the full connectivity and the complete symmetry of the corresponding weight matrix is necessary to reach maximum capacity. Other authors contend that an asymmetric weight matrix increases the number of spurious attractors.
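For illustration, the degree of symmetry can be computed directly from a weight matrix as in the following short NumPy sketch (our own helper; the function name is not from the cited work).

```python
import numpy as np

def degree_of_symmetry(W: np.ndarray) -> float:
    """Krauth-style degree of symmetry: sum_ij w_ij * w_ji / sum_ij w_ij^2."""
    return float(np.sum(W * W.T) / np.sum(W ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))
print(degree_of_symmetry((W + W.T) / 2))  # symmetric matrix -> 1.0
print(degree_of_symmetry(W))              # random asymmetric matrix -> close to 0
```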
An FFN with a single hidden layer containing $N/d$ hidden units has been shown to exactly represent any two-class classification task, or dichotomy, defined on $N$ vectors in $d$ dimensions [40]. In other neural network architectures, ReLU feedforward networks were studied in [41], where the capacity approaches $\tilde{O}(N)$. Kim et al. [17] suggested that an LLM with $\tilde{O}(d + n + \sqrt{nN})$ parameters should be able to store $N$ sequences, where the vector dimension of the token is $d$ and the number of tokens is $n$. However, this observation does not yet yield a concrete design principle that specifies a memorization requirement. Mahdavi et al. [19] showed that an $H$-headed self-attention model with $\Theta(H d^2)$ parameters can have the capacity to store $O(HN)$ sequences of $N$ elements. Yet, the construction assumes $d = d_h$, which is not the standard setting in typical transformer architectures. Overall, one can see that when the suggested dimensions are large, even moderately sized transformers should have a significant capacity to store data.

2.1.2. Transformer Models

The input of the model $X$ consists of a sequence of $N$ discrete symbols, or tokens, $t_i$. Moreover, each sequence of tokens is represented by an array of vectors with $B$ elements, $x_i$, $i = 0, \ldots, N-1$. The self-attention, SA, operation is commonly given by
$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$
where the matrix terms are obtained from the input $X$ by $Q = X W_Q$, $K = X W_K$, and $V = X W_V$, respectively. The matrices $W_Q$, $W_K$, and $W_V$ have dimension $B \times d_h$, corresponding to the selected token vector size $B$ of $x_i$ and the head dimension $d_h$. The three matrices are learned from the training data. Manipulation of the $Q K^{T}$ term gives
$Q K^{T} = X W_Q (X W_K)^{T} = X W_Q W_K^{T} X^{T} = X W_A X^{T},$
The quadratic form on the right-hand side, with one $d \times d$ matrix $W_A$, is equivalent to the $Q K^{T}$ form with two $d_h \times d$ matrices. In LLMs, it is common to make $d \geq 4 d_h$, which makes the second, query–key form of Equation (3) much more efficient than a $d \times d$ matrix operation. Notably, with $d \approx 2 d_h$, one can see that the effective number of parameters in an SA circuit is $d^2 + d\, d_h \approx \tfrac{3}{2} d^2$.
If $W_V$ is removed, which is often the case in LLMs, the operation is equivalent to
$\mathrm{Attn}(Q, K, V) = M X$
Now we can see that $M = f(X)$, where $f(\cdot)$ represents the function learned from the training data. Thus, the SA operation is to multiply the array of input vectors in $X$ by a matrix that depends on $X$.
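To make the equivalence between the factored query–key form and the combined matrix $W_A = W_Q W_K^{T}$ in Equations (2) and (3) concrete, the following minimal NumPy sketch (our own illustration, with arbitrary dimensions) computes the attention logits in both forms and verifies that they coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_h = 8, 32, 8          # sequence length, token dim, head dim (arbitrary choices)
X = rng.normal(size=(N, d))
W_Q = rng.normal(size=(d, d_h))
W_K = rng.normal(size=(d, d_h))
W_V = rng.normal(size=(d, d_h))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Factored query-key form: two d x d_h matrices.
logits_qk = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_h)

# Combined form: one d x d quadratic-form matrix W_A = W_Q W_K^T.
W_A = W_Q @ W_K.T
logits_wa = X @ W_A @ X.T / np.sqrt(d_h)

assert np.allclose(logits_qk, logits_wa)   # Equation (3): same attention logits
out = softmax(logits_qk) @ (X @ W_V)       # Equation (2): weighted sum of value vectors
print(out.shape)                           # (N, d_h)
```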
A complete transformer architecture, illustrated in Figure 1, is composed of multi-head attention modules, each comprising H parallel and independent self-attention (SA) mechanisms. An arrangement of H SA units operating in parallel on the same input is referred to as a multi-head attention block. In practice, large-scale models typically consist of multiple stacked layers of such multi-head attention blocks. For clarity, normalization and dropout components are omitted from the figure; however, they are included in the implementation employed in the experiments described below.
Each layer also has a feedforward neural network, FFN, typically followed by a GELU activation function. The FFN system often has even more parameters than the SA circuits. It seems that the FFN has an important role in memorizing frequently recurring phrases and patterns, acting as key–value memories, in text content [42]. However, it is commonly assumed that the SA circuits play a central role in sequence memorization. To demonstrate this, we compared models with standard trainable FF layers to models in which the coefficients of the FF layer were frozen to implement an identity mapping, passing inputs to outputs unchanged. The performance of a two-head, two-layer transformer for various token vector sizes $B$ is presented in Figure 2. For small vector dimensions, the curves coincide, indicating that the FF layers do not play a role in memorizing the synthetic data used in this work. At higher values of $B$, the capacity appears to even improve when the FFN parameters are frozen. A plausible explanation is that the fully trainable model would need more training iterations to reach the same capacity as the simpler model in which the FF layer is hard-wired.
The overall parameter count in an SA network is determined by the vector dimensions d h and d, the maximum input sequence length N, and the number of heads H and layers L. Figure 3 shows the total number of trainable parameters for different model configurations. In the experiment based on random sequences, we keep the parameters of the input embedding and the final output linear layer fixed. Consequently, only the parameters of the SA and FFN modules remain trainable.

2.1.3. Data Generation

In autoregressive modeling, the aim is to predict the next token $t_p$ in a sequence given the past tokens, i.e.,
$t_p = \operatorname{argmax}_k\, F(t_k \mid t_i,\ i = 0, \ldots, k-1)$
The model $F(\cdot)$ above represents the transformer combined with a softmax activation function so that
$F(x_i,\ i \leq k) = \mathrm{softmax}(\mathrm{Attn}(Q, K, V))$
The capacity $C$ in this experiment measures the number of sequences in which the network can correctly predict the $N$th token from a sequence of $N-1$ preceding tokens. The chance of choosing the correct token at random is $p = 1/T$. Consequently, the chance that a model generates the correct token fewer than $R$ times in the case of $K$ data points is given by
$P(r < R) = \sum_{r=0}^{R-1} \binom{K}{r} \left(\frac{1}{T}\right)^{r} \left(1 - \frac{1}{T}\right)^{K-r}$
When $K = 2048$ and $T = 128$, for example, the chance of fewer than 25 correct generations is $P(r < 25) = 0.96$. Furthermore, the probability of more than, say, 40 right guesses in this system is very unlikely, that is, $1 - P(r < 40) \approx 2.8 \times 10^{-7}$. The probability measure in (7) can be used to evaluate how well the empirical capacity measures introduced in this experiment hold. For a library of size $K$, the expected number of correct guesses is given by
$C_{\mathrm{offst}} = \frac{K}{T}.$
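As a sanity check of Equation (7), the chance-level statistics for the example above can be computed directly from the binomial distribution, for instance with SciPy (our own verification script).

```python
from scipy.stats import binom

K, T = 2048, 128            # library size and vocabulary size from the example
dist = binom(n=K, p=1.0 / T)

print(dist.cdf(24))         # P(r < 25): at most 24 chance-level hits
print(1.0 - dist.cdf(39))   # 1 - P(r < 40): probability of 40 or more lucky guesses
print(K / T)                # expected chance-level offset, C_offst = 16
```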
In Section 3, we describe the results of a large number of experiments with synthetic sequences and fit a model to the results to get an empirical rule linking memory capacity and architectural parameters.

2.2. Memorization of Synthetic Sentences

The second experiment employs sequential data derived from a knowledge graph that, although generated under controlled conditions, retains much of the hierarchical and relational structure found in real-world text. In practice, small-scale decoder-only transformer models [1] were trained to memorize structured sentences produced from the Systematized Nomenclature of Medicine (SNOMED) knowledge graph (KG) [43], a large medical ontology encoding semantic relationships among clinical concepts, providing a rich data set for studying memory mechanisms under realistic conditions. In the healthcare scenarios outlined above, exact retention of specific relational facts would be critical. Rather than targeting all LLM types or domains, the goal is to establish a practical and reproducible protocol for assessing memorization on realistic KG-derived data. Task simplicity is an intentional design choice: introducing more complexity or reducing experimental control would mix memorization with generalization effects, making it difficult to draw clear, interpretable conclusions about model capacity.
These experiments serve as a proof-of-concept, demonstrating that structured real-world data provides a viable substrate for empirical memorization evaluation. Three contributions emerge from this setting: a reproducible pipeline for converting large ontologies into tokenized training corpora; an empirical evaluation of how transformer architecture affects memorization capacity, informed by prior theoretical work; and an analysis of failure cases—instances where models with nominally sufficient capacity nevertheless fail to retain all training samples—pointing to open questions in training dynamics and error analysis.
Our results are not intended to define universal scaling rules or generalization patterns, but rather to introduce a reproducible framework for examining memory-constrained models under practical, real-world limitations.

2.2.1. Data Generation

To evaluate transformer memorization and retrieval behavior, the SNOMED KG was used as the underlying data source. This graph represents medical concepts as nodes and semantic relationships as directed edges. It was accessed using the owlready2 library [44], with non-informative and overly specific properties excluded to retain only semantically meaningful links. Rather than using graph neural networks as in graph transformer approaches [45], a universal architecture was adopted that converts the graph into two flat representations: (1) relational triplets of the form (concept, property, related concept), and (2) sequences produced by simulated graph traversal.

2.2.2. Triplets Generation

A dataset of the form (Concept, Property, Related Concept) was constructed from the SNOMED KG (see Figure 4A). The construction proceeds in two stages: first, non-informative properties are removed; second, for each concept node, all valid properties and their associated target concepts are extracted. When a given (Concept, Property) pair yielded multiple target concepts, one was drawn at random to ensure uniqueness across the dataset.

2.2.3. Sequences Generation

The generation of sequences by graph traversal is illustrated in Figure 4B. We first remove banned properties and add reverse edges to enable bidirectional traversal. For each sample, a random starting node was selected, a breadth-first search (BFS) subgraph of depth 5 was extracted, and a sequence of the form $(\mathrm{node}_1, \mathrm{edge}_1, \mathrm{node}_2, \ldots, \mathrm{node}_{n-1}, \mathrm{edge}_{n-1}, \mathrm{node}_n)$ was randomly generated by traversing previously unused (node, edge) pairs. The target number of edges was sampled from 3 to 5; generation stopped when this limit was reached or when no valid neighbor remained. This process was repeated until the required number of sequences was obtained.
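The traversal procedure can be summarized by the following simplified sketch (our own Python illustration; the actual pipeline operates on the SNOMED ontology via owlready2, which is omitted here, and the sketch assumes the cleaned graph is available as an adjacency map from each node to its (edge label, neighbor) pairs, with reverse edges already added).

```python
import random
from collections import deque

def bfs_subgraph(adj, start, depth=5):
    """Collect nodes reachable from `start` within `depth` BFS hops."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for _, nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

def sample_sequence(adj, rng=random):
    """Random traversal over previously unused (node, edge) pairs, 3-5 edges long."""
    start = rng.choice(list(adj))
    allowed = bfs_subgraph(adj, start)
    target_edges = rng.randint(3, 5)
    seq, used, node = [start], set(), start
    while len(seq) // 2 < target_edges:
        options = [(e, n) for e, n in adj[node]
                   if n in allowed and (node, e) not in used]
        if not options:
            break                      # no valid neighbor remains
        edge, nxt = rng.choice(options)
        used.add((node, edge))
        seq.extend([edge, nxt])
        node = nxt
    return seq
```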

2.2.4. Transformer Training

Decoder-only transformer models were implemented across a range of architectural configurations. Each distinct element (whether a node or an edge label) was assigned a unique integer identifier, repeated occurrences receiving the same token, and positions were encoded through a learned positional encoding. Each model comprised an embedding layer projecting token identifiers to continuous representations, one or more decoder blocks with multi-head attention, and a final linear projection for next-token prediction.
Across all setups, the task was to predict a target concept given the preceding sequence of concepts and relational links. Prediction quality was measured as the ratio of correctly predicted target concepts to total predictions, $\#\mathrm{correct\_predictions} / \#\mathrm{total\_predictions}$. Maximum attainable capacity (MAC) was also adopted as a capacity metric, offering a computationally efficient alternative to maximum library size (MLS). MLS requires iteratively fitting models to growing datasets until the largest fully memorizable library is identified, whereas MAC directly estimates the peak number of samples a model can retain when drawn from a large library. The two measures have been shown to correlate strongly [20], making MAC a practical substitute for this study.
To reduce the impact of randomness, each configuration in Setups 1–2 was repeated 10 times, whereas each configuration in Setups 3–4 was repeated 3 times. Unless stated otherwise, reported values and plotted curves correspond to the mean across repeated runs, with variability shown as ± 2 σ (twice the standard deviation). Training accuracy was evaluated every two epochs for all setups.
All models were built using PyTorch v1.13.1+cu117 [46] and transformers v4.30.2 [47] and were optimized against the cross-entropy loss using the Adam optimizer at a learning rate of 0.001 [7]. To facilitate reproducibility, a fixed base random seed (566) was used throughout, together with deterministic CUDA/cuDNN settings and seed offsets applied across repeated runs. As noted above, each configuration was executed 10 times for Setups 1–2 and 3 times for Setups 3–4, and we report the mean ± 2σ. In total, 546 models were trained on an NVIDIA A100 GPU, totaling approximately 3100 h of training time. Model sizes spanned 2.9 to 44.5 million parameters, with embedding size and layer count as the primary sources of variation, supplemented by vocabulary size.

2.2.5. Triplets and Sequences Memorization

Three experimental setups were constructed for the triplets dataset. In each case, the target concept was predicted from a unique concept-relation pair, making the correctness unambiguous.
The first setup held the architecture fixed while varying the training set size between 50,000 and 100,000 samples. A single transformer layer was used throughout (embedding size 128, four attention heads, Rectified Linear Unit (ReLU) activation [48], batch size 64, 500 epochs). This configuration was designed to assess memorization performance using a fixed architecture while changing the dataset size.
The second setup varied both depth and activation function: models with 1, 2, or 4 transformer layers were combined with four activation choices (ReLU, Gaussian Error Linear Unit (GELU) [49], Randomized Leaky Rectified Linear Unit (RReLU) [50], and Softmax [51]), with dataset sizes of 50,000 , 70,000 , or 100,000 triplets. To improve comparability between architectures with different depths, we fixed the embedding budget at 128 and adjusted the embedding dimension (d_model in the PyTorch transformer implementation) inversely with the number of layers, according to
$d\_\mathrm{model} = \frac{128}{n\_\mathrm{layers}}.$
This fixed-budget design constrained the models’ representational capacity and reduced the confounding effect of larger per-layer representations when comparing configurations with different depths. Batch size was set to 128 and training ran for 1000 epochs, as shorter runs did not reliably converge to a stable performance plateau.
The third setup examined the interaction between model depth and embedding size, while all other hyperparameters were held constant. Layer count was set to 1 or 2, and the base embedding size took values in { 16 , 32 , 64 , 128 } (total parameter counts were recomputed for each configuration using the same formula as in the second setup). Dataset sizes of 1000, 10,000 , 50,000 , and 100,000 triplets were used. Only a softmax activation function and four attention heads were employed. To enable fair comparisons, the setups were constructed to assess how increasing the embedding dimensions and model depth affects memorization. In this configuration, the batch size was set to 128, and training was performed for 500 epochs.
For the sequence memorization task, the same tokenization procedure as for triplets was applied, extended by two standardization steps: sequences were zero-padded to a uniform length (serving as both a placeholder and an end-of-sequence marker), and a node mask was used to separate node tokens from edge tokens during metric computation. Each node was predicted from all tokens that preceded it, so the final node in a sequence had access to the greatest amount of context. This design offered additional insight into how well the models capture the relational patterns inherent in structured sequential data.
The architecture for training on sequences matched the triplet setups: embedding size 64, four attention heads, batch size 128, and 400 training epochs. Models with 1, 2, or 4 layers were evaluated, using RReLU and Softmax activations. The dataset sizes were 20,000 , 50,000 , and 100,000 sequences, each containing 4–6 nodes (3–5 edges), built from BFS subgraphs of depth 5.
Accuracy for this experiment was computed as the fraction of correctly predicted tokens at node positions out of the total number of node-level predictions across all sequences, excluding sequence-starting nodes. The aggregate count of correct predictions equals the MAC for this setting.
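A minimal sketch of this node-masked accuracy computation (our own PyTorch-style illustration; the tensor names are not from the released code) is as follows.

```python
import torch

def node_accuracy(logits, targets, node_mask):
    """Fraction of correctly predicted tokens at node positions.

    logits:    (batch, seq_len, vocab) next-token scores
    targets:   (batch, seq_len) ground-truth token ids
    node_mask: (batch, seq_len) True at node positions to score
               (padding, edge tokens, and sequence-starting nodes set to False)
    """
    preds = logits.argmax(dim=-1)
    correct = (preds == targets) & node_mask
    return correct.sum().item() / node_mask.sum().item()
```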
All code and data pertinent to this section are available at https://github.com/um-dacs-nlp/capacity/ (accessed on 10 April 2026).

2.3. Generalization and Abstraction

A critical question in the study of transformer-based language models is to what extent their reasoning abilities arise from memorization of training data versus generalization. In this work, we use the term generalization to highlight the ability of a model to correctly process inputs that cannot be solved by recalling previously seen sequences. Solving such inputs instead requires the model to identify a pattern and generalize it to similar abstract instances. We focus specifically on in-context generalization, where the model identifies the pattern during inference without any parameter updates or explicit training on the unseen instances.
The idea that large language models can acquire task-relevant behavior at inference time is well established. Extremely large models have been shown to perform in-context learning and chain-of-thought reasoning, exhibiting capabilities that are not directly attributable to memorization of their training data alone [1].
To make this feasible, it is important to use a task where memorization alone is not sufficient to solve the problem, as relying on the memorization capabilities of transformers would not provide evidence of abstraction or reasoning. In the following sections, we introduce an abstract sequential symbolic pattern problem in an attempt to understand the generalization capabilities of transformers.

2.3.1. Data Generation

To study generalization independent of memorization, we constructed a synthetic sequence prediction task based on abstract symbolic templates. The goal of the task is to require the model to understand the underlying abstract structure of the input context and apply it to new instances during inference. An abstract template is defined as a symbolic sequence composed of a small set of variables such as A, B, and C, arranged according to a fixed pattern. For example, the template
ABCABCAB
represents a symbolic structure in which the variables represent placeholders rather than specific token values. The template can then be instantiated by assigning concrete values to the variables. For example, assigning (A, B, C) = (1, 2, 3) results in the concrete sequence 12312312, while (A, B, C) = (4, 5, 6) results in the sequence 45645645. Although these sequences differ at the token level, they share the same abstract structure. To ensure that solving the task is possible for both a human and the algorithm, we exclude cases where not all variables are present (e.g., ABABABAB), or where the final position is a newly introduced variable (e.g., ABABABAC). Abstract sequences that can result in the same concrete values are treated as duplicates, resulting in a reduced set of unique abstract patterns (e.g., ABCABCAB = ACBACBAC when assigning (A, B, C) = (1, 2, 3) and (A, B, C) = (1, 3, 2), respectively).
Each training and evaluation instance consists of multiple instantiations of the same abstract pattern concatenated into a single input sequence. Specifically, four distinct instantiations are present in each training/evaluation instance. This choice ensures that the model is exposed to at least one instantiation of the abstract pattern, which provides enough context for generalizing over the rest of the input sequence. For example, the abstract pattern ABCABCAB produces the following input sequence:
[12312312|45645645|78978978|15915915]
To further test the model’s ability, we appended the unique values of the variables used in every instantiation to the end of each input sequence. Since variables A, B, and C represent abstract placeholders, their concrete values are appended in the order of their first appearance. For example, consider an instantiation of the abstract pattern ACCABBAC. The first appearance order of the variables is A, C, B, so the appended token sequence is ACB, rather than ABC. This encoding reflects the symbolic structure actually observed in the instantiation rather than the abstract template itself, which the model has no access to (i.e., the model has no way of knowing whether an abstract token is an A, a B, or a C). Applying this transformation to the earlier example (ABCABCAB) gives the following input sequence.
[12312312|45645645|78978978|15915915|123|456|789|159]
This addition requires the model not only to infer the abstract template, but also to symbolically map the abstract variables to the concrete values according to their order of appearance.
The task is formulated as an autoregressive next-token prediction problem, where the concrete instantiation that appears later in the sequence has not been seen during training or among the earlier instantiations in the sequence. As a result, correct prediction cannot be achieved by memorization but instead requires the model to infer the abstract pattern from the earlier instantiations and apply it to unseen symbol assignments.
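For concreteness, the data generation step can be sketched as follows (our own illustration; the digit alphabet, separators, and brackets are arbitrary choices rather than those of the released code).

```python
import random

def instantiate(template, values):
    """Map template variables (e.g. 'A','B','C') to concrete digits in order of first appearance."""
    order = list(dict.fromkeys(template))          # first-appearance order, e.g. 'ACB'
    mapping = dict(zip(order, values))
    seq = "".join(mapping[v] for v in template)
    suffix = "".join(mapping[v] for v in order)    # appended variable values, e.g. '123'
    return seq, suffix

def make_example(template="ABCABCAB", n_inst=4, rng=random):
    digits = list("123456789")
    parts, suffixes = [], []
    for _ in range(n_inst):
        values = rng.sample(digits, len(set(template)))   # distinct digits per instantiation
        seq, suffix = instantiate(template, values)
        parts.append(seq)
        suffixes.append(suffix)
    return "[" + "|".join(parts + suffixes) + "]"

print(make_example())
# e.g. [12312312|45645645|78978978|15915915|123|456|789|159]  (assignments are random)
```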

2.3.2. Evaluation

To quantify generalization performance on the abstract symbolic task, we evaluate two metrics: last-token prediction accuracy and variable matching accuracy. The primary metric is last-token prediction accuracy: for each input sequence containing four instantiations of the same abstract template, the model must correctly predict the final token of the last instantiation. It is not possible to infer the value of this token through memorization of previously seen sequences, as the concrete instantiation is unseen and only the abstract structure is shared with the model, in addition to at least one appearance of each variable to ensure the model has access to sufficient information.
The second metric evaluates the variable matching accuracy. After processing the instantiations, the model must output the symbolic values corresponding to each instantiation. The accuracy here is computed as the proportion of correctly predicted variables. That is, the metric measures how well the model tracked the ends of the concrete sequences, in addition to remembering the concrete values of the abstract placeholders. Together, these metrics help us quantify the performance of a model on an abstract task.

2.3.3. Models

To investigate how architectural depth influences abstract generalization, we evaluate a set of transformer models with different numbers of attention heads and layers. All models follow the standard transformer architecture with multi-head self-attention and a feed-forward layer and are trained on the synthetic autoregressive task with cross-entropy loss as described in the previous sections. We experiment with three primary architectures: a 1-layer model with 8 attention heads, a 2-layer model with 4 heads per layer, and a 3-layer model with 2 heads per layer. Across all architectures, the hidden dimension is fixed at 128, and each model consists of a single feed-forward layer. Absolute positional embeddings are employed to represent positional information within the input sequences. All models were trained using the Adam optimizer with a learning rate of $10^{-3}$ and a batch size of 64. Training continued for 100,000 optimization steps.
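For concreteness, such a configuration could be instantiated with standard PyTorch modules roughly as follows (our own sketch; the actual implementation details, such as the feed-forward width, may differ).

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Decoder-only transformer: token + absolute positional embeddings,
    `n_layers` blocks of multi-head self-attention with one feed-forward layer each."""
    def __init__(self, vocab_size, max_len, d_model=128, n_heads=8, n_layers=1):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        n = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(n, device=ids.device))
        causal = torch.triu(torch.full((n, n), float("-inf"), device=ids.device), diagonal=1)
        x = self.blocks(x, mask=causal)        # causal mask -> autoregressive prediction
        return self.head(x)

# Configurations studied: (layers, heads) in {(1, 8), (2, 4), (3, 2)}, hidden dimension 128.
model = TinyDecoder(vocab_size=16, max_len=64, n_layers=1, n_heads=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```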

2.4. Generalization in Algorithmic Tasks

In this section, we focus primarily on length generalization, as this area has seen substantial foundational work [52,53,54], allowing clearer formalization of its core principles. In contrast, compositional generalization, while a prominent topic in modern research, often centers on large-scale reasoning models such as Gemini Thinking, ChatGPT variants (e.g., o1, o3), DeepSeek-R1, and rStar-Math [55]. These models demonstrate impressive performance in algorithmic reasoning tasks, largely due to training methodologies involving supervised fine-tuning (SFT) and reinforcement learning with one of several reward policies: group relative policy optimization (GRPO) [56], process reward models (PRMs) [57], or Monte Carlo tree search (MCTS) [55]. However, advancements in compositional generalization tend to emphasize training procedures (e.g., reward shaping or search-based optimization) rather than explaining how models internally generalize patterns. Furthermore, this field evolves rapidly, with shifting priorities toward enhancing “algorithmic thinking” rather than architectural interpretability.
While there is overlap in tasks like addition (where length and composition interact), they test distinct capabilities: one tests the mastery of a procedure, and the other tests the flexibility in problem-solving.
  • Length generalization reflects algorithmic internalization: the model applies a fixed procedure to arbitrary input lengths.
  • Compositional generalization reflects algorithmic reasoning: the model dynamically assembles subroutines into novel workflows.
Consequently, length generalization provides a more stable and structured framework for investigating architectural mechanisms in transformers. By analyzing how models extrapolate to longer sequences independently of training dynamics, we can better isolate and improve their inherent generalization capabilities. Although this approach is usually tested on small transformer models and does not scale well to real-world problems in practice, it can serve as a foundation for work in the right direction.
Table 1 illustrates the diversity of algorithmic tasks studied in the literature, ranging from mathematical and reasoning tasks such as addition [58], polynomial evaluation, sorting, summation [59], and parity [60] to LEGO [61]. Among these, this paper prioritizes addition as a foundational case study for probing length generalization in transformers. We justify this choice as follows:
  • Simplicity and Interpretability: Addition is a well-defined deterministic task with minimal combinatorial complexity compared to operations like polynomial evaluation or sorting. Its stepwise nature (e.g., digit-wise processing with carry propagation) allows for granular analysis of how transformers encode sequential dependencies and positional reasoning.
  • Controlled Scalability: The input length in addition tasks can be systematically extended (e.g., from 5-digit to 10-digit numbers) without changing the underlying algorithm. This facilitates a precise evaluation of generalization beyond training lengths.
  • Prior Work: Addition has served as a canonical task in length generalization studies [52,53,54], enabling direct comparisons with existing architectural modifications (e.g., positional encoding schemes, attention biases) and training paradigms.

2.4.1. Preliminaries

Several recent architectural improvements, especially in position encoding [62,63,64] and attention mechanisms [65,66], have been proposed to tackle the length generalization challenge in arithmetic tasks using transformers. However, these modifications are often limited by their ad hoc nature or poor performance on longer sequences. Although scaling model and dataset sizes is known to improve performance, it might not be sufficient for generalizing to test sequences longer than those seen during training [60]. Therefore, in addition to architectural improvements, data-centric AI has driven research [67,68] to refine data formats to improve the learning quality of transformers. This section will review common data formats (Section 2.4.2) and positional embedding/encoding methods (Section 2.4.3) relevant to length generalization, with a focus on decoder-only transformers.

2.4.2. Data Generation

The structuring of data plays a role in improving the length generalization capabilities of transformer models by reformatting data into a representation that facilitates more effective learning. In the following, we provide an overview of the existing methodologies in this domain.
Reversed Format
Recent studies have demonstrated that reversing the response in arithmetic problems can substantially improve both the performance and the sample efficiency of neural models. For instance, ref. [69] shows that transforming an expression such as
653 + 49 = 702
into its reversed format,
653 + 49 = 207
allows a decoder-only transformer to generate the answer starting from the least significant digit (LSD) and proceeding towards the most significant digit (MSD). This reversal mirrors the traditional algorithm taught in elementary school, where addition is performed digit-by-digit from the LSD to the MSD.
Standard arithmetic expressions are typically written as
$A_3 A_2 A_1 + B_3 B_2 B_1 = C_3 C_2 C_1,$
where $A_1$ and $B_1$ denote the LSDs. This ordering poses a challenge for autoregressive models because they generate outputs sequentially beginning with the MSD, thus misaligning with the natural computational process. In contrast, the reversed format
$A_1 A_2 A_3 + B_1 B_2 B_3 = C_1 C_2 C_3$
aligns the generation order with the algorithmic steps of addition. The learning task is thus simplified to computing a function that depends only on the two corresponding operand digits and the carry from the previous addition step [52,69,70].
Index Hints
Index hinting is an input augmentation technique introduced by [52] to explicitly encode positional structure into arithmetic tasks. In this method, index hints are inserted into both the query and the response. For example, the arithmetic expression (42 + 39 = 81) is represented during training and inference as (a4b2 + a3b9 = a8b1), thereby enabling transformers to perform indexing through induction heads [71].
Random Space Augmentation
Ref. [70] investigated how inserting random spaces between digits in addition tasks could disrupt the model’s dependency on fixed positional cues. Their findings indicate that, while the model successfully generalized from 10-digit to 11-digit additions, its performance declined when handling even longer sequences.
Zero-Padding
Zero-padding ensures that both operands in a query have equal lengths and that the response maintains a fixed length corresponding to the operand length. In practice, padding an M-digit plus an N-digit addition with zeros reformulates the problem so that both operands have max { M , N } digits and the response has max { M , N } + 1 digits. For example, the expression (653 + 49 = 702) is transformed into (653 + 049 = 0702) [54].
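To illustrate these formats side by side, the following sketch (our own illustration; token conventions differ between the cited papers, and here the operands and the answer are all reversed) rewrites a single addition example under the reversed, zero-padded, and index-hint schemes.

```python
def format_addition(a: int, b: int, reverse=True, zero_pad=True, index_hints=False):
    """Render 'a + b = c' under the data formats discussed in the text."""
    c = a + b
    width = max(len(str(a)), len(str(b)))
    xs, ys, zs = str(a), str(b), str(c)
    if zero_pad:
        xs, ys = xs.zfill(width), ys.zfill(width)
        zs = zs.zfill(width + 1)
    if reverse:
        xs, ys, zs = xs[::-1], ys[::-1], zs[::-1]
    if index_hints:
        hints = "abcdefghij"
        tag = lambda s: "".join(h + d for h, d in zip(hints, s))
        xs, ys, zs = tag(xs), tag(ys), tag(zs)
    return f"{xs}+{ys}={zs}"

print(format_addition(653, 49, reverse=False, zero_pad=False))   # 653+49=702
print(format_addition(653, 49))                                  # 356+940=2070 (reversed, padded)
print(format_addition(42, 39, reverse=False, zero_pad=False,
                      index_hints=True))                         # a4b2+a3b9=a8b1
```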

2.4.3. Positional Embeddings/Encodings (PE)

The difficulty of transformers in extrapolating to longer sequences is largely attributed to their positional encoding mechanisms [72]. In the following section, we examine various positional encoding strategies, with a particular focus on their capacity for length generalization.
Absolute Positional Encoding (APE)
APE incorporates positional information into transformer models by assigning each position $i$ a unique vector $p_i$, which is combined with the token embedding (typically by addition) before entering the model. There are two main approaches to generating these position vectors. One approach uses a predefined sinusoidal function to produce periodic embeddings that can naturally extrapolate to unseen positions [5]. The alternative is to learn the position embeddings jointly with the model parameters, as seen in works such as [1,73,74].
Although APE offers a simple and effective mechanism for encoding position, both variants have limitations in generalizing to longer sequences. The learned version, in particular, is restricted to a fixed context window, which can hinder performance on inputs longer than those seen during training [62,75].
Additive Relative Positional Encoding (RPE)
RPE enhances the self-attention mechanism by incorporating a position-dependent bias into the pre-softmax attention logits. Originally introduced by [72], this approach modifies the keys (and optionally the values) in each attention layer. T5 further advanced the concept by mapping the relative distance between tokens to a scalar bias using a lookup table; this bias is then added to the dot product of queries and keys [76].
More recent methods build on this idea by proposing different functions for the scalar bias b ( i , j ) , which depends on the distance between positions i and j. For example, Alibi [75] subtracts a bias that grows linearly with the token distance to induce a recency bias, KerpleLog [77] uses a logarithmic function, and FIRE [78] employs a learnable MLP-based function to compute b ( i , j ) . In general, the modified attention logits can be expressed as
$A_{\mathrm{RPE}}(X) = X W_Q (X W_K)^{T} + B,$
where $X$, $W_Q$, and $W_K$ denote the input and the weight matrices for queries and keys, and the bias matrix $B \in \mathbb{R}^{n \times n}$ is determined by the function $b(i,j)$.
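As a concrete example of such a bias, the snippet below constructs an ALiBi-style matrix $B$ with a single linear slope for causal attention (our own simplified sketch; the published methods differ in slope schedules and implementation details).

```python
import torch

def alibi_bias(n: int, slope: float = 0.5) -> torch.Tensor:
    """B[i, j] = -slope * (i - j) for j <= i, and -inf for future positions (causal mask)."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    bias = -slope * (i - j).clamp(min=0).float()
    return bias.masked_fill(j > i, float("-inf"))

# Added to the pre-softmax logits: A = Q @ K.T / sqrt(d_h) + alibi_bias(n)
print(alibi_bias(4))
```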
Position Coupling
Position coupling assigns position IDs that encode the structure of a task by leveraging the inherent grouping of tokens. The method involves two key steps:
  • Token Partitioning: The input sequence is divided into groups of consecutive tokens, where each token within a group carries a unique semantic meaning. This grouping enables one-to-one correspondence between tokens in different groups that are relevant to the task.
  • Position ID Assignment: For each group, a sequence of consecutive numbers (typically positive integers) is assigned as position IDs, beginning from a random number during training or a fixed number during evaluation. Tokens that represent the same significance across different groups are given the same position ID (i.e., their positions are coupled).
For example, in a decimal addition task, the expression 653 + 49 = 702 is transformed (via reversal and zero-padding) into a format like $653 + 049 = 2070$. Here, tokens are partitioned into three groups: (1) the first operand along with the ‘+’ token, (2) the second operand, and (3) the ‘=’ token together with the sum. Position IDs are then assigned such that digits with the same significance across the operands and the sum receive the same ID. For instance, if the starting position ID is 6, the operands might be labeled 6, 7, and 8, while the sum’s digits are labeled in reversed order (e.g., 5, 6, 7, 8), with non-digit symbols receiving IDs based on their adjacency to numerical tokens (see Figure 5).
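A simplified sketch of this coupled ID assignment for the example above is given below (our own illustration; the handling of the ‘+’, ‘=’, and ‘$’ tokens is only approximate, and the exact convention follows Figure 5).

```python
def coupled_ids(start: int = 6, n: int = 3):
    """Coupled position IDs for the example '$653+049=2070$' (answer reversed and zero-padded).

    Digits of equal significance in both operands and in the sum share an ID; the operands'
    most significant digit gets `start`. Non-digit tokens are handled only approximately."""
    operand = list(range(start, start + n))              # operand digits MSD..LSD, e.g. [6, 7, 8]
    answer = list(range(start + n - 1, start - 2, -1))   # sum digits LSD..carry, e.g. [8, 7, 6, 5]
    plus, equals, pad = start + n, start + n, 0
    return [pad] + operand + [plus] + operand + [equals] + answer + [pad]

tokens = list("$653+049=2070$")
print(list(zip(tokens, coupled_ids())))
```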
Randomized Position Encoding
Randomized PE [63] is a method that enhances traditional positional encodings by sampling from a range that extends beyond the typical test-time length while maintaining token order. This training strategy enables transformers to adapt to larger positional encodings, thereby effectively mitigating issues with out-of-distribution position encodings during testing.
No Positional Encoding (NoPE)
Encoder-only transformers, such as BERT [73], maintain invariance to the order of tokens even without positional encodings. In contrast, decoder-only models using causal attention have been shown by [79] to develop positional awareness on their own without explicit PE. Moreover, ref. [62] have recently demonstrated that, for simple algorithmic tasks, models without any positional encoding can outperform those that employ specialized positional encoding techniques.
Rotary Positional Encoding (RoPE)
RoPE, as introduced by [80], incorporates positional information into the attention logits by applying a rotational transformation to the query and key vectors based on their relative positions. Although this method is simple and effective, its ability to generalize to longer sequences remains limited [62,75]. Extensions like Position Interpolation [81,82] can extend the context length of RoPE, but do not necessarily enhance its generalization performance on algorithmic tasks where understanding the underlying algorithm is critical.
Having introduced the main positional encoding schemes and data formatting strategies, Figure 6 illustrates how representative studies in the length generalization literature combine these approaches for the integer addition task.

3. Results

The results of the methods discussed above are presented in this section in the same order as they were introduced in Section 2.

3.1. Memorization of Random Sequences

There are multiple ways to quantify memorization in learning experiments. In the maximum library size (MLS) approach, the objective is for the network to memorize every element in an input vector library. Here, the capacity is assessed by determining the largest library that can be completely memorized. In contrast, the maximum attainable capacity (MAC) approach trains the model on a large library and aims to identify the highest number of samples the network is able to memorize. Compared to the MAC method, the MLS method is clearly more computationally intensive.

3.1.1. Capacity in the Two Formulations

The two ways of measuring memorization capacity are compared in Figure 7. The curves show similar overall trends as a function of the size of the embedding vector $B$. However, the MAC estimates have a bias which is caused by the possibility of generating the correct token by chance, as discussed above. For example, for K = 32,000 sequences and a vocabulary of T = 128 tokens, Equation (8) gives K/T = 250, which corresponds to the difference between the MLS and MAC curves.
For larger vector dimensions B, the MAC capacity, once corrected for offset, appears to fall short of the estimates of MLS-based capacity. Consequently, we may treat the MLS results as a lower bound on storage capacity. Because the MAC setting corresponds more closely to practical applications, this paper focuses on modeling the outcomes of MAC experiments.
The likelihood that model training requires $k$ epochs before the model shatters (MLS condition) is expected to be described by a negative binomial distribution given by
$\Pr(X = k) = \binom{k + r - 1}{k} (1 - p)^{k} p^{r}$
For instance, Figure 8 depicts the histogram of the number of epochs required for a transformer model to shatter, meaning to perfectly learn all sequences, across 1000 independent runs, in the setting of 16 sequences of length $N = 8$ and a token vector dimension $d = 16$, resulting in a total of $2 \times d^2 = 512$ parameters.

3.1.2. Impact of Batch Size

Figure 9 shows how many MAC sequences of length 32 or 128 tokens can be memorized as a function of batch size over long training runs. For small batch sizes, memorization remains low, which aligns with the large variability observed in the gradient noise. As the batch size increases, the memorization capacity rises. For the model sizes tested in this work the capacity levels off once the batch size reaches 512.
In these experiments, the transformer model was implemented using the Python x-transformers library (PyPI package 2.18.9) [83] and trained with the PyTorch Adam optimizer [7] using the default hyperparameter configuration of the library. The batch size can be chosen based on the gradient noise, similarly to past studies, for example, in [84]. In this paper, we employ a batch size of 512. Overall, we trained roughly 500 models, requiring a total of 260 h on an Nvidia A100-SXM4-40GB GPU. Each training configuration was run five times, and we chose the run that yielded the largest number of memorized vectors. Training for a given run was stopped once the number of memorized sequences failed to improve over multiple consecutive epochs.
Figure 10 reports the number of fully memorized sequences in MAC experiments performed on a single-layer network with H = 1 attention head, evaluated across different vector dimensions B. All experiments presented in this study are conducted using a fixed library of 16,000 sequences. The core model employs absolute positional embeddings for token representations, and the dimensionality of the attention head is set to d h = 128 , a configuration that is common in several open-source LLM models.
Figure 11 shows the measured and predicted capacity values for various transformer model architectures in the reported experiments.

3.1.3. Empirical Capacity Model

The experimental results were employed to construct an empirical capacity model (ECM) for self-attention transformers. This model characterizes how the number of sequences that can be memorized depends on the transformer’s hyperparameters. In [20], the following model was proposed:
$C = \max\left(f(H, N)\, B,\ \alpha H + \beta\right)$
where α and β are parameters that model the rise in memorized tokens with respect to H and f ( N , H ) is an approximation of the slope defined as
$f(N, H) = \frac{a}{N^{\,bH + c} + d} + e.$
The parameters a–e were learned from the data of the experiments described above. The change in slopes as a function of N is visualized in Figure 12. We observe that the slope decreases exponentially as N increases, which motivates using a generalized rational function with a power-law form in the denominator. The parameters of the one- and two-layer models derived by applying the ECM to the experimental data above are given in Table 2.
The graphical representation of how this ECM fits the data is shown in Figure 11, and the corresponding empirical measures are provided in Table 2. To preserve interpretability, the model was developed incrementally by analyzing the marginal contribution of each input variable and then approximating these effects with low-order algebraic expressions. This design yields a compact and explainable formulation with relatively few parameters. The slope-based formulation allows for a straightforward interpretation of both input variables. N captures the exponential decline in the rate at which sequences are memorized as a function of B, while H modulates how quickly this decline occurs, with larger values of H leading to a less steep reduction.
One way to benefit from the ECM is to use it to compare different architecture choices. Figure 13, for example, shows the total capacity of the model, for all data sequences with N = 64, as a function of the total number of coefficients. The plot indicates that performance improves markedly as the number of attention heads increases from H = 1 to 4. The memorization capacity of a four-head model with 2 M parameters, for example, is nearly two times higher than that of an H = 2 model (approximately 3300 versus 1600 memorized sequences). One can see that this is also in agreement with the theoretical prediction of $O(HN)$ given in [19].
The ECM also allows us to analyze the cost of introducing more parameters. To do this, we first introduce a metric that connects the parameter count to the model’s capacity. In particular, we define memory per parameter (MPP) as the number of tokens memorized per parameter in the model:
$\mathrm{MPP} = \frac{\max\left(f(H, N)\, B,\ \alpha H + \beta\right)}{\#\,\mathrm{parameters}}$
By holding a single hyperparameter fixed and averaging the MPP over the remaining ones, we gain insight into the cost of adding more parameters to the model. We observe that, according to the MPP, smaller networks make more efficient use of their parameters than larger networks. In addition, increasing the number of heads leads to a smaller reduction in efficiency than increasing N. Therefore, the parameters associated with the heads are less computationally costly than those associated with N.
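As an illustration of how the ECM and the MPP metric could support such comparisons, the sketch below evaluates the capacity and MPP formulas above for candidate configurations; the slope function f, the constants, and the parameter counts are placeholders standing in for the fitted values in Table 2 rather than the actual fit.

```python
def ecm_capacity(B, H, N, f, alpha, beta):
    """Empirical Capacity Model: predicted number of memorizable sequences."""
    return max(f(H, N) * B, alpha * H + beta)

def memory_per_parameter(B, H, N, f, alpha, beta, n_params):
    """Memory per parameter (MPP): predicted capacity divided by the parameter count."""
    return ecm_capacity(B, H, N, f, alpha, beta) / n_params

# Placeholder slope function and constants (NOT the fitted ECM values from Table 2).
f = lambda H, N: 40.0 * H / N
for B, H, n_params in [(128, 2, 1_000_000), (128, 4, 2_000_000)]:
    cap = ecm_capacity(B, H, N=64, f=f, alpha=10.0, beta=5.0)
    mpp = memory_per_parameter(B, H, 64, f, 10.0, 5.0, n_params)
    print(f"H={H}: predicted capacity {cap:.0f}, MPP {mpp:.2e}")
```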

3.2. Memorization of Synthetic Sentences

In the second experiment described in Section 2.2, we used synthetic text content derived from the SNOMED KG.

3.2.1. Dataset Size Influence

Figure 14 illustrates the trend in capacity and accuracy across dataset sizes in the first setup. Smaller triplet datasets converge rapidly, with both accuracy and capacity rising steeply during the first 5–6 epochs and saturating around epoch 20. Larger datasets improve more slowly during the first 15 epochs but ultimately reach a substantially higher final capacity. In this setup, the transition becomes especially visible around 70,000 samples, after which the number of epochs required for near-complete memorization increases noticeably.
The final results in Table 3 confirm this trend. Although the 50,000-sample dataset reaches a higher early accuracy, its final capacity remains below the library size (46,811 ± 149). By contrast, the larger 100,000-sample dataset reaches 86,776 ± 2484 memorized samples. Thus, larger libraries improve attainable memorization, but they also make optimization slower and leave a non-trivial fraction of samples unmemorized under the present training budget. For smaller libraries, the reasons why some data remain unlearned, despite the available capacity, remain unclear.

3.2.2. Architectural Variations

In the second setup, the batch size was increased from 64 to 128, and the models reached higher capacities than in the first setup under otherwise comparable triplet-based conditions. Across the activations tested, Softmax showed the most stable training behavior, the highest average capacities, and fewer visible outliers throughout the training process (Figure 15). Notably, four-layer models with Softmax achieved capacities comparable to one- and two-layer models without an evident loss in convergence speed.
In contrast, ReLU and RReLU showed greater variability across depths and dataset sizes. Their final capacities tended to decrease as the number of layers increased, and the corresponding training curves exhibited less consistent growth during optimization. GELU followed a similar overall pattern, although in some configurations with larger datasets, it showed slightly faster improvement during the early training stages.
As in the first setup, the dataset size strongly affected the training dynamics. Larger datasets required a longer warm-up phase and initially achieved lower capacities than smaller datasets trained under the same architectural conditions. However, they continued to improve for longer and, in several cases, reached higher final capacities at the end of the training. Overall, Figure 15 indicates that the memorization behavior in this setup depends jointly on the activation function, architectural depth, and dataset size.
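To illustrate what replacing the activation means in this setting, the sketch below shows scaled dot-product attention with a configurable score activation. This is a simplified formulation written for illustration; it is not the training code used in the experiments, and in particular the non-Softmax variants are applied directly to the attention scores without renormalization.

```python
import torch
import torch.nn.functional as F

def attention_with_activation(q, k, v, activation="softmax"):
    """Scaled dot-product attention with a configurable score activation.

    q, k, v: tensors of shape (batch, heads, seq_len, d_head).
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if activation == "softmax":
        weights = F.softmax(scores, dim=-1)
    elif activation == "relu":
        weights = F.relu(scores)       # non-negative but unnormalized weights
    elif activation == "rrelu":
        weights = F.rrelu(scores)      # randomized leaky ReLU variant
    elif activation == "gelu":
        weights = F.gelu(scores)
    else:
        raise ValueError(f"unknown activation: {activation}")
    return weights @ v

# Example: run the same random inputs through two activation choices.
q = k = v = torch.randn(1, 2, 8, 16)
out_softmax = attention_with_activation(q, k, v, "softmax")
out_relu = attention_with_activation(q, k, v, "relu")
```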

3.2.3. Embedding-Size Influence Under a Fixed Budget

The third experiment further showed that, for these triplet datasets, the learning speed appears to depend more strongly on the embedding size than on the depth. Configurations with the same effective per-layer embedding size followed very similar trajectories even when the number of layers differed. For example, as shown in Figure 16, the 1-layer configuration with d_model = 16 converged at nearly the same rate as the 2-layer configuration whose effective per-layer d_model was also 16. Similar behavior was observed for embedding sizes 32 and 64.
The results suggest that the embedding size is the key factor that influences the learning speed, while the addition of layers without increasing the embedding size neither accelerates the convergence nor improves the final capacity. Moreover, additional layers often slow the training, as evidenced by the faster growth in accuracy of one-layer models (see Figure 16). As in the previous experiments, using a smaller embedding dimension slowed down learning even more. Ultimately, though, all setups seem to converge to a comparable accuracy, suggesting that the dataset’s simplicity makes embedding size the primary factor shaping the training dynamics.
The final capacities were also very close across most configurations. At dataset sizes of 1000, 10,000 and 50,000 samples, the one- and two-layer models achieved nearly identical MAC values. At 100,000 samples, however, a capacity “barrier” emerged. The two-layer configuration (effective per-layer d_model = 8, corresponding to the setting 16 under the fixed-budget rule) reached 85,935 ± 153, compared with approximately 88,200 for the remaining settings and 88,240 ± 62 for the corresponding one-layer model. This suggests that larger datasets, smaller embeddings, and deeper architectures may introduce limitations due to slower convergence or suboptimal capacity utilization.

3.2.4. Insights from Sequence Datasets

In the fourth setup, memorization was assessed by testing the model’s ability to predict each node in a sequence from the full preceding context of nodes and edges, rather than from a single triplet. The resulting prediction counts were 34,908, 85,972, and 167,965 for datasets of 20, 50, and 100 thousand sequences, respectively.
Compared to triplet datasets, models trained on sequences converged to near-perfect memorization in markedly fewer epochs, saturating within approximately 150 epochs (Figure 17). The richer relational context encoded in each sequence appears to have accelerated learning, though it also extended per-epoch training time and introduced larger capacity fluctuations across epochs, an expected consequence of the greater pattern complexity relative to triplets. Nonetheless, models demonstrated exceptional memorization, achieving 100% capacity on the 20 thousand sequence dataset and above 99.5% on the 50 and 100 thousand datasets.
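The prediction counts above follow from the way each sequence is split into next-node targets: every node except the first is predicted from the full preceding context. The sketch below illustrates this with a hypothetical SNOMED-style token sequence; it is not the exact preprocessing code used in the experiments.

```python
def node_prediction_examples(sequence):
    """Yield (context, target) pairs in which each node is predicted from the
    full preceding context of nodes and edges.

    `sequence` alternates node and edge tokens: nodes at even indices,
    edges at odd indices.
    """
    for i in range(2, len(sequence), 2):   # every node except the first
        yield sequence[:i], sequence[i]

example = ["Myocarditis", "finding site of", "Heart structure",
           "part of", "Thoracic cavity"]
for context, target in node_prediction_examples(example):
    print(context, "->", target)
```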
Consistent with earlier results, RReLU converged more slowly than Softmax, yet the two activations produced nearly the same final capacity in one- and two-layer configurations. On the 100 thousand sequence dataset, RReLU yielded 166,934 ± 243 (one layer) and 166,995 ± 118 (two layers), while Softmax reached 166,992 ± 110 and 166,985 ± 904, respectively. In deeper models (four layers), the picture shifted: RReLU achieved lower final capacity with greater variance (165,271 ± 1068) compared with Softmax (166,825 ± 319). This contrasts with previous findings [85], which reported that ReLU outperformed Softmax. The discrepancy points toward a sensitivity of activation-function effectiveness to both dataset structure and task type, warranting further investigation. However, even with increased sequence complexity, all models adapted rapidly and attained strong memorization.

3.3. Generalization and Abstraction

In this section, we present the results of training the models on the input sequences and evaluating them using the metrics introduced in Section 2.3. In addition to these metrics, we examine the training loss of the models, as well as specific attention heads selected based on their significance.

3.3.1. Results for the One Layer and Eight Heads

Figure 18 shows the training loss in four runs of the 1-layer, 8-head model. We can observe that all the runs fail to converge to a low training loss. This failure is also reflected in other metrics, in Figure 19, which shows that the model fails to solve the last digit prediction task. In contrast, we can see in Figure 20 that the model performs well in the variable matching task.

3.3.2. Results for the Two Layers and Four Heads

In Figure 21, we can observe that almost all runs converge to a training loss of 0.68, except the run in pink. This is supported by Figure 22, which shows that the pink run was the only model that did not achieve perfect accuracy in the last digit task. Figure 23 demonstrates how all the models were able to solve the variable matching task.
To illustrate the model’s behavior concretely, we present an example from one model with the 2-layer, 4-head architecture, specifically the model represented by the purple run. The model’s predicted sequence and the ground-truth sequence are shown below:
Predicted sequence
[? 3 3 3 1 1 3 3 | 9 6 1 2 9 7 6 7 | 9 9 1 5 8 8 9 8 | 1 9 3 3 8 5 9 5 | 3 1 7 | 6 2 7 | 9 5 8 | 9 3 5]
Correct sequence
[3 3 1 1 7 7 3 7 | 6 6 2 2 7 7 6 7 | 9 9 5 5 8 8 9 8 | 9 9 3 3 5 5 9 5 | 3 1 7 | 6 2 7 | 9 5 8 | 9 3 5]
To better understand how the model performs abstract prediction, we visualize the attention patterns of one of the attention heads. Figure 24 shows that most tokens in the second, third, and fourth instantiations attend back to corresponding positions in the first instantiation when predicting the next token.

3.3.3. Results for the Three Layers and Two Heads

Figure 25 shows the training loss of different runs of the 3-layer, 2-head model. Although all runs converge to a training loss of 0.68, some reach it earlier than others, with the run in pink being the latest. We can also see a correspondence between Figure 25 and Figure 26: the number of steps it takes for the green and pink runs to reach the minimum training loss is close to the number of steps at which a sudden rise occurs in the last-digit accuracy metric. Figure 27 shows how the runs of this model architecture perform well in the variable matching task, similar to the other model architectures.
To further evaluate the 3-layer, 2-head architecture, we present an example from one of the runs, similar to what we did in the 2-layer, 4-head architecture model:
Predicted sequence
[? 3 3 1 3 3 1 3 | 7 6 4 2 5 7 6 7 | 4 9 7 5 6 8 9 8 | 7 9 6 3 1 5 9 5 | 3 1 7 | 6 2 7 | 9 5 8 | 9 3 5]
Correct sequence
[3 3 1 1 7 7 3 7 | 6 6 2 2 7 7 6 7 | 9 9 5 5 8 8 9 8 | 9 9 3 3 5 5 9 5 | 3 1 7 | 6 2 7 | 9 5 8 | 9 3 5]
Additionally, Figure 28 shows, for one of the runs, an attention head exhibiting a pattern that we will discuss in the next section.

3.4. Generalization in Algorithmic Tasks

In our investigation of transformer generalization on the integer addition task, we observe that the ability to extrapolate to sequences longer than those seen during training is highly sensitive to both the choice of positional encoding and the adopted data formatting strategy. Standard transformers—with conventional absolute positional encodings (APE)—often fail to generalize beyond the training range, typically achieving minimal length extension (approximately 1×) when evaluated on longer-digit addition problems [62]. In contrast, several modifications have been proposed to overcome this limitation.

3.4.1. Enhancing Generalization via Positional Encoding and Data Formatting

Recent work by [53] demonstrates that by carefully tailoring the data format (using a reversed digit order and explicit index hints [52]) and incorporating an expressive positional encoding—namely FIRE (a learned additive bias function) [78]—a standard transformer trained on addition tasks up to 40 digits can successfully generalize to 100-digit additions. This yields an extension ratio of approximately 2.5×. The study shows that the combination of reversed formatting (which aligns the computation order with the natural progression of carry propagation [52]) and index hints (which help the model pinpoint the relevant digit positions [71]) plays a crucial role in this improvement.
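As a concrete illustration of these formatting choices, the sketch below reverses the digit order and interleaves index hints for a single addition example. The hint vocabulary and exact layout are our own simplification; the formats used in [52,53] may differ in detail.

```python
def format_addition(a: int, b: int) -> str:
    """Format an addition example with reversed digit order and index hints.

    Reversal puts the least significant digit first, so carries propagate in
    the generation order; index hints (i0, i1, ...) mark digit positions.
    """
    def rev_hinted(n: int) -> str:
        digits = str(n)[::-1]            # least significant digit first
        return " ".join(f"i{k} {d}" for k, d in enumerate(digits))

    return f"{rev_hinted(a)} + {rev_hinted(b)} = {rev_hinted(a + b)}"

print(format_addition(157, 26))
# i0 7 i1 5 i2 1 + i0 6 i1 2 = i0 3 i1 8 i2 1
```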
Similarly, ref. [86] explores the use of relative positional encodings. Their findings indicate that for simple tasks such as addition, training on 5-digit numbers can lead to correct 15-digit computations—implying a roughly 3× extension. The relative positional framework appears to mitigate the dependency of the model on absolute token positions by focusing on the invariant relations between digits.

3.4.2. Leveraging Task Structure: Position Coupling and Structural Symmetry

Other approaches directly embed the inherent structure of the arithmetic task into the model’s design. Ref. [54] introduced the concept of position coupling, where digits of the same significance (for example, all least significant digits across operands) are assigned the same position identifier. This modification allows a 1-layer transformer, trained on addition problems with operands ranging from 1 to 30 digits, to generalize to problems with up to 200 digits—corresponding to an extension factor of approximately 6.67×. The theoretical analysis further shows that such coupled positional representations are necessary for solving the addition task over exponentially many digits.
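In a simplified form of our own, the coupling rule can be written as assigning the same position identifier to digits of equal significance in both operands and in the result; a short sketch is given below.

```python
def coupled_position_ids(a_digits, b_digits, sum_digits):
    """Assign shared position IDs to digits of equal significance.

    All digit lists are given least-significant first, so the ones digit of
    each number receives ID 1, the tens digit ID 2, and so on (ID 0 could be
    reserved for operator and separator tokens).
    """
    def ids(digits):
        return [k + 1 for k in range(len(digits))]

    return ids(a_digits), ids(b_digits), ids(sum_digits)

# 157 + 26 = 183, digits listed least-significant first:
print(coupled_position_ids([7, 5, 1], [6, 2], [3, 8, 1]))
# ([1, 2, 3], [1, 2], [1, 2, 3])
```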
In another line of work, ref. [87] proposed explicitly encoding the inherent structural symmetry of arithmetic problems. By modifying the number formatting and designing custom positional encodings that capture the right-to-left symmetry (i.e., aligning digits by their significance), their method enables a transformer trained on numbers with at most 5 digits to successfully perform 50-digit addition. This approach achieves an impressive 10× extension and emphasizes that when the task-specific structure is explicitly incorporated, the model can overcome the limitations of conventional encoding schemes.
Table 4 summarizes the key approaches discussed above along with their main mechanisms, the training range, and the resulting length extension factor.
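The extension factors in Table 4 can be read as the ratio between the longest length solved reliably at test time and the longest length seen in training. A small sketch of this bookkeeping, under the assumption that per-length exact-match accuracies are already available, is given below; the accuracy values are hypothetical.

```python
def length_extension_factor(per_length_accuracy, max_train_len, threshold=0.95):
    """Ratio of the longest reliably solved test length to the training length.

    `per_length_accuracy` maps operand length -> exact-match accuracy.
    A length counts as solved if its accuracy is at least `threshold`.
    """
    solved = [n for n, acc in per_length_accuracy.items() if acc >= threshold]
    if not solved:
        return 0.0
    return max(solved) / max_train_len

# Hypothetical accuracies for a model trained on up to 30-digit addition:
acc = {30: 1.00, 50: 0.99, 100: 0.98, 200: 0.97, 250: 0.40}
print(length_extension_factor(acc, max_train_len=30))   # 200 / 30 ≈ 6.67
```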

4. Discussion

The experiments reported in the previous section illustrate different ways in which the architectural parameters and the properties of training data influence the memory capacity and generalization in compact transformer systems.

4.1. Memorization of Random Sequences

The capacity of self-attention networks to memorize and generalize the training data can be modeled and characterized analytically or empirically, as in the current paper. Obviously, both approaches are useful to optimize the performance of machine learning models. Analytical characterization focusing on theoretical bounds helps to explain the role of different elements of the models and shows what capability, under idealized conditions, is possible. Back-of-the-envelope calculations already show that there is a large gap between the expected theoretical and practically attainable memorization capacity of large transformer models. One of the goals of the current work is to quantify this difference and in this way point out new opportunities in the future development of language modeling algorithms. Secondly, the goal was to quantify the performance of the current learning algorithms and show what type of optimization is possible in cases where the requirements of the use case, for example, in a healthcare application, can be explicitly defined.
The experiments relied on an autoregressive task in which the transformer is required to predict the next token, given a previously observed sequence of tokens. The count of correctly predicted sequences was used as a proxy for the model’s capacity. We trained models to characterize network capacity at multiple locations in the hyperparameter space. The outcomes from these models formed the foundation for a subsequent model designed to capture how each hyperparameter influences behavior. In particular, we combined these insights to derive the model presented in Equation (10). Owing to its simplicity and small number of parameters, this model is easy to interpret and achieves better performance than more complex, higher-order polynomial alternatives.
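In code, this capacity proxy amounts to counting the training sequences whose every next-token prediction is correct under greedy decoding. A minimal PyTorch sketch, assuming a causal language model that maps a token tensor directly to logits, is shown below.

```python
import torch

@torch.no_grad()
def count_memorized(model, sequences):
    """Count sequences for which every next-token prediction is correct.

    `sequences`: LongTensor of shape (num_sequences, seq_len).
    `model`    : causal LM returning logits of shape (batch, seq_len, vocab).
    """
    memorized = 0
    for seq in sequences:
        logits = model(seq.unsqueeze(0))          # (1, seq_len, vocab)
        preds = logits[0, :-1].argmax(dim=-1)     # greedy next-token predictions
        if torch.equal(preds, seq[1:]):
            memorized += 1
    return memorized
```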

4.2. Memorization of Synthetic Sentences

The experiments in this section investigated how decoder-only transformer architectures encode structured knowledge derived from a real-world medical ontology. Rather than focusing on generalization performance, our aim was to conduct a controlled, isolated study of memorization behavior, establishing a proof-of-concept framework that connects formal theoretical bounds to empirical measurement. The complete SNOMED KG contains more than a million relations, integrating broad areas of medicine, including substances, diseases, and anatomical structures. However, practical edge deployments, such as small transformers embedded in smart glasses or wearable health monitors, demand that models store only a carefully scoped subset of this knowledge. For instance, smart glasses used by a cardiac surgeon or a smartwatch offering personalized dietary guidance may need a domain-specific LLM capable of storing roughly 100 to 100,000 distinct items. As discussed in Kajitsuka and Sato [18], isolating memorization is a valid objective that reveals the maximum amount of information a transformer can dependably store under a given architectural configuration. Our methodology reflects this: we hold generalization and test-time reasoning constant and instead examine how dataset properties and architectural decisions jointly determine convergence behavior and storage capacity.
To ensure clear capacity measurement, tasks were selected such that correct memorization has a single, verifiable ground truth. Increasing complexity would introduce overlap between memorization and generalization, making interpretation less fair and direct.

4.2.1. Effect of Dataset Structure

Training set size produced a clear trade-off: smaller datasets led to faster convergence but lower capacity, while larger corpora required extended warm-up but reached substantially higher memorization capacity. Interestingly, larger sets initially led to lower capacities than smaller datasets under the same training conditions but continued to improve over a longer period. This pattern points to the presence of qualitatively distinct learning regimes whose dynamics are shaped by the interplay of dataset scale, network depth, and activation function choice. Beyond a certain size threshold, the training slowed significantly, suggesting the emergence of optimization bottlenecks. The fact that some samples remain unlearned despite sufficient nominal capacity indicated the possible influence of local minima or other gradient-level barriers.
Sequential data training consistently outperformed triplet-structured inputs, reaching near-complete memorization in substantially fewer epochs. Sequences improved learning by encoding relational structure and inter-element dependencies directly into the input, though this benefit came paired with elevated variance in the training signal, an observation that aligns with the findings of Ju et al. [88]. Extended traversal sequences may therefore offer further memorization gains in narrowly scoped clinical domains.
The complexity of the sequence datasets was controlled through BFS depth and edge count, enabling the resulting samples to reflect both fine-grained local transitions and broader global structure within the SNOMED graph (e.g., connections linking anatomical entities to associated clinical procedures), while avoiding trivially linear or purely synthetic patterns. Randomness was balanced with structural constraints such as bidirectional edge inclusion and node uniqueness, mirroring the directional reasoning chains characteristic of clinical knowledge (e.g., from presenting symptom through differential diagnosis to treatment decision).
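The sampling procedure can be pictured as a bounded breadth-first walk that records alternating node and edge tokens while enforcing node uniqueness. The sketch below uses a toy graph and is only an illustration of the idea, not the generator used to build the SNOMED-derived datasets.

```python
import random
from collections import deque

def sample_bfs_sequence(graph, start, max_nodes=5, seed=None):
    """Sample a node/edge token sequence by a bounded BFS from `start`.

    `graph` maps a node to a list of (edge_label, neighbor) pairs; each node
    is included at most once (node uniqueness).
    """
    rng = random.Random(seed)
    visited = {start}
    sequence = [start]
    queue = deque([start])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        neighbors = [(e, n) for e, n in graph.get(node, []) if n not in visited]
        rng.shuffle(neighbors)                    # randomness within structure
        for edge, neighbor in neighbors:
            if len(visited) >= max_nodes:
                break
            visited.add(neighbor)
            sequence += [edge, neighbor]
            queue.append(neighbor)
    return sequence

toy_graph = {
    "Chest pain": [("may indicate", "Myocardial infarction")],
    "Myocardial infarction": [("treated by", "Thrombolysis"),
                              ("finding site of", "Heart structure")],
}
print(sample_bfs_sequence(toy_graph, "Chest pain", seed=0))
```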

4.2.2. Architectural Influence

Among the architectural variables examined, embedding size was the primary determinant of both learning speed and capacity, whereas adding more layers often reduced performance. A plausible interpretation is that the triplet-based task presents a level of structural regularity that does not benefit from deeper hierarchical processing: additional layers introduce parameters without providing substantial benefits for capturing the relevant patterns. This is consistent with the findings that some transformer layers may be redundant and can be pruned without a major loss in performance [89]. While redundancy and pruning were not directly investigated here, the observed depth sensitivity suggests that such compression strategies may be promising for further optimization of compact memorization-oriented models.
When the dataset size increased, smaller embeddings frequently failed to saturate available capacity, particularly in multi-layer configurations, suggesting that widening the embedding space may be more beneficial than adding depth, at least in the context of structured, domain-specific memorization.
Across activation functions, Softmax yielded the most stable training trajectories and the highest capacity overall. ReLU and RReLU introduced considerably more variance, with notable performance degradation in deeper models, which is consistent with Paik and Choi [90], Chen and Ge [91]. In particular, both functions exhibited less consistent learning dynamics, including slower or less steady improvements in capacity during training, in line with observations by Fu et al. [92]. GELU followed a similar overall trend, although in some configurations it performed better during the early training stages on larger corpora. At the same time, our results differ from Shen et al. [85], whose experiments favored ReLU, suggesting that activation function effectiveness is sensitive to dataset structure, weight initialization, and the precise formulation of the learning objective.
Synthesizing across these experiments, shallow decoder-only architectures of one to two layers paired with wider embeddings appear to offer the most favorable balance between training efficiency and attainable memorization capacity for structured, domain-constrained tasks. This makes them a plausible design choice for compact, domain-specific deployments in which local processing and model size are important, edge health monitors and low-power clinical decision support tools being prime examples. It should be noted, however, that these conclusions are architectural rather than hardware-level: latency, energy consumption, quantization effects, and ASIC-level performance were not directly measured and remain important directions for future work.

4.3. Generalization and Abstraction

In this section, we discuss the results we presented in the previous section, focusing on the analysis of the different model architectures on the different metrics, in addition to the analysis of the predicted input sequence behavior and the attention heads.

4.3.1. Interpreting the One Layer and Eight Heads

The results of the 1-layer, 8-head architecture show a limitation of shallow transformer models. Across training runs, the model fails to perform well in the last-digit prediction task. However, it performs well in the variable matching task, which might be an indicator of how models with different architectures perform in generalization versus memorization tasks, as the variable matching task might require mainly memorization abilities. The model consistently learns how to remember variable assignments. However, it is unable to transfer the abstract template across the instantiations, which suggests that abstraction requires a hierarchical composition of representations that cannot be achieved with a single attention layer. Increasing the number of attention heads does not replace the necessity of having extra layers. This observation supports a key conclusion of our study: abstract in-context generalization requires architectural depth.

4.3.2. Interpreting the Two Layers and Four Heads

In contrast to the 1-layer model, the 2-layer model architecture consistently shows the ability to generalize abstract templates in-context. Most training runs reach near-perfect performance in the last digit prediction task. This performance is also reflected in the predicted sequence we demonstrated, where the model adjusts its predictions based on the information it receives. The model starts by predicting only the tokens it has already seen. In the second, third, and fourth instantiations, however, it starts predicting on the fly, producing the correct token at the correct position whenever it has sufficient information to do so. Similarly to the 1-layer model, this architecture performs well in the variable matching task almost instantly, which is also reflected in the predicted input sequence example we provided.

4.3.3. Interpreting the Three Layers and Two Heads

The 3-layer architecture stabilizes the abstraction behavior observed in the 2-layer model. Across training runs, the model reliably reaches near-perfect accuracy on both tasks. Compared to the 2-layer configuration, the generalization of abstract patterns emerges more consistently. As in the example of the predicted sequence from the 2-layer model, this model also predicts the correct token once enough information is present.

4.3.4. Abstraction Heads

Attention analysis across the successful 2-layer and 3-layer models, in Figure 24 and Figure 28, reveals the emergence of a specialized mechanism that supports abstract template transfer. This behavior can be seen visually as a diagonal pattern of attention looking back at the first or previous instantiation, which provides evidence that the model might be attending to those instantiations in an abstract way. We refer to this mechanism as an abstraction head. An abstraction head, AH, is an attention head that retrieves the abstract successor of a token by aligning positions across instantiations of the same template, rather than copying local token transitions.
This behavior is related to the concept of induction heads [93], which complete a pattern by extending previously observed token relationships, such as [A*][B*] … [A] → [B], where A* ≈ A and B* ≈ B. However, the abstraction head identified in this study operates at a higher abstraction level, focusing on pattern matching according to a template that was not available in the training phase.
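One simple way to quantify such a head is to measure how much attention mass each token places on the position-aligned token one instantiation back. The sketch below assumes instantiations of equal length laid out consecutively and is an illustrative diagnostic of our own, not the analysis code behind Figures 24 and 28.

```python
import torch

def abstraction_head_score(attn, block_len, n_blocks):
    """Average attention weight on the position-aligned token one block back.

    `attn` is a (seq_len, seq_len) attention matrix for a single head, where
    the sequence consists of `n_blocks` instantiations of length `block_len`.
    """
    scores = []
    for block in range(1, n_blocks):
        for pos in range(block_len):
            query = block * block_len + pos       # token in current instantiation
            key = (block - 1) * block_len + pos   # aligned token one block back
            scores.append(attn[query, key].item())
    return sum(scores) / len(scores)

# Toy example: a head that always attends exactly one block back scores 1.0.
block_len, n_blocks = 4, 3
seq_len = block_len * n_blocks
attn = torch.zeros(seq_len, seq_len)
for q in range(block_len, seq_len):
    attn[q, q - block_len] = 1.0
print(abstraction_head_score(attn, block_len, n_blocks))   # 1.0
```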

4.4. Generalization in Algorithmic Tasks

Collectively, these studies on generalization described in Section 2.4 indicate that while a standard transformer architecture is prone to overfitting the training length distribution, targeted modifications to positional encoding and data representation can dramatically boost length generalization. For example, incorporating task-specific modifications—such as reversed token orders, explicit index hints, and position coupling—has enabled models to generalize to lengths 2.5–6.67× beyond their training range. At the same time, our observations of the grokking phenomenon (i.e., a sudden phase change from memorization to generalization; see [94]) reveal that such improvements come with trade-offs. Many enhanced methods remain sensitive to initialization and training order, and the robustness of generalization varies substantially between different random seeds [53]. Together, these findings underscore the need for more work to improve stability and understanding of the precise mechanisms by which structural and representational modifications affect both memorization and generalization.
In summary, as illustrated throughout our analysis, this section underscores the evolving role of transformer models in advancing beyond pattern recognition to algorithmic reasoning. By exploring the generalization capabilities of transformers, this study distinguishes between length generalization and compositional generalization.
The survey highlights key methodologies that have been proposed to enhance transformer length generalization in the integer addition task, including advanced data formatting strategies and specialized positional encoding techniques. These approaches, such as position coupling and structural symmetry encoding, represent incremental advancements that progressively improve model performance in out-of-distribution settings.
Despite these advancements, significant research opportunities remain in improving the robustness and adaptability of transformers, particularly in scaling their generalization to broader, more complex tasks. Transformer-based reasoning is also extending into new domains, including large-scale symbolic computation and real-world problem-solving, requiring novel adaptations to handle dynamic and structured data effectively. This expansion highlights the growing practical implications of transformer generalization, drawing increasing attention from both academic and industry sectors.
Despite the comprehensive analysis presented in this section, several limitations must be acknowledged:
  • Task Scope: This section primarily focuses on controlled algorithmic tasks—such as integer addition—to illustrate the challenges of transformer generalization. Although these tasks provide clear insights into the mechanisms of length and compositional generalization, they may not fully capture the complexities encountered in real-world applications.
  • Emphasis on Length Generalization: Although our taxonomy differentiates between length and compositional generalization, the discussion and empirical focus have largely centered on length generalization. The dynamics of compositional generalization, particularly in the context of large-scale reasoning models, require further in-depth exploration.
  • Sensitivity to Experimental Settings: Many of the techniques reviewed, including specific data formatting strategies and positional encoding modifications, are sensitive to factors such as random initialization, training order, and hyperparameter choices. This sensitivity may limit the reproducibility and robustness of the improvements reported in diverse settings.
  • Evolving Landscape: The field of transformer research is rapidly evolving. New architectural innovations and training methodologies continue to emerge, which may not be fully captured in our current analysis. Future work will need to continuously update the survey framework to integrate these advances.

5. Conclusions

This work was inspired by the understanding that current LLM systems are very large and obviously superfluous for most practical tasks, especially in embedded and on-edge use cases and limited domains. The studies presented in this paper address the memorization, generalization, and architectural efficiency of transformer-based models, with a particular focus on their potential for on-edge and domain-specific applications. The literature and the results of this paper and similar studies suggest that Application-Specific Integrated Circuits (ASICs), when scaled and optimized for a target requirement of capacity and generalization, can lead to solutions that are orders of magnitude more efficient than large general-purpose LLMs running in programmable devices.
The actual gains of using application-specific ASIC designs in LLM systems are a subject of future study and are not detailed in this paper. The core LLM technology based on the transformer architecture [5] seems mature, and alternative architectures such as LSTM [95], state-space models [96], and xLSTM [97] do not currently reach similar performance and efficiency.
Our experiments demonstrate that the memorization capacity of transformers can be empirically modeled as a function of hyperparameters such as embedding size B, number of attention heads H, and sequence length N. The Empirical Capacity Model, ECM, provides a practical tool for estimating the storage potential of transformer architectures, enabling more efficient design choices for specific applications. The capacity scales with O(NH), aligning with theoretical bounds but also revealing that embedding size and activation functions (e.g., Softmax) often have a more significant impact than architectural depth in simpler tasks. The results point towards the feasibility of deploying compact, ASIC-optimized transformers for edge devices, such as wearables or clinical decision-support tools.
Although the ECM provides a useful heuristic, more work is needed to refine its predictive power for diverse data modalities, e.g., multimodal, sparse, or noisy datasets, and real-world applications in healthcare or robotics.
Beyond memorization capacity, our experiments reveal a qualitative distinction between storage and abstraction. Although shallow transformers can track symbolic correspondences, they fail to generalize abstract templates across instantiations. In contrast, multi-layer architectures show that once sufficient depth is present, the model develops a specialized attention mechanism, abstraction heads, which enable in-context symbolic generalization. This finding suggests that reasoning in transformers might not be a mere extension of memorization, but an emergent capability that requires architectural depth.
Several directions are worth pursuing in follow-up work. Covering the wider hyperparameter space, including layer count, would yield denser capacity estimates and clarify how depth interacts with the other variables studied here. Extending the dataset to include natural language content would move the capacity model closer to real deployment conditions and help check whether the conclusions hold outside synthetic settings. Principled guidelines for a priori hyperparameter selection, derived from the empirical capacity model, are another concrete next step.
We also examined how architectural configuration and dataset structure affect a transformer’s ability to store structured knowledge, using the SNOMED knowledge graph as a real-world benchmark. Key findings show that embedding size and activation function have more impact than the network depth, while larger datasets increase memorization capacity but require longer training. Triplet-structured data worked well in simpler models, whereas sequential data reached higher capacity ceilings but introduced more training instability. Layer-level efficiency, compression, and the boundary between memorization and generalization all remain open challenges. These are especially relevant for small transformers on smart devices, where models need to store specialized knowledge within tight computational budgets.
Furthermore, our examination of algorithmic tasks reveals a distinction between capacity and reasoning. We found that standard transformer architectures, while capable of memorizing large amounts of static data, struggle to achieve length generalization, typically failing to extrapolate to sequences longer than those seen during training (an extension factor of only 1×) [62]. However, our review demonstrates that this limitation is not absolute; targeted modifications such as position coupling [54], explicit structural symmetry [87], and reversed data formatting [53] can unlock the ability to generalize up to 10× beyond training lengths. This indicates that while embedding size drives memorization capacity, the model’s ability to perform algorithmic reasoning is heavily dependent on how positional information and data structure are encoded.
Although ASICs offer dramatic efficiency gains, their fixed architecture limits adaptability. Research should focus on dynamic circuit reconfiguration for multi-task edge devices, quantization-aware training to balance capacity and hardware constraints, and hybrid architectures (e.g., combining transformers with lightweight symbolic engines) for tasks requiring both memorization and reasoning.
The experiments with synthetic data make it possible to measure the capacity and generalization in an unambiguous way, which is notably challenging in the case of natural language. The current study could be extended in several ways to contain more realistic text representations and a more detailed analysis of the variability of the performance obtained in different training and evaluation conditions.
The ultimate goal is to develop systems that memorize efficiently, generalize robustly, and are sustainable in use. Future research could look into combining transformers with symbolic reasoning for abstract tasks, and into the use of neuromorphic circuit designs for energy-efficient learning and inference. This work may also benefit from a better understanding of how humans balance memorization and generalization. As the use cases of transformer architectures continue to evolve, their scalability, interpretability, and adaptability become the focus of new developments. The most efficient LLM systems may not be the largest, but those that are precisely shaped by their purpose—like a key cut for a lock. How might we design transformers that are not just large, but not larger than they need to be?

Author Contributions

Conceptualization, A.H., A.C. and D.V.; Methodology, A.H. and D.V.; Software, A.H., A.A.-S., A.C., D.V. and M.P.; Writing—original draft, A.H., A.A.-S., A.C., D.V., M.P. and A.W.; Visualization, A.A.-S., A.C., D.V. and M.P.; Supervision, A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors wish to thank SURF (project EINF-12032) for support in cloud computing resources. Some of the content in this article, i.e., [21,22], has previously been presented at conferences of the Association for Computational Linguistics.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  2. Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic Evaluation of Language Models. Trans. Mach. Learn. Res. 2023. Available online: https://collaborate.princeton.edu/en/publications/holistic-evaluation-of-language-models/ (accessed on 1 March 2026).
  3. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision; Technical Report. 2023. Available online: https://proceedings.mlr.press/v202/radford23a.html (accessed on 1 March 2026).
  4. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A Survey of Visual Transformers. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 7478–7498. [Google Scholar] [CrossRef]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  6. Bahdanau, D.; Cho, K.H.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference Learning Representations, ICLR’15, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  7. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, ICLR’17, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  8. Gupta, B.; Ta, P.; Ram, K.; Sivaprakasam, M. Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs. In Proceedings of the 2024 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Natal, Brazil, 27–29 August 2024; pp. 1–8. [Google Scholar] [CrossRef]
  9. Wu, J.; Liang, X.; Bai, X.; Chen, Z. SurgBox: Agent-Driven Operating Room Sandbox with Surgery Copilot. In Proceedings of the 2024 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 15–18 December 2024. [Google Scholar]
  10. Balloccu, S.; Reiter, E.; Li, K.J.H.; Sargsyan, R.; Kumar, V.; Reforgiato Recupero, D.; Riboni, D.; Dusek, O. Ask the experts: Sourcing a high-quality nutrition counseling dataset through Human-AI collaboration. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 11519–11545. [Google Scholar] [CrossRef]
  11. Kang, B.J.; Lee, H.I.; Yoon, S.K.; Kim, Y.C.; Jeong, S.B.; O, S.J.; Kim, H. A survey of FPGA and ASIC designs for transformer inference acceleration and optimization. J. Syst. Archit. 2024, 155, 103247. [Google Scholar] [CrossRef]
  12. Wang, Y. Artificial-Intelligence integrated circuits: Comparison of GPU, FPGA and ASIC. Appl. Comput. Eng. 2023, 4, 99–104. [Google Scholar] [CrossRef]
  13. Kuon, I.; Rose, J. Measuring the Gap Between FPGAs and ASICs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2007, 26, 203–215. [Google Scholar] [CrossRef]
  14. Fasi, M.; Mikaitis, M. Algorithms for Stochastically Rounded Elementary Arithmetic Operations in IEEE 754 Floating-Point Arithmetic. IEEE Trans. Emerg. Top. Comput. 2021, 9, 1451–1466. [Google Scholar] [CrossRef]
  15. Arar, E.M.E.; Sohier, D.; de Oliveira Castro, P.; Petit, E. The Positive Effects of Stochastic Rounding in Numerical Algorithms. In Proceedings of the 2022 IEEE 29th Symposium on Computer Arithmetic (ARITH), Virtual, 12–14 September 2022; pp. 58–65. [Google Scholar] [CrossRef]
  16. Liu, F.; Chao, W.; Tan, N.; Liu, H. Bag of Tricks for Inference-time Computation of LLM Reasoning. In Proceedings of the Advances in Neural Information Processing Systems, Sydney, Australia, 6–12 December 2025; Volume 38. [Google Scholar]
  17. Kim, J.; Kim, M.; Mozafari, B. Provable Memorization Capacity of Transformers. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  18. Kajitsuka, T.; Sato, I. On the Optimal Memorization Capacity of Transformers. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  19. Mahdavi, S.; Liao, R.; Thrampoulidis, C. Memorization Capacity of Multi-Head Attention in Transformers. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  20. Härmä, A.; Pietrasik, M.; Wilbik, A. Empirical Capacity Model for Self-Attention Neural Networks. arXiv 2024, arXiv:2407.15425. [Google Scholar] [CrossRef]
  21. Changalidis, A.; Härmä, A. Capacity Matters: A Proof-of-Concept for Transformer Memorization on Real-World Data. In Proceedings of the First Workshop on Large Language Model Memorization (L2M2); Jia, R., Wallace, E., Huang, Y., Pimentel, T., Maini, P., Dankers, V., Wei, J., Lesci, P., Eds.; Association for Computational Linguistics: Vienna, Austria, 2025; pp. 227–238. [Google Scholar] [CrossRef]
  22. Al-Saeedi, A.; Härmä, A. Emergence of symbolic abstraction heads for in-context learning in large language models. In Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025; Liu, K., Song, Y., Han, Z., Sifa, R., He, S., Long, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2025; pp. 86–96. [Google Scholar]
  23. Xie, S.M.; Raghunathan, A.; Liang, P.; Ma, T. An Explanation of In-context Learning as Implicit Bayesian Inference. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  24. Nichani, E.; Damian, A.; Lee, J.D. How Transformers Learn Causal Structure with Gradient Descent. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  25. Vapnik, V.N.; Chervonenkis, A.Y. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory Probab. Its Appl. 1971, 16, 264–280. [Google Scholar] [CrossRef]
  26. Ramsauer, H.; Schäfl, B.; Lehner, J.; Seidl, P.; Widrich, M.; Adler, T.; Gruber, L.; Holzleitner, M.; Pavlović, M.; Sandve, G.K.; et al. Hopfield Networks is All You Need. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  27. Radhakrishnan, A.; Belkin, M.; Uhler, C. Overparameterized neural networks implement associative memory. Proc. Natl. Acad. Sci. USA 2020, 117, 27162–27170. [Google Scholar] [CrossRef] [PubMed]
  28. Bietti, A.; Cabannes, V.; Bouchacourt, D.; Jegou, H.; Bottou, L. Birth of a Transformer: A Memory Viewpoint. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 1560–1588. [Google Scholar]
  29. Schaeffer, R.; Zahedi, N.; Khona, M.; Pai, D.; Truong, S.; Du, Y.; Ostrow, M.; Chandra, S.; Carranza, A.; Fiete, I.R.; et al. Bridging Associative Memory and Probabilistic Modeling. arXiv 2024, arXiv:2402.10202. [Google Scholar] [CrossRef]
  30. Allen-Zhu, Z.; Li, Y. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  31. Allen-Zhu, Z.; Li, Y. Physics of Language Models: Part 3.2, Knowledge Manipulation. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  32. Allen-Zhu, Z.; Li, Y. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  33. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
  34. McEliece, R.; Posner, E.; Rodemich, E.; Venkatesh, S. The capacity of the Hopfield associative memory. IEEE Trans. Inf. Theory 1987, 33, 461–482. [Google Scholar] [CrossRef]
  35. Tang, M.; Salvatori, T.; Millidge, B.; Song, Y.; Lukasiewicz, T.; Bogacz, R. Recurrent predictive coding models for associative memory employing covariance learning. PLoS Comput. Biol. 2023, 19, e1010719. [Google Scholar] [CrossRef]
  36. Zhong, C.; Pedrycz, W.; Li, Z.; Wang, D.; Li, L. Fuzzy associative memories: A design through fuzzy clustering. Neurocomputing 2016, 173, 1154–1162. [Google Scholar] [CrossRef]
  37. Steinberg, J.; Sompolinsky, H. Associative memory of structured knowledge. Sci. Rep. 2022, 12, 21808. [Google Scholar] [CrossRef]
  38. Tavan, P.; Grubmüller, H.; Kühnel, H. Self-organization of associative memory and pattern classification: Recurrent signal processing on topological feature maps. Biol. Cybern. 1990, 64, 95–105. [Google Scholar] [CrossRef]
  39. Krauth, W.; Nadal, J.P.; Mezard, M. The roles of stability and symmetry in the dynamics of neural networks. J. Phys. A Math. Gen. 1988, 21, 2995. [Google Scholar] [CrossRef]
  40. Baum, E.B. On the capabilities of multilayer perceptrons. J. Complex. 1988, 4, 193–215. [Google Scholar] [CrossRef]
  41. Vardi, G.; Yehudai, G.; Shamir, O. On the Optimal Memorization Power of ReLU Neural Networks. In Proceedings of the International Conference on Learning Representations, ICLR 2022, Virtual, 25–29 April 2022. [Google Scholar]
  42. Geva, M.; Schuster, R.; Berant, J.; Levy, O. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; Association for Computational Linguistics: Punta Cana, Dominican Republic, 2021; pp. 5484–5495. [Google Scholar] [CrossRef]
  43. El-Sappagh, S.; Franda, F.; Ali, F.; Kwak, K.S. SNOMED CT standard ontology based on the ontology for general medical science. BMC Med. Inform. Decis. Mak. 2018, 18, 76. [Google Scholar] [CrossRef]
  44. Lamy, J.B. Owlready: Ontology-oriented programming in Python with automatic classification and high level constructs for biomedical ontologies. Artif. Intell. Med. 2017, 80, 11–28. [Google Scholar] [CrossRef] [PubMed]
  45. Shehzad, A.; Xia, F.; Abid, S.; Peng, C.; Yu, S.; Zhang, D.; Verspoor, K. Graph Transformers: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2026. [Google Scholar] [CrossRef] [PubMed]
  46. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NeurIPS 2017 Workshop on Autodiff, Long Beach, CA, USA, 8 December 2017. [Google Scholar]
  47. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
  48. Agarap, A.F. Deep Learning using Rectified Linear Units (ReLU). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  49. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  50. Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv 2015, arXiv:1505.00853. [Google Scholar] [CrossRef]
  51. Bridle, J.S. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In Proceedings of the NATO Neurocomputing; Springer: Berlin/Heidelberg, Germany, 1989. [Google Scholar]
  52. Zhou, H.; Bradley, A.; Littwin, E.; Razin, N.; Saremi, O.; Susskind, J.M.; Bengio, S.; Nakkiran, P. What Algorithms can Transformers Learn? A Study in Length Generalization. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  53. Zhou, Y.; Alon, U.; Chen, X.; Wang, X.; Agarwal, R.; Zhou, D. Transformers Can Achieve Length Generalization But Not Robustly. In Proceedings of the Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo) at ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  54. Cho, H.; Cha, J.; Awasthi, P.; Bhojanapalli, S.; Gupta, A.; Yun, C. Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37. [Google Scholar]
  55. Guan, X.; Zhang, L.L.; Liu, Y.; Shang, N.; Sun, Y.; Zhu, Y.; Yang, F.; Yang, M. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. In Proceedings of the 42nd International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025; Volume 267, pp. 20640–20661. [Google Scholar]
  56. Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.K.; Wu, Y.; et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv 2024, arXiv:2402.03300. [Google Scholar]
  57. Zhang, Z.; Zheng, C.; Wu, Y.; Zhang, B.; Lin, R.; Yu, B.; Liu, D.; Zhou, J.; Lin, J. The Lessons of Developing Process Reward Models in Mathematical Reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 10495–10516. [Google Scholar] [CrossRef]
  58. Nye, M.; Andreassen, A.J.; Gur-Ari, G.; Michalewski, H.; Austin, J.; Bieber, D.; Dohan, D.; Lewkowycz, A.; Bosma, M.; Luan, D.; et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models. In Proceedings of the Workshop on Deep Learning for Code (DL4C) at ICLR 2022, Virtually, 29 April 2022. [Google Scholar]
  59. Saxton, D.; Grefenstette, E.; Hill, F.; Kohli, P. Analysing Mathematical Reasoning Abilities of Neural Models. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  60. Anil, C.; Wu, Y.; Andreassen, A.; Lewkowycz, A.; Misra, V.; Ramasesh, V.; Slone, A.; Gur-Ari, G.; Dyer, E.; Neyshabur, B. Exploring Length Generalization in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 8–9 November 2022; Volume 35. [Google Scholar]
  61. Zhang, Y.; Backurs, A.; Bubeck, S.; Eldan, R.; Gunasekar, S.; Wagner, T. Unveiling Transformers with LEGO: A Synthetic Reasoning Task. arXiv 2022, arXiv:2206.04301. [Google Scholar]
  62. Kazemnejad, A.; Padhi, I.; Ramamurthy, K.N.; Das, P.; Reddy, S. The Impact of Positional Encoding on Length Generalization in Transformers. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
  63. Ruoss, A.; Delétang, G.; Genewein, T.; Grau-Moya, J.; Csordás, R.; Bennani, M.; Legg, S.; Veness, J. Randomized Positional Encodings Boost Length Generalization of Transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 1889–1903. [Google Scholar] [CrossRef]
  64. Wang, J.; Ji, T.; Wu, Y.; Yan, H.; Gui, T.; Zhang, Q.; Huang, X.; Wang, X. Length Generalization of Causal Transformers without Position Encoding. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 14024–14040. [Google Scholar] [CrossRef]
  65. Duan, S.; Shi, Y.; Xu, W. From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers. arXiv 2024, arXiv:2310.11984. [Google Scholar] [CrossRef]
  66. Dubois, Y.; Dagan, G.; Hupkes, D.; Bruni, E. Location Attention for Extrapolation to Longer Sequences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 403–413. [Google Scholar] [CrossRef]
  67. Kumar, T.; Ankner, Z.; Spector, B.F.; Bordelon, B.; Muennighoff, N.; Paul, M.; Pehlevan, C.; Ré, C.; Raghunathan, A. Scaling Laws for Precision. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  68. Motamedi, M.; Sakharnykh, N.; Kaldewey, T. A Data-Centric Approach for Training Deep Neural Networks with Less Data. In Proceedings of the NeurIPS 2021 Workshop on Data-Centric AI, Virtual, 14 December 2021. [Google Scholar]
  69. Lee, N.; Sreenivasan, K.; Lee, J.D.; Lee, K.; Papailiopoulos, D. Teaching Arithmetic to Small Transformers. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  70. Shen, R.; Bubeck, S.; Eldan, R.; Lee, Y.T.; Li, Y.; Zhang, Y. Positional Description Matters for Transformers Arithmetic. arXiv 2023, arXiv:2311.14737. [Google Scholar] [CrossRef]
  71. Olsson, C.; Elhage, N.; Nanda, N.; Joseph, N.; DasSarma, N.; Henighan, T.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; et al. In-context Learning and Induction Heads. arXiv 2022, arXiv:2209.11895. [Google Scholar] [CrossRef]
  72. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 464–468. [Google Scholar] [CrossRef]
  73. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  74. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open Pre-trained Transformer Language Models. arXiv 2022, arXiv:2205.01068. [Google Scholar] [CrossRef]
  75. Press, O.; Smith, N.A.; Lewis, M. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  76. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  77. Chi, T.C.; Fan, T.H.; Ramadge, P.J.; Rudnicky, A.I. KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35. [Google Scholar]
  78. Li, S.; You, C.; Guruganesh, G.; Ainslie, J.; Ontanon, S.; Zaheer, M.; Sanghai, S.; Yang, Y.; Kumar, S.; Bhojanapalli, S. Functional Interpolation for Relative Positions Improves Long Context Transformers. In Proceedings of the International Conference on Learning Representations, ICLR’24, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  79. Haviv, A.; Ram, O.; Press, O.; Izsak, P.; Levy, O. Transformer Language Models without Positional Encodings Still Learn Positional Information. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 1382–1390. [Google Scholar] [CrossRef]
  80. Su, J.; Lu, Y.; Pan, S.; Murtadha, A.; Wen, B.; Liu, Y. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
  81. Chen, S.; Wong, S.; Chen, L.; Tian, Y. Extending Context Window of Large Language Models via Positional Interpolation. arXiv 2023, arXiv:2306.15595. [Google Scholar] [CrossRef]
  82. Peng, B.; Quesnelle, J.; Fan, H.; Shippole, E. YaRN: Efficient Context Window Extension of Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  83. Wang, P. lucidrains/x-transformers. 2024. Available online: https://github.com/lucidrains/x-transformers (accessed on 1 March 2026).
  84. McCandlish, S.; Kaplan, J.; Amodei, D.; OpenAI Dota Team. An Empirical Model of Large-Batch Training. arXiv 2018, arXiv:1812.06162. [Google Scholar] [CrossRef]
  85. Shen, K.; Guo, J.; Tan, X.; Tang, S.; Wang, R.; Bian, J. A Study on ReLU and Softmax in Transformer. arXiv 2023, arXiv:2302.06461. [Google Scholar] [CrossRef]
  86. Jelassi, S.; d’Ascoli, S.; Domingo-Enrich, C.; Wu, Y.; Li, Y.; Charton, F. Length Generalization in Arithmetic Transformers. arXiv 2023, arXiv:2306.15400. [Google Scholar] [CrossRef]
  87. Sabbaghi, M.; Pappas, G.; Hassani, H.; Goel, S. Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks. arXiv 2024, arXiv:2406.01895. [Google Scholar] [CrossRef]
  88. Ju, Y.; Isac, A.; Nie, Y. ChunkFormer: Learning Long Time Series with Multi-stage Chunked Transformer. arXiv 2021, arXiv:2112.15087. [Google Scholar]
  89. He, S.; Sun, G.; Shen, Z.; Li, A. Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping. Trans. Mach. Learn. Res. 2026. Available online: https://openreview.net/forum?id=1I7PCbOPfe (accessed on 1 March 2026).
  90. Paik, I.; Choi, J. The Disharmony between BN and ReLU Causes Gradient Explosion, but is Offset by the Correlation between Activations. arXiv 2023, arXiv:2304.11692. [Google Scholar] [CrossRef]
  91. Chen, W.; Ge, H. Neural Characteristic Activation Analysis and Geometric Parameterization for ReLU Networks. In Proceedings of the NeurIPS’24, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  92. Fu, J.; Yang, T.; Wang, Y.; Lu, Y.; Zheng, N. Breaking through the Learning Plateaus of In-context Learning in Transformer. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; Volume 235, pp. 14207–14227. [Google Scholar]
  93. Elhage, N.; Nanda, N.; Olsson, C.; Henighan, T.; Joseph, N.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; Conerly, T.; et al. A Mathematical Framework for Transformer Circuits. Transform. Circuits Thread 2021, 1, 12. Available online: https://transformer-circuits.pub/2021/framework/index.html (accessed on 1 March 2026).
  94. Nanda, N.; Chan, L.; Lieberum, T.; Smith, J.; Steinhardt, J. Progress Measures for Grokking via Mechanistic Interpretability. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  95. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  96. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  97. Beck, M.; Pöppel, K.; Lippe, P.; Kurle, R.; Blies, P.M.; Klambauer, G.; Böck, S.; Hochreiter, S. xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference. In Proceedings of the 42nd International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025; pp. 3335–3357. [Google Scholar]
Figure 1. A typical transformer system with multiple self-attention and feedforward elements.
Figure 2. A comparison of the measured capacity of models with and without trainable FFN elements.
Figure 3. Counts of coefficients (millions) for various transformer network variants.
Figure 4. Diagrams representing the generation of triplets (A) and sequences (B) from KG.
Figure 5. Position coupling for a decimal integer addition task, presenting 653 + 49 = 702 using the correct input formats. Colors indicate the three token groups: the first operand together with the ‘+’ token (orange), the second operand (green), and the ‘=’ token together with the reversed sum (blue); gray denotes end-of-sequence padding tokens. Position IDs appear below each token, and the brackets beneath indicate the couplings between digits of matching significance. The initial position identifier ‘6’ is an arbitrarily selected number [54].
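To illustrate the coupling scheme of Figure 5, the following minimal Python sketch (our own simplification, not the exact format of [54]) assigns position IDs so that digits of equal significance in the two operands and in the reversed sum share an ID, starting from an arbitrary offset; the handling of the '+', '=', and padding tokens is only indicative.

import random

def position_coupled_example(a, b, start_id=None):
    # Illustrative sketch of position coupling for decimal addition.
    # Digits of equal significance in both operands and in the reversed sum
    # share the same position ID; the starting ID is drawn at random so the
    # model cannot rely on absolute positions.
    s = a + b
    width = max(len(str(a)), len(str(b)), len(str(s)))
    a_str, b_str = str(a).zfill(width), str(b).zfill(width)
    s_rev = str(s).zfill(width)[::-1]            # sum emitted least-significant digit first
    if start_id is None:
        start_id = random.randint(1, 16)         # arbitrary offset, e.g., 6 in Figure 5

    tokens, pos = [], []
    for i, d in enumerate(a_str):                # first operand, most-significant digit first
        tokens.append(d); pos.append(start_id + i)
    tokens.append('+'); pos.append(start_id + width)
    for i, d in enumerate(b_str):                # second operand reuses the same IDs, so
        tokens.append(d); pos.append(start_id + i)   # equal-significance digits are coupled
    tokens.append('='); pos.append(start_id + width)
    for i, d in enumerate(s_rev):                # reversed sum: digit of significance i couples
        tokens.append(d); pos.append(start_id + (width - 1 - i))   # back to the operand digits
    return tokens, pos

print(position_coupled_example(653, 49, start_id=6))
# (['6','5','3','+','0','4','9','=','2','0','7'], [6, 7, 8, 9, 6, 7, 8, 9, 8, 7, 6])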
Figure 6. Comparative analysis of PEs and data formats [52,53,54,62,69,70]: unlike most studies, which primarily explore APE or NoPE, the method of ref. [53] combines FIRE [78] with Randomized PE [63]. All approaches adopt a reversed data format, except for [54], which applies the reversed format only to the sum of the integers and incorporates zero-padding and position coupling. Ref. [70] further improves this format by introducing random space augmentation, while the methods of both [52] and [53] leverage index hints to enhance performance.
Figure 7. Comparison of capacity measurements for H = 4 , L = 1 using the MLS and MAC methods.
Figure 8. Number of epochs required to overlearn, i.e., shatter, the full dataset.
Figure 9. The influence of the training batch size on model performance, with H = 1 and B = 512 .
Figure 10. Results of our ECM for H = 1 . Solid curves show the calculated capacity values, while dashed curves indicate the corresponding predictions of our model.
Figure 11. Results of the proposed ECM. Solid lines indicate the measured capacity values, while dashed lines correspond to the capacities predicted by our model.
Figure 12. Slopes corresponding to the linear rise in C as a function of B with respect to H and N.
Figure 13. Estimated memorization capacity for 64-token sequences across different model sizes and architectures.
Figure 14. Training accuracy (top) and capacity (bottom) for Setup 1 with different triplet dataset sizes. (Left): first 30 epochs; (Right): full 500-epoch training run. Colors denote dataset size. Curves are averaged over 10 repeated runs per configuration; shaded regions indicate variability ( ± 2 σ ).
Figure 15. Training capacity for Setup 2 with different triplet dataset sizes, activation functions, and numbers of layers. (Left): first 30 epochs; (Right): full 1000-epoch training run. Colors denote layer-count and dataset-size combinations; line styles denote activation functions. Curves are averaged over 10 repeated runs per configuration; shaded regions indicate variability ( ± 2 σ ).
Figure 16. Training accuracy (top) and capacity (bottom) for Setup 3 with different triplet dataset sizes, embedding-size settings, and numbers of layers. (Left): first 50 epochs; (Right): full 500-epoch training run. Light and dark colors indicate 1 and 2 layers, respectively; color labels indicate the embedding-size setting used for the full model: green—16, blue—32, violet—64, and red—128. For the two-layer models, the effective per-layer embedding size is obtained by dividing this value by the number of layers. Line styles denote dataset size. Curves are averaged over 3 repeated runs per configuration; shaded regions indicate variability ( ± 2 σ ).
Figure 17. Training capacity for Setup 4 with different sequence dataset sizes, activation functions, and numbers of layers. (Left): first 30 epochs; (Right): full 400-epoch training run. Colors denote layer-count and dataset-size combinations; line styles denote activation functions. Curves are averaged over 3 repeated runs per configuration; shaded regions indicate variability ( ± 2 σ ).
Figure 18. Training loss of four independent runs of the 1-layer, 8-head model. Colored lines (blue, red, yellow, and gray) are used to distinguish the four runs.
Figure 19. Accuracy of the last digit in the fourth instance of a pattern across four independent runs of the 1-layer, 8-head model. Colored lines (blue, red, yellow, and gray) distinguish the four runs.
Figure 20. Accuracy of the variable matching task across four independent runs of the 1-layer, 8-head model. Colored lines (blue, red, yellow, and gray) distinguish the four runs.
Figure 21. Training loss of five independent runs of the 2-layer, 4-head model. The colored curves represent five separate runs.
Figure 22. Accuracy of the last digit task across five independent runs of the 2-layer, 4-head model. Colored lines distinguish the five runs.
Figure 23. Accuracy of the variable matching task across five independent runs of the 2-layer, 4-head model. Colored lines distinguish the five runs.
Figure 24. Attention head responsible for abstraction in the 2-layer, 4-head model.
Figure 25. Training loss of five independent runs of the 3-layer, 2-head model. Colored lines distinguish the five runs.
Figure 26. Accuracy of the last digit task across five independent runs of the 3-layer, 2-head model. Colored lines distinguish the five runs.
Figure 27. Accuracy of the variable matching task across five independent runs of the 3-layer, 2-head model. Colored lines distinguish the five runs.
Figure 28. Attention head responsible for abstraction in the 3-layer, 2-head model.
Table 1. Examples of the input and output of the algorithmic tasks.
Task Type | Question | Answer
Addition | Compute: 53,726 + 19,177 | 72,903
Polynomial Eval. | Evaluate x = 3 in 3x^0 + 1x^1 + 1x^2 mod 10 | 5
Sorting | Sort the numbers: 3, 1, 4, 1, 5 | 1, 1, 3, 4, 5
Summation | Compute (1 + 2 + 3 + 4 + 7) mod 10 | 7
Parity | Are the number of 1’s even in [1, 0, 0, 1, 1]? | No
LEGO | a = 1; b = a; c = +b; d = +c. Find c? | +1
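To make the task formats in Table 1 concrete, the following minimal Python sketch generates question–answer pairs of the summation, parity, and sorting type. It is an illustration only: the helper functions, value ranges, and exact phrasing are our own simplifications and are not the data-generation code used in the experiments.

import random

def make_summation(n_terms=5, mod=10):
    # e.g., "Compute (1 + 2 + 3 + 4 + 7) mod 10" -> "7"
    terms = [random.randint(0, 9) for _ in range(n_terms)]
    question = "Compute (" + " + ".join(map(str, terms)) + f") mod {mod}"
    answer = str(sum(terms) % mod)
    return question, answer

def make_parity(n_bits=5):
    # e.g., "Are the number of 1's even in [1, 0, 0, 1, 1]?" -> "No"
    bits = [random.randint(0, 1) for _ in range(n_bits)]
    question = f"Are the number of 1's even in {bits}?"
    answer = "Yes" if sum(bits) % 2 == 0 else "No"
    return question, answer

def make_sorting(n=5):
    # e.g., "Sort the numbers: 3, 1, 4, 1, 5" -> "1, 1, 3, 4, 5"
    nums = [random.randint(0, 9) for _ in range(n)]
    question = "Sort the numbers: " + ", ".join(map(str, nums))
    answer = ", ".join(map(str, sorted(nums)))
    return question, answer

print(make_summation())
print(make_parity())
print(make_sorting())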
Table 2. Parameters of the learned capacity models (slope and saturation) and corresponding goodness-of-fit measures (R², RMSE, MAPE).
Layers | a | b | c | d | e | α | β | R² | RMSE | MAPE
1 | 145.27 | −0.13 | 1.29 | 0.13 | 0.20 | 3762.70 | 8741.00 | 0.958 | 20.30 | 0.50
2 | 2.45 | −0.002 | 0.02 | −0.99 | −29.08 | 4413.10 | 14,787.00 | 0.882 | 293.89 | 0.83
Table 3. Final training outcomes for Setup 1 with different triplet dataset sizes. Values are averaged over 10 repeated runs per configuration and reported as mean ± 2σ.
Data Size | Accuracy, % | Capacity
50,000 | 93.62 ± 0.3 | 46,811 ± 149
60,000 | 92.42 ± 0.2 | 55,455 ± 126
70,000 | 91.1 ± 1.08 | 63,773 ± 756
80,000 | 89.63 ± 1.66 | 71,706 ± 1326
90,000 | 87.24 ± 1.66 | 78,517 ± 2173
100,000 | 86.78 ± 2.42 | 86,776 ± 2484
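A consistency check on Table 3: in every row, the reported capacity closely matches the product of the final accuracy and the dataset size, for example 0.9362 × 50,000 ≈ 46,810 against the reported 46,811 ± 149, which is consistent with the capacity being, in effect, the number of correctly memorized triplets.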
Table 4. A compact comparison of approaches for improving length generalization in arithmetic transformers. The table lists various methods along with their training ranges (in digits), the maximum sequence length to which they generalize, and the corresponding extension factors, indicating the multiplicative improvement over the training length.
Approach | Training Range | Generalizes to | Extension Factor
Standard APE (Baseline) | Up to 40 | ∼40 | 1–1.125×
FIRE + Reversed + Index Hints | Up to 40 | ∼100 | 2.5×
Relative Positional Encoding | Up to 5 | ∼15 | 3×
Position Coupling | 1–30 | ∼200 | 6.67×
Explicit Structural Symmetry | Up to 5 | ∼50 | 10×
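As the caption notes, the extension factor is simply the ratio of the length to which a method generalizes to the length on which it was trained; for example, position coupling trained on 1–30 digits and generalizing to roughly 200 digits gives 200/30 ≈ 6.67×, while explicit structural symmetry gives 50/5 = 10×.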