2.1. Theory and Explanation
Portions of this section are inspired by the formal style and explanatory structure of our prior work on explanation-guided optimization; however, all theoretical formulations presented here are newly developed for the proposed Cognitive Load–informed Transformer (CLT) attention mechanism. Unlike attribution-based optimization methods, which modulate parameter updates using feature-level importance scores, CLT directly alters the attention dynamics of a transformer by introducing a cognitively motivated, token-level constraint on attention allocation.
Throughout this section, we adopt the following notational conventions, consistent with standard practice in the deep learning literature:
Bold lowercase symbols (e.g., $\mathbf{x}$, $\mathbf{z}_i$, $\mathbf{q}$, $\mathbf{k}$, $\mathbf{v}$) denote vectors.
Bold uppercase symbols (e.g., $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$, $\mathbf{P}$) denote matrices or tensors.
Non-bold symbols (e.g., $\alpha$, $\beta$) denote scalars.
Subscripts $i$ and $j$ index token positions in a sequence of length $T$.
Subscripts $h$ index attention heads.
Superscripts $(\ell)$ index layers when layer-wise behavior is essential.
Unless otherwise stated, we omit the layer index (ℓ) in equations where the context is unambiguous, in order to improve clarity and readability.
2.1.1. Transformer Self-Attention
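Let an input sequence be represented as
$$\mathbf{x} = (x_1, x_2, \ldots, x_T),$$
where each $x_i \in \{1, \ldots, |\mathcal{V}|\}$ is a token index from a vocabulary of size $|\mathcal{V}|$. Tokens are mapped to embeddings using an embedding matrix $\mathbf{W}_{\mathrm{emb}} \in \mathbb{R}^{|\mathcal{V}| \times d}$, yielding
$$\mathbf{e}_i = \mathbf{W}_{\mathrm{emb}}[x_i] \in \mathbb{R}^{d}.$$
After adding positional encodings $\mathbf{p}_i \in \mathbb{R}^{d}$, the initial sequence representation becomes
$$\mathbf{z}_i^{(0)} = \mathbf{e}_i + \mathbf{p}_i, \quad i = 1, \ldots, T.$$
In a transformer encoder, representations are updated through stacked layers of multi-head self-attention and feed-forward networks. At a given layer, we collect token representations into a matrix $\mathbf{Z} \in \mathbb{R}^{T \times d}$. Linear projections produce queries, keys, and values:
$$\mathbf{Q} = \mathbf{Z}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{Z}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{Z}\mathbf{W}_V,$$
with learned weight matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$. These matrices are partitioned into $H$ heads of dimension $d_k = d / H$. For head $h$, scaled dot-product attention yields
$$\mathbf{P}_h = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_h \mathbf{K}_h^{\top}}{\sqrt{d_k}}\right),$$
where $\mathbf{P}_h \in \mathbb{R}^{T \times T}$ and each row $\mathbf{P}_h[i, :]$ represents the attention distribution emitted by token $i$. The attended values are
$$\mathbf{A}_h = \mathbf{P}_h \mathbf{V}_h.$$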
Outputs from all heads are concatenated and linearly transformed to obtain the updated token representations. Residual connections and layer normalization complete the layer update.
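As a minimal sketch of the single-head computation described above, the following pure-Python example computes row-wise softmax attention probabilities and the attended values. The function names and the toy inputs are illustrative assumptions and are not taken from the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V are lists of T vectors (lists of floats); Q and K share dimension d_k.
    Returns (P, A): the T x T attention probabilities and the attended values.
    """
    d_k = len(Q[0])
    scale = 1.0 / math.sqrt(d_k)
    # Row i of P is the attention distribution emitted by token i.
    P = [softmax([scale * sum(q * k for q, k in zip(q_i, k_j)) for k_j in K])
         for q_i in Q]
    # Each attended value is a P-weighted average of the value vectors.
    A = [[sum(P[i][j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
         for i in range(len(Q))]
    return P, A

# Toy inputs (hypothetical): T = 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
P, A = attention_head(Q, K, V)
```

Each row of `P` sums to one, which is the property that the budgeted variant introduced later deliberately relaxes.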
2.1.2. Cognitive Load as a Token-Level Computational Signal
In Cognitive Load Theory, effective information processing depends on managing limited cognitive resources. Translating this idea to transformer attention, we view each token as imposing a varying degree of computational demand on the model. Tokens that are ambiguous, rare, or difficult to integrate into the current context should consume more attentional resources, while less informative tokens should exert limited influence.
We formalize this intuition by introducing a token-level cognitive load scalar $c_i^{(\ell)}$, defined for each token $i$ at each layer $\ell$. Higher values of $c_i^{(\ell)}$ indicate greater processing demand. This load is constructed from three complementary signals:
Attention entropy, capturing how uncertain the model is about where token $i$ should attend.
Margin-based uncertainty, capturing how confident the model is in its current prediction.
Lexical importance, capturing the informativeness of token $i$ via inverse document frequency (IDF).
Each component is normalized and combined into a single scalar load value.
In the terminology of Cognitive Load Theory, the signals used in this formulation primarily approximate intrinsic cognitive load, which arises from the inherent difficulty of processing the information associated with a token. Attention entropy reflects how uncertain the model is about contextual relationships, margin-based uncertainty captures ambiguity in the predicted class, and IDF-based importance reflects the informational density or rarity of lexical items. Together, these signals estimate how demanding a token may be for the model to process within the current context. While extraneous cognitive load in human learning typically arises from suboptimal presentation or task design, the present formulation focuses on intrinsic processing demands within the sequence representation itself.
2.1.3. Components of Cognitive Load
- (a)
Attention entropy
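Let $\mathbf{P}_h^{(\ell)} \in \mathbb{R}^{T \times T}$ denote the attention probability matrix for head $h$ at layer $\ell$. For token $i$, the entropy of its outgoing attention distribution in head $h$ is
$$E_{h,i}^{(\ell)} = -\sum_{j=1}^{T} P_{h,ij}^{(\ell)} \log\!\left(P_{h,ij}^{(\ell)} + \epsilon\right),$$
where $\epsilon > 0$ ensures numerical stability. We average across the $H$ heads to obtain a single entropy value per token:
$$\bar{E}_i^{(\ell)} = \frac{1}{H} \sum_{h=1}^{H} E_{h,i}^{(\ell)}.$$
This quantity is large when token $i$ distributes attention diffusely and small when attention is concentrated. To ensure comparability across tokens, we normalize entropy within the sequence, yielding a normalized score $\tilde{E}_i^{(\ell)}$.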
- (b)
Margin-based uncertainty
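Let $\mathbf{o}^{(\ell)}$ denote the class logits produced from the pooled representation at layer $\ell$, and let $o_{(1)}^{(\ell)}$ and $o_{(2)}^{(\ell)}$ be the largest and second-largest logits, respectively. The classification margin is
$$m^{(\ell)} = o_{(1)}^{(\ell)} - o_{(2)}^{(\ell)}.$$
Small margins indicate high uncertainty. We convert this to a normalized uncertainty score $u^{(\ell)}$ that increases as the margin shrinks, where the normalization statistics of the margin are computed within the mini-batch or tracked as running statistics. This scalar is broadcast to all tokens:
$$u_i^{(\ell)} = u^{(\ell)}, \quad i = 1, \ldots, T.$$
This reflects the assumption that when the model is uncertain about the instance as a whole, all tokens contribute to cognitive load.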
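A small sketch of the margin signal follows. The margin itself is the top-1 minus top-2 logit as described above; the conversion to an uncertainty score is assumed here to be one minus the min-max normalized margin within the mini-batch, which is an illustrative choice rather than the confirmed formula. Function names are hypothetical.

```python
def margins(batch_logits):
    """Top-1 minus top-2 logit for every instance in the batch."""
    out = []
    for logits in batch_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        out.append(top1 - top2)
    return out

def margin_uncertainty(batch_logits, eps=1e-9):
    """ASSUMPTION: uncertainty = 1 - min-max normalized margin over the batch."""
    m = margins(batch_logits)
    lo, hi = min(m), max(m)
    return [1.0 - (v - lo) / (hi - lo + eps) for v in m]

def broadcast_to_tokens(score, seq_len):
    """Every token of an instance receives the same instance-level score."""
    return [score] * seq_len

# Two instances: a near-tie (uncertain) and a clear winner (confident).
batch_logits = [[2.0, 1.9, -1.0], [5.0, 0.0, -2.0]]
u = margin_uncertainty(batch_logits)
u_tokens = broadcast_to_tokens(u[0], seq_len=4)
```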
- (c)
IDF-based lexical importance
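For each vocabulary token $w$, the inverse document frequency $\mathrm{idf}(w)$ is computed from the training corpus using the number of documents $N$ and the document frequency $\mathrm{df}(w)$, i.e., the number of training documents that contain $w$. IDF values are normalized to $[0, 1]$ across the vocabulary. For token position $i$ with token index $x_i$, we define
$$I_i = \widetilde{\mathrm{idf}}(x_i),$$
where $\widetilde{\mathrm{idf}}$ denotes the normalized IDF.
Rare, information-dense tokens thus receive higher lexical importance.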
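The lexical-importance signal can be sketched as follows. The example assumes the unsmoothed form idf(w) = log(N / df(w)) and min-max normalization across the vocabulary; the exact IDF variant (for example, any smoothing term) is not specified here, so both choices are assumptions. The function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def normalized_idf(train_docs):
    """Map each token to an IDF score rescaled to [0, 1] over the vocabulary.

    train_docs: list of token lists drawn from the training split only.
    ASSUMPTION: idf(w) = log(N / df(w)) with no smoothing.
    """
    n_docs = len(train_docs)
    # Document frequency: count each token at most once per document.
    df = Counter(tok for doc in train_docs for tok in set(doc))
    idf = {tok: math.log(n_docs / count) for tok, count in df.items()}
    lo, hi = min(idf.values()), max(idf.values())
    span = (hi - lo) or 1.0  # guard against a constant-IDF vocabulary
    return {tok: (v - lo) / span for tok, v in idf.items()}

docs = [["the", "movie", "was", "great"],
        ["the", "plot", "was", "thin"],
        ["the", "acting", "was", "great"]]
scores = normalized_idf(docs)
```

Tokens appearing in every document (such as "the") map to 0, while tokens appearing in a single document map to 1.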
- (d)
Combined cognitive load
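The final cognitive load for token $i$ at layer $\ell$ is
$$c_i^{(\ell)} = w_E\, \tilde{E}_i^{(\ell)} + w_M\, u_i^{(\ell)} + w_I\, I_i,$$
where $\tilde{E}_i^{(\ell)}$, $u_i^{(\ell)}$, and $I_i$ are the normalized attention entropy, margin-based uncertainty, and lexical importance of token $i$, and $w_E, w_M, w_I \geq 0$ are fixed weights. In our experiments, we explore entropy-only and mixed variants of this formulation.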
2.1.4. Attention Budgets and CLT-Constrained Attention
To translate cognitive load into a constraint on attention allocation, we introduce a token-wise attention budget $b_i^{(\ell)}$. Let $b_{\min}$ and $b_{\max}$ denote lower and upper bounds on allowable attention mass. We define
$$b_i^{(\ell)} = b_{\min} + (b_{\max} - b_{\min})\, c_i^{(\ell)},$$
where $c_i^{(\ell)}$ is the normalized cognitive load of token $i$. Tokens with higher cognitive load are granted larger budgets, allowing them to exert greater influence on the attention mechanism.
Let $\mathbf{P}_h^{(\ell)}$ denote the original attention probabilities for head $h$. CLT rescales each row as
$$\tilde{P}_{h,ij}^{(\ell)} = b_i^{(\ell)}\, P_{h,ij}^{(\ell)}.$$
This ensures that the total outgoing attention mass from token $i$ satisfies
$$\sum_{j=1}^{T} \tilde{P}_{h,ij}^{(\ell)} = b_i^{(\ell)},$$
while preserving the relative distribution over keys. The scaled probabilities $\tilde{\mathbf{P}}_h^{(\ell)}$ are then used to compute attended values. Unlike standard attention normalization, the scaled attention weights are not renormalized to sum to one. Instead, the outgoing attention mass is intentionally bounded by the token-specific budget $b_i^{(\ell)}$, allowing the model to regulate how much influence each token can exert on the aggregated representation.
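The budgeting step can be sketched end to end: combine the three normalized signals into a load, map the load to a budget, and scale each attention row by its budget without renormalizing. The linear load-to-budget map and the clipping of the load to [0, 1] are assumptions made for illustration, as are the function names and the numeric values.

```python
def combine_load(entropy, uncertainty, idf, w_e, w_m, w_i):
    """Weighted sum of the three normalized per-token signals."""
    return [w_e * e + w_m * u + w_i * s
            for e, u, s in zip(entropy, uncertainty, idf)]

def budgets(load, b_min, b_max):
    """ASSUMPTION: linear map from load in [0, 1] to [b_min, b_max]."""
    return [b_min + (b_max - b_min) * min(max(c, 0.0), 1.0) for c in load]

def rescale_rows(P, b):
    """Scale row i of the attention matrix by b[i]; rows are not renormalized."""
    return [[b_i * p for p in row] for b_i, row in zip(b, P)]

# Two tokens with hypothetical signals and weights (0.4, 0.4, 0.2).
load = combine_load([0.0, 1.0], [0.5, 0.5], [1.0, 0.0], 0.4, 0.4, 0.2)
b = budgets(load, b_min=0.3, b_max=1.0)
P = [[0.25, 0.75], [0.5, 0.5]]
P_scaled = rescale_rows(P, b)
```

After rescaling, each row sums to its budget rather than to one, while the ratios between entries within a row are unchanged.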
2.1.5. Training Integration of CLT-Constrained Attention
The CLT mechanism is fully differentiable and integrated directly into the forward pass. During training, cognitive load and budgets are recomputed at each layer and iteration, allowing attention patterns to adapt dynamically as representations evolve. Importantly, CLT modifies only attention probabilities, leaving the optimizer, loss function, and model architecture unchanged.
By constraining attention through cognitively motivated budgets, CLT encourages the model to allocate representational capacity selectively, stabilizing attention entropy and improving calibration without increasing computational complexity.
This formulation provides a principled bridge between Cognitive Load Theory and transformer attention. In contrast to architectural efficiency methods or post hoc interpretability techniques, CLT introduces an explicit, interpretable resource constraint that shapes attention behavior throughout training and inference.
2.2. Methodology
This section describes the experimental protocol used to evaluate the proposed Cognitive Load-Informed Transformer (CLT) attention mechanism. While the underlying training and evaluation pipeline follows standard practices for transformer-based text classification, it is modified to incorporate CLT-specific attention budgeting within the model’s forward pass. The methodology is designed to isolate the effects of cognitive load–based attention constraints while maintaining consistency across datasets, model configurations, and optimization settings.
To ensure rigorous and reproducible evaluation, all experiments are conducted using a unified workflow that includes controlled data preprocessing, a consistent model architecture, systematic hyperparameter tuning, and repeated trials with randomized data splits. CLT is evaluated against a baseline transformer model trained without cognitive load constraints, allowing direct assessment of its impact on performance, stability, and attention allocation behavior.
The following subsections describe the workflow, datasets, model architecture, optimization procedures, experimental setup, and computational complexity analysis in detail.
2.2.1. Workflow Overview
To provide a comprehensive view of the experimental pipeline, Figure 1 presents a structured overview of the training and evaluation workflow. The diagram summarizes the end-to-end process, including dataset preparation, model initialization, CLT-augmented training, validation-based early stopping, and final evaluation. While several components of the workflow follow standard transformer training practices, the figure highlights where cognitive load-based attention constraints are integrated into the model’s forward pass.
At the outer level, experiments are organized around a fixed set of predefined model variants, including a baseline transformer, a fixed-budget control, and multiple CLT configurations that differ in attention budget bounds and cognitive load weighting parameters. Each configuration is evaluated independently using repeated trials with different random seeds to account for variance due to initialization and data ordering.
Within each trial, standard preprocessing steps are applied prior to training, including dataset loading, tokenization, vocabulary construction, and computation of IDF statistics using the training split only. Datasets are partitioned into training, validation, and test subsets according to a consistent protocol to ensure comparability across experimental conditions.
Models are initialized using a unified transformer architecture and trained using the Adam optimizer with a cosine learning rate schedule and warmup. During training, each mini-batch undergoes a forward pass through the transformer encoder. When CLT is enabled, token-level cognitive load is computed dynamically within each attention layer using attention entropy, prediction margin uncertainty, and IDF-based lexical importance. These load estimates are mapped to token-specific attention budgets, which constrain the outgoing attention distributions prior to value aggregation. When CLT is disabled, the model reverts to standard multi-head self-attention.
After each training epoch, model performance is evaluated on a held-out validation set. Early stopping is applied based on validation loss to prevent overfitting, and the best-performing model state within each trial is retained. Once training concludes, the selected model is evaluated on the test set, where both task-level performance metrics (e.g., accuracy, loss, and runtime) and CLT-specific allocation metrics (e.g., attention entropy and budget utilization statistics) are computed.
This workflow ensures that all comparisons between baseline and CLT-augmented models are conducted under identical experimental conditions, with differences in performance attributable solely to the inclusion of cognitive load-informed attention constraints.
2.2.2. Data Preprocessing
To evaluate the proposed cognitive load–aware attention mechanism across heterogeneous natural language classification tasks, experiments are conducted on four benchmark datasets: IMDB, AG News, SST-2, and DBpedia. These datasets vary substantially in document length, number of classes, and semantic structure, enabling systematic evaluation of CLT-based attention under diverse linguistic and classification conditions.
To ensure methodological consistency across datasets, a unified partitioning strategy is adopted for IMDB, AG News, and DBpedia. For these datasets, the originally provided training and test splits are first merged into a single corpus and then reshuffled using a fixed random seed. The shuffled corpus is partitioned into 70% training, 15% validation, and 15% test subsets. The validation subset is used exclusively for early stopping during model training, while the test subset is reserved strictly for final performance evaluation. This reshuffling and repartitioning procedure ensures that all datasets are evaluated under a consistent training-to-validation-to-test ratio, eliminating discrepancies introduced by dataset-specific predefined splits and enabling direct comparability of results across tasks. All reshuffling operations are controlled by fixed seeds, and each experiment is repeated across ten seeds to ensure reproducibility and statistical robustness.
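The merge, reshuffle, and repartition procedure can be sketched as below. The function name is hypothetical, and integer arithmetic is used for the split sizes so that rounding in floating point cannot shift a boundary; how the original pipeline rounds split sizes is not stated.

```python
import random

def split_70_15_15(examples, seed):
    """Shuffle a merged corpus with a fixed seed and cut it 70/15/15."""
    rng = random.Random(seed)       # local RNG keeps the split reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder goes to the test subset
    return train, val, test

# Merged corpus stand-in: 100 example identifiers.
train, val, test = split_70_15_15(range(100), seed=0)
```

Calling the function again with the same seed reproduces the same partition, while a different seed yields a different one, which is what repeating each experiment across seeds relies on.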
The SST-2 dataset requires a different handling procedure due to its distribution within the GLUE benchmark framework. The official GLUE test split does not provide publicly available labels, preventing direct evaluation under the unified reshuffling protocol used for the other datasets. Consequently, the official GLUE training split is used for model training, and the official GLUE validation split is used for final evaluation. No merging or reshuffling of predefined splits is performed for SST-2. This deviation from the unified partitioning strategy is necessary to preserve compatibility with the GLUE benchmark structure while maintaining strict separation between training and evaluation data.
Across all datasets, identical preprocessing steps are applied to ensure fair comparison between baseline and CLT-augmented models. Input text is lowercased and tokenized using a simple whitespace tokenizer. Vocabulary construction is performed using only the training subset of each dataset. Rare tokens are filtered according to a minimum frequency threshold, and an upper bound is imposed on vocabulary size to control memory usage. Sequences are truncated or padded to a fixed maximum length to enable batch processing, and tokens are converted to integer indices for embedding lookup. Inverse document frequency (IDF) statistics are computed exclusively from the training subset and normalized to the range [0, 1]. These IDF values are used only within CLT-enabled models as part of the cognitive load computation. Validation and test data are never used during vocabulary construction or IDF estimation, ensuring strict separation between training and evaluation information.
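A minimal sketch of these preprocessing steps is shown below. The special tokens, the default threshold values, and the function names are hypothetical placeholders; the actual minimum frequency, vocabulary cap, and maximum length are hyperparameters of the study and are not reproduced here.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_vocab(train_texts, min_freq=2, max_size=20000):
    """Build a token-to-index map from the training subset only."""
    counts = Counter(tok for text in train_texts for tok in text.lower().split())
    # Keep tokens meeting the frequency threshold, most frequent first.
    kept = [tok for tok, c in counts.most_common() if c >= min_freq]
    kept = kept[: max_size - 2]  # reserve two slots for PAD and UNK
    return {tok: idx for idx, tok in enumerate([PAD, UNK] + kept)}

def encode(text, vocab, max_len):
    """Lowercase, whitespace-tokenize, map to indices, then truncate or pad."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in text.lower().split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

vocab = build_vocab(["The movie was great", "the plot was thin"])
ids = encode("The ending was bad", vocab, max_len=6)
```

Tokens unseen in the training subset, or filtered out as rare, map to the unknown index, so validation and test text never changes the vocabulary.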
This unified preprocessing and evaluation protocol ensures that performance differences between models arise solely from differences in attention allocation mechanisms rather than from inconsistencies in data handling or partitioning.
2.2.3. Model Architecture
All experiments in this study employ a unified transformer-based text classification architecture, designed to isolate the effects of cognitive load–aware attention from confounding architectural variations. The same backbone architecture is used across all datasets and experimental conditions, with differences arising only from the attention mechanism configuration (baseline vs. CLT-enabled).
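Each input document is represented as a sequence of token indices
$$\mathbf{x} = (x_1, x_2, \ldots, x_T),$$
where $T$ denotes the fixed maximum sequence length. Tokens are mapped to dense embeddings using a learnable embedding matrix, producing an embedded sequence
$$(\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_T), \quad \mathbf{e}_i \in \mathbb{R}^{d},$$
where $d$ is the model dimensionality.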
To preserve token order information, sinusoidal positional encodings are added to the token embeddings prior to entering the transformer encoder stack. Positional encodings are implemented as fixed, non-trainable functions and are added in a gradient-safe manner to ensure compatibility with CLT-modified attention dynamics.
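For reference, fixed sinusoidal encodings can be generated as in the sketch below. It assumes the standard formulation with base 10000, in which even dimensions use sine and odd dimensions use cosine; the source states only that the encodings are sinusoidal and non-trainable, so the exact variant is an assumption, and the function name is hypothetical.

```python
import math

def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of fixed positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, d_model, 2):
            # Dimensions k and k + 1 share one frequency.
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe

pe = sinusoidal_encoding(max_len=8, d_model=4)
```

Because the table depends only on position and dimension, it can be computed once and added to the embeddings without contributing trainable parameters.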
- 2.
Transformer Encoder Stack
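The encoder consists of a stack of identical transformer blocks. Each block comprises a multi-head self-attention sublayer followed by a position-wise feedforward sublayer.
Residual connections and layer normalization are applied around each sublayer in the standard transformer configuration. For an input sequence $\mathbf{Z}^{(\ell-1)}$ to layer $\ell$, the block computes:
$$\tilde{\mathbf{Z}}^{(\ell)} = \mathrm{LN}\!\left(\mathbf{Z}^{(\ell-1)} + \mathrm{MHA}\!\left(\mathbf{Z}^{(\ell-1)}\right)\right),$$
$$\mathbf{Z}^{(\ell)} = \mathrm{LN}\!\left(\tilde{\mathbf{Z}}^{(\ell)} + \mathrm{FFN}\!\left(\tilde{\mathbf{Z}}^{(\ell)}\right)\right),$$
where LN denotes layer normalization, MHA denotes multi-head self-attention, and FFN denotes the feedforward network.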
The feedforward sublayer consists of two linear transformations with a ReLU activation and dropout applied between them. Dropout is applied throughout the network to mitigate overfitting.
- 3.
Baseline vs. CLT-Enabled Attention
The baseline model employs standard scaled dot-product multi-head self-attention, where attention weights are computed using softmax normalization and are unconstrained beyond probabilistic normalization.
In contrast, CLT-enabled models replace the baseline attention mechanism with a budgeted multi-head attention module, as described in Section 2.1. Importantly, this modification:
preserves the dimensionality and interface of standard attention,
operates entirely within the attention probability space, and
does not alter the surrounding transformer structure.
At each transformer layer in CLT mode, attention probabilities are dynamically reweighted using token-level budget constraints derived from cognitive load signals. These constraints are applied after softmax normalization and before value aggregation, ensuring compatibility with gradient-based optimization.
Aside from this attention modification, all other architectural components—including embeddings, feedforward layers, normalization, and residual connections—remain unchanged between baseline and CLT configurations.
- 4.
Sequence Pooling and Classification Head
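Following the final transformer layer, token representations are aggregated using mean pooling across the sequence dimension:
$$\bar{\mathbf{z}} = \frac{1}{T} \sum_{i=1}^{T} \mathbf{z}_i,$$
where $\mathbf{z}_i$ is the representation of token $i$ after the final layer. The pooled representation is passed through a dropout layer and a linear classification head to produce output logits:
$$\mathbf{o} = \mathbf{W}_c\, \mathrm{Dropout}(\bar{\mathbf{z}}) + \mathbf{b}_c,$$
where the dimensionality of $\mathbf{o}$ corresponds to the number of target classes.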
For multi-class datasets (AG News, DBpedia), logits are converted to probabilities using the softmax function. For binary classification tasks (IMDB, SST-2), softmax is applied over two output units to maintain a consistent probabilistic interface across experiments.
- 5.
Architectural Consistency Across Experiments
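All datasets share the same model depth, hidden dimensionality, number of attention heads, and feedforward configuration. The only dataset-dependent variations are the vocabulary, which is built from each dataset's training subset and determines the size of the embedding matrix, and the number of output units in the classification head, which matches the number of target classes.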
This architectural consistency ensures that observed performance differences arise from attention behavior and training dynamics, rather than from changes in model capacity or representational power.
All models are trained from randomly initialized parameters and do not leverage pretrained transformer weights or pretrained subword tokenization schemes. While pretrained language models typically achieve substantially higher absolute benchmark performance, the present architecture is intentionally trained from scratch to maintain strict control over representational capacity and to isolate the effects of attention allocation mechanisms. Consequently, absolute accuracy values are not intended to compete with state-of-the-art pretrained systems, but rather to provide a consistent basis for comparative analysis between baseline and CLT-enabled configurations.
2.2.4. Attention Mechanism Variants
All models evaluated in this study share an identical training protocol, model architecture, optimizer configuration, and learning rate schedule. The Adam optimizer is used uniformly across all experiments and is not treated as a variable of interest. Consequently, differences in performance can be attributed exclusively to how attention is allocated within the transformer encoder.
We evaluate a baseline transformer model and a family of Cognitive Load Theory (CLT)-based attention mechanisms. These variants differ only in the constraints applied to attention probabilities during the forward pass, enabling a controlled investigation of cognitively informed attention allocation.
The baseline model employs standard scaled dot-product multi-head self-attention without any cognitive load constraints or attention budgeting. Attention probabilities are computed using softmax normalization over scaled query–key dot products, allowing each token to distribute its full attention mass freely across the sequence.
This configuration represents conventional transformer attention behavior and serves as the primary reference model for all comparisons reported in this study.
- 2.
CLT-Based Adaptive Attention Models
The proposed CLT-based attention mechanisms introduce adaptive constraints on attention allocation by incorporating token-level cognitive load signals computed during the forward pass. For each token, a normalized cognitive load value is estimated and mapped to an allowable attention budget within a predefined range $[b_{\min}, b_{\max}]$. Attention probabilities are then softly constrained to respect this budget while preserving differentiability and compatibility with gradient-based training.
We evaluate several CLT configurations that differ in how cognitive load is computed and how restrictive the resulting attention budgets are:
- a.
CLT-E (Entropy-Based Load)
Cognitive load is computed exclusively from attention entropy, capturing uncertainty and dispersion in attention distributions. This variant represents the simplest form of adaptive CLT attention.
- b.
CLT-E-Tight (Entropy-Based Load with Tighter Budgets)
Identical to CLT-E, but with an increased minimum attention budget $b_{\min}$, producing tighter constraints and enabling analysis of budget sensitivity.
- c.
CLT-E/M/I (Full CLT Models)
Cognitive load is computed as a weighted combination of:
- ○
attention entropy (E)
- ○
prediction margin between the top two predicted class probabilities (M)
- ○
inverse document frequency-based lexical importance (I)
Multiple configurations of these weights and budget bounds are evaluated (e.g., CLT-B030-E40M40I20) to assess how different cognitive load components influence attention allocation and downstream performance.
Each configuration follows a naming convention of the form “CLT-Bxxx-ExxMxxIxx”, where $B$ denotes the base attention budget parameter, and $E$, $M$, and $I$ represent the relative weights assigned to entropy, margin-based uncertainty, and IDF-based lexical importance, respectively. The weights $E$, $M$, and $I$ correspond to normalized coefficients in the cognitive-load formulation, which are constrained to sum to one, whereas $B$ is an independent parameter that controls the allowable attention mass and is not part of the weighting distribution. For brevity, the “CLT-” prefix may be omitted in tables.
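To make the naming convention concrete, the sketch below parses a configuration name into its budget parameter and weights. It assumes that the digits encode hundredths (so that E40M40I20 corresponds to weights 0.4, 0.4, and 0.2, which sum to one, and B030 to a budget parameter of 0.30); this reading is an inference from the example name, and the function and key names are hypothetical.

```python
import re

# Optional "CLT-" prefix, three budget digits, two digits per weight.
_PATTERN = re.compile(r"^(?:CLT-)?B(\d{3})-E(\d{2})M(\d{2})I(\d{2})$")

def parse_variant(name):
    """Parse e.g. 'CLT-B030-E40M40I20' into a budget and weight dict."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    b, e, m, i = (int(group) for group in match.groups())
    if e + m + i != 100:
        raise ValueError("entropy, margin, and IDF weights must sum to one")
    # ASSUMPTION: digits are hundredths of the underlying value.
    return {
        "base_budget": b / 100,
        "weights": {"entropy": e / 100, "margin": m / 100, "idf": i / 100},
    }

config = parse_variant("CLT-B030-E40M40I20")
```

The weight check mirrors the stated constraint that the three coefficients sum to one, while the budget parameter is parsed independently of them.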
Across all CLT variants, the underlying transformer architecture, optimization procedure, and training schedule remain unchanged. CLT operates exclusively by modulating attention distributions within each layer, ensuring that performance differences arise from attention allocation behavior rather than changes in model capacity or optimization dynamics.
The evaluated attention mechanisms range from standard unconstrained transformer attention to adaptive, cognitively informed CLT-based attention budgeting. By holding all other experimental factors constant, this design enables a focused assessment of whether dynamic, token-level attention regulation grounded in cognitive load principles improves learning efficiency and generalization compared to conventional transformer attention.
2.2.5. Experimental Setup
All experiments follow a controlled and reproducible training protocol designed to isolate the effects of cognitive-load-based attention modulation. Unless otherwise stated, architectural configurations, optimization settings, and training procedures are held constant across all model variants.
In this study, models are trained from scratch using a lightweight transformer architecture and a simple whitespace tokenizer rather than relying on large pretrained language models. This design choice allows us to evaluate the proposed CLT-based attention mechanism in a simplified and controlled experimental setting where the behavior of the attention module can be examined more directly. While pretrained transformers typically achieve substantially higher absolute accuracy on these benchmarks, the objective of the present experiments is to analyze how cognitively motivated attention constraints influence learning dynamics and attention allocation patterns under comparable architectural and training conditions.
Each model is trained using the Adam optimizer with a fixed base learning rate and weight decay, with the values listed in Table 1. A cosine learning rate schedule with linear warm-up is applied, where the warm-up phase spans 5% of the total training steps. Gradient norms are clipped to a maximum value of 1.0 to stabilize optimization.
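The learning rate schedule can be sketched as a function of the training step. The 5% warm-up fraction follows the text; the decay to zero at the end of training and the placeholder base learning rate are assumptions, since the minimum learning rate is not stated. The function name is hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_frac=0.05):
    """Linear warm-up followed by cosine decay (ASSUMED to decay to zero)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warm-up phase.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# base_lr = 1.0 is a placeholder so the values read as multipliers.
schedule = [lr_at_step(s, total_steps=1000, base_lr=1.0) for s in range(1000)]
```

With 1000 total steps, the first 50 steps ramp up linearly, the peak is reached at the end of warm-up, and the rate then decreases monotonically.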
For IMDB, AG News, and DBpedia, data are randomly shuffled and partitioned into training (70%), validation (15%), and test (15%) splits using fixed random seeds to ensure reproducibility; SST-2 retains the official GLUE splits, as described in Section 2.2.2. Models are trained for up to 10 epochs, with early stopping applied based on validation loss using a patience of five epochs. The model state achieving the lowest validation loss is retained for final evaluation.
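The early stopping rule can be illustrated with a small simulation over a list of per-epoch validation losses. The epoch cap of 10 and patience of five follow the text; the function name, the strict-improvement criterion, and the loss values are illustrative assumptions.

```python
def run_with_early_stopping(val_losses, max_epochs=10, patience=5):
    """Return (best_epoch, best_loss, epochs_run) for a sequence of losses.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss, best_epoch, waited, epochs_run = float("inf"), -1, 0, 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        epochs_run = epoch + 1
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss, epochs_run

# Hypothetical validation curve that bottoms out at epoch index 2.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.71, 0.72, 0.5, 0.4]
result = run_with_early_stopping(losses)
```

In this example training halts after eight epochs, and the state from epoch index 2 would be the one retained for final evaluation.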
To account for stochastic variability arising from random initialization and data shuffling, each model configuration is evaluated across 10 independent random seeds. Reported results are aggregated across these runs and summarized using the mean and standard deviation.
For completeness and full reproducibility, Table 1 summarizes all architectural, optimization, preprocessing, and training hyperparameters used across datasets.
Model performance is assessed exclusively on the held-out test set using the following primary metrics:
Test loss, measured using cross-entropy loss.
Classification accuracy, computed as the proportion of correctly classified samples.
Training time, measured as total wall-clock runtime per trial.
Epochs executed, indicating the number of training epochs completed prior to early stopping.
In addition to performance metrics, we collect a set of attention allocation diagnostics for models employing CLT-based attention mechanisms. These include mean attention entropy, average allocated attention budget, the proportion of tokens operating at minimum and maximum budget constraints, and a rank-based correlation between cognitive load estimates and outgoing attention mass. Entropy is computed as the Shannon entropy of post-softmax attention distributions and is averaged across heads, tokens, batches, and random seeds. All experiments use a fixed maximum sequence length of $T$ tokens, listed in Table 1. Entropy values reported in tables are not normalized by $\log T$. Under a uniform attention distribution over $T$ positions, entropy equals $\log T$, its theoretical maximum. The baseline model consistently exhibits near-uniform attention allocation, leading to entropy values close to this maximum. The extremely small variance observed in aggregated entropy statistics reflects averaging across large numbers of tokens and trials rather than a degenerate or constant diagnostic. These diagnostics are used solely for post hoc analysis of attention behavior and do not influence training or model selection. In particular, they provide observable indicators of how attention is allocated in the baseline transformer, where allocation is determined only by similarity-based interactions, and allow comparison with CLT-enabled models that introduce explicit token-level resource constraints.
All models are implemented in PyTorch and trained using GPU acceleration when available. Random seeds are fixed across Python 3.12.7, NumPy 1.26.4, and PyTorch 2.5.1 to ensure deterministic behavior where possible.
2.2.6. Computational Complexity
The proposed Cognitive Load Theory (CLT)-based attention mechanisms are designed to operate entirely within the standard transformer attention pipeline, without introducing additional forward passes, auxiliary models, or post hoc explanation procedures. As a result, the overall computational complexity of training and inference remains dominated by the cost of conventional multi-head self-attention.
For a transformer layer with $H$ attention heads, sequence length $T$, and key/query dimensionality $d_k$, the standard scaled dot-product self-attention mechanism incurs a computational cost of
$$O(H\, T^{2}\, d_k),$$
which dominates both training and inference in transformer-based models.
The CLT mechanism augments attention by computing a token-level cognitive load signal during the forward pass and using this signal to softly constrain attention allocation. Importantly, all quantities required for cognitive load estimation are derived from intermediate values already computed as part of standard model execution. Specifically:
Attention entropy is computed directly from normalized attention weights, requiring $O(T)$ operations per token and head.
Prediction margin, defined as the difference between the top two class probabilities, is computed from model logits with constant-time overhead.
Lexical importance, when used, is obtained via constant-time lookup from a precomputed inverse document frequency (IDF) table.
The subsequent mapping from cognitive load to an allowable attention budget and the rescaling of attention probabilities involve only elementwise operations and normalization, incurring a cost that is linear in the sequence length for each token. Consequently, the additional overhead introduced by CLT per attention layer is $O(H\, T^{2})$ for $H$ heads and sequence length $T$, which is smaller than the cost of attention computation by a factor of the key/query dimensionality $d_k$.
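A rough operation count makes the comparison concrete. The sketch below counts multiply-adds up to constant factors, treating attention as two T x T x d_k products per head and the CLT overhead as two elementwise passes over each T x T attention matrix; the constant factors, function names, and example sizes are assumptions chosen for illustration.

```python
def attention_ops(n_heads, seq_len, d_k):
    """Approximate cost of the score and value products across heads."""
    return 2 * n_heads * seq_len * seq_len * d_k

def clt_overhead_ops(n_heads, seq_len):
    """Approximate elementwise cost of row entropies plus row rescaling."""
    return 2 * n_heads * seq_len * seq_len

# Hypothetical configuration: 4 heads, 256 tokens, 64-dimensional heads.
ratio = clt_overhead_ops(4, 256) / attention_ops(4, 256, 64)
```

Under these assumptions the ratio equals 1 / d_k regardless of sequence length or head count, which is why the overhead is small relative to attention for typical head dimensions.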
Crucially, CLT does not modify the structure of the attention matrix, alter the number of attention heads, or introduce auxiliary optimization steps. Therefore, both training and inference retain the same asymptotic complexity as the baseline transformer model. This contrasts with prior feature-attribution-guided optimization methods, such as SHAP-based approaches, which introduce additional computational stages or surrogate models to estimate feature importance.
Empirically, the computational efficiency of CLT-based models is reflected in the reported training time and early stopping behavior across experimental runs. These runtime measurements, reported alongside accuracy and loss metrics, provide practical confirmation that CLT attention constraints do not impose substantial computational overhead beyond standard transformer training.