A Cognitive Load Theory-Informed Attention Mechanism for Transformer-Based Text Classification

Graham, Jarrod; Sheng, Victor S.

doi:10.3390/math14071133

by Jarrod Graham and Victor S. Sheng *

Department of Computer Science, Texas Tech University, Lubbock, TX 79409, USA

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(7), 1133; https://doi.org/10.3390/math14071133

Submission received: 8 February 2026 / Revised: 24 March 2026 / Accepted: 26 March 2026 / Published: 28 March 2026

(This article belongs to the Special Issue AI, Machine Learning and Optimization)

Abstract

We propose a Cognitive Load Theory (CLT)-informed attention mechanism for transformer-based text classification. The proposed attention mechanism computes a per-token cognitive-load signal—derived from attention entropy, margin-based classification uncertainty, and optional inverse document frequency—and maps this signal to a learnable attention “budget” that scales outgoing attention mass during decoding. Unlike architectural efficiency techniques such as Multi-Query or Grouped-Query Attention, the CLT mechanism requires no structural modifications and introduces only modest per-step computational overhead while preserving full compatibility with standard transformer architectures. Experiments across four datasets (IMDB, AG News, SST-2, and DBpedia) show that CLT-informed attention achieves accuracy comparable to or exceeding a fixed-budget baseline while delivering consistently lower test loss, faster convergence to the best validation checkpoint, reduced attention entropy, and strong alignment between cognitive load and attention mass. Among all variants, an entropy-only load signal yields the most stable and consistent performance across datasets. These results demonstrate that lightweight, cognitively motivated constraints can structure transformer attention while maintaining or improving downstream classification performance.

Keywords: transformers; optimization; attention; performance evaluation

MSC: 68T05

1. Introduction

1.1. Background

The introduction of the Transformer architecture fundamentally reshaped natural language processing by replacing recurrence with parallelizable self-attention mechanisms [1]. Unlike recurrent neural networks (RNNs), which process tokens sequentially and struggle with long-range dependencies, transformers compute pairwise token interactions in parallel through a scaled dot-product attention mechanism. Each token is projected into query, key, and value vectors, and multiple attention heads enable the model to capture diverse syntactic and semantic relations within an input sequence [1]. The effectiveness of this architecture has been demonstrated across a wide range of NLP tasks, including machine translation, language modeling, and text classification [2].

While self-attention provides a powerful mechanism for encoding context, standard transformers allocate attention mass based solely on learned similarity scores and do not incorporate dynamic resource constraints. That is, the model does not adapt its attention behavior based on token-level difficulty, uncertainty, or contextual complexity. In biological and cognitive systems, however, attention is fundamentally a mechanism for resource allocation under conditions of limited processing capacity. Cognitive Load Theory (CLT) posits that learning and decision-making efficiency depend on the management of intrinsic, extraneous, and germane cognitive load, with excessive load impairing processing performance [3]. Cognitive models of attention similarly emphasize selective and context-sensitive allocation of limited cognitive resources in response to task demands [4].

In standard transformer architectures, attention weights are determined exclusively through softmax-normalized similarity scores between queries and keys. While this mechanism enables flexible context aggregation, it implicitly assumes that all tokens compete for attention without explicit resource limitations. As a result, attention mass may be distributed broadly across many tokens, even when only a subset of tokens is critical for the prediction task. Importantly, the mechanism does not distinguish between tokens that are easy to process and those that are semantically ambiguous, rare, or difficult to classify. In cognitive systems, however, attentional resources are limited and must be selectively allocated to stimuli that impose higher cognitive demands. This discrepancy motivates the introduction of mechanisms that regulate attention allocation according to token-level processing difficulty rather than relying solely on similarity-based interactions.

This contrast between uniform transformer attention and adaptive human attention suggests an opportunity to incorporate cognitively motivated constraints into neural attention mechanisms. The present work explores this idea by introducing a Cognitive Load Theory (CLT)-informed attention mechanism. In this framework, a token-level cognitive-load signal, derived from measures such as attention entropy and classification uncertainty, is mapped to a continuous attention budget that restricts or relaxes attention mass distribution during decoding. This design preserves the mathematical form of self-attention and introduces negligible computational overhead, while encouraging more structured and interpretable attention patterns akin to resource-regulated human cognition.

1.2. Related Work

Research on improving Transformer attention can be grouped into three main lines that are relevant to our CLT-informed attention budgets: (i) efficient and long-context attention, (ii) sparse or adaptive attention mechanisms, and (iii) cognitive-load-aware modeling and related adaptive computation strategies.

1.2.1. Efficient and Long-Context Attention

A large body of work focuses on reducing the quadratic cost of self-attention while preserving or improving downstream quality. Linformer projects keys and values into a low-rank space, proving that self-attention can often be well-approximated by a projection of fixed dimension independent of sequence length [5]. Performer replaces exact softmax attention with kernelized “FAVOR+” random feature approximations, yielding attention with linear complexity in sequence length and strong empirical performance [6]. Reformer combines locality-sensitive hashing and reversible layers to reduce both time and memory footprints [7]. For long documents, Longformer and BigBird use sparse patterns—sliding windows, global tokens, and random or block connections—to obtain theoretically expressive yet sub-quadratic attention [8,9].

These methods primarily target computational efficiency: they change the connectivity pattern or approximate the softmax kernel to lower complexity. By contrast, our CLT-informed mechanism keeps the standard dense self-attention structure but modulates attention probabilities through a token-wise “budget” derived from a scalar load estimate. As a result, we remain architecture-compatible with standard Transformers while introducing an explicit notion of limited attentional resources. Consequently, the CLT mechanism is orthogonal to efficient-attention architectures and could in principle be combined with approaches such as Linformer, Performer, or Longformer, which modify attention connectivity to improve computational scalability.

1.2.2. Sparse and Adaptive Attention Mechanisms

Another line of work replaces softmax with sparsity-inducing alternatives or dynamically tunes how much context each position is allowed to attend to. Sparse Transformers use fixed strides and local patterns to reduce the number of attention links while retaining good performance on generative tasks [10]. Adaptive attention span learns customized context windows for each head and layer, shrinking attention for positions that do not benefit from long-range context [11]. Beyond architectural sparsity, alternative normalizations such as sparsemax and entmax yield inherently sparse attention distributions, allowing models to focus on a small subset of tokens and improving interpretability and robustness in sequence-to-sequence tasks [12].

These approaches are “attention-aware” in that they shape how mass is distributed or how far it can spread, but they typically do not tie this to task difficulty or model uncertainty in an interpretable way. Our CLT-based formulation instead constructs a scalar load variable for each token from (i) attention entropy, (ii) the prediction margin between the top two classes, and (iii) inverse document frequency. The resulting budget controller is motivated by cognitive load theory and directly encodes the intuition that high-load tokens should consume more of the limited attention resource, whereas easy or redundant tokens should consume less.

1.2.3. Cognitive-Load-Aware Neural Modeling

Cognitive load theory has been widely used to design instructional materials and to analyze human task performance [3], and more recent work has explored data-driven estimation of mental workload using neural networks. For example, deep and convolutional architectures have been used to estimate cognitive workload from EEG, eye-tracking, and multimodal physiological signals, often incorporating attention modules to highlight informative time segments or channels [13]. Other studies examine cognitive workload in human–computer interaction and virtual reality, training deep models to classify low vs. high workload conditions from behavioral and physiological traces [14]. These systems use neural attention to measure or predict cognitive load, but do not use cognitive-load estimates to control how the model itself allocates internal attention.

In parallel, large-scale industrial LLMs have introduced sophisticated attention and routing schemes to make training and inference more efficient. Recent DeepSeek models employ Multi-Head Latent Attention (MLA) and Mixture-of-Experts routing to reduce memory bandwidth and FLOPs while maintaining competitive or superior performance to other open and closed models [15]. Although these designs also aim to allocate computation more selectively across tokens and experts, the allocation is driven by architectural and optimization considerations rather than an explicit cognitive-theoretic notion of load.

1.2.4. Positioning of Our Contribution

Taken together, existing work (i) redesigns attention to be cheaper or sparser, (ii) adaptively limits context length per head, or (iii) uses attention as a tool to estimate cognitive load from human data. To our knowledge, none of these approaches integrate a CLT-inspired, interpretable load estimate back into the attention mechanism of a Transformer to enforce token-wise “budgets” that depend jointly on attention entropy, classifier confidence, and lexical rarity. Our work therefore connects the efficient-attention and cognitive-load literatures: we retain the expressive power of standard self-attention while explicitly modeling limited attentional resources in a way that is grounded in cognitive load theory and implemented with simple, differentiable operations.

2. Materials and Methods

2.1. Theory and Explanation

Portions of this section are inspired by the formal style and explanatory structure of our prior work on explanation-guided optimization; however, all theoretical formulations presented here are newly developed for the proposed Cognitive Load–informed Transformer (CLT) attention mechanism. Unlike attribution-based optimization methods, which modulate parameter updates using feature-level importance scores, CLT directly alters the attention dynamics of a transformer by introducing a cognitively motivated, token-level constraint on attention allocation.

Throughout this section, we adopt the following notational conventions, consistent with standard practice in the deep learning literature:

Bold lowercase symbols (e.g., x, z_i, q, k, v) denote vectors.
Bold uppercase symbols (e.g., Q, K, V, P) denote matrices or tensors.
Non-bold symbols (e.g., α, β, $L_{i}$, $B_{i}$) denote scalars.
Subscripts $i \in \{1, \dots, S\}$ index token positions in a sequence of length $S$.
Subscripts $j \in \{1, \dots, H\}$ index attention heads.
Superscripts $(l)$ index layers when layer-wise behavior is essential.

Unless otherwise stated, we omit the layer index $(l)$ in equations where the context is unambiguous, in order to improve clarity and readability.

2.1.1. Transformer Self-Attention

Let an input sequence be represented as $x = (x_{1}, x_{2}, \dots, x_{S})$, where each $x_{i}$ is a token index from a vocabulary of size $V$. Tokens are mapped to embeddings using an embedding matrix $E \in \mathbb{R}^{V \times d}$, yielding

$$e_{i} = E_{x_{i}} \in \mathbb{R}^{d}$$

After adding positional encodings $p_{i} \in \mathbb{R}^{d}$, the initial sequence representation becomes

$$z_{i}^{(0)} = e_{i} + p_{i}$$

In a transformer encoder, representations are updated through stacked layers of multi-head self-attention and feed-forward networks. At a given layer, we collect token representations into a matrix $Z \in \mathbb{R}^{S \times d}$. Linear projections produce queries, keys, and values:

$$Q = Z W_{Q}, \quad K = Z W_{K}, \quad V = Z W_{V}$$

with learned weight matrices $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{d \times d}$. These matrices are partitioned into $H$ heads of dimension $d_{h} = d / H$. For head $j$, scaled dot-product attention yields

$$A_{j} = \frac{Q_{j} K_{j}^{\top}}{\sqrt{d_{h}}}, \quad P_{j} = \operatorname{softmax}(A_{j})$$

where $P_{j} \in \mathbb{R}^{S \times S}$ and each row $P_{j, i:}$ represents the attention distribution emitted by token $i$. The attended values are

$$H_{j} = P_{j} V_{j}$$

Outputs from all heads are concatenated and linearly transformed to obtain the updated token representations. Residual connections and layer normalization complete the layer update.
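As an illustration of the per-head computation above, the following NumPy sketch projects a toy matrix $Z$, splits it into $H$ heads, and returns the attention probabilities $P_{j}$ and attended values $H_{j}$. The function names, random inputs, and toy sizes are assumptions made for this example only; the output projection, residual connections, and layer normalization are omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_probs(Z, W_Q, W_K, W_V, H):
    """Return attention probabilities (H, S, S) and attended values (H, S, d_h)."""
    S, d = Z.shape
    d_h = d // H  # assumes d is divisible by H
    # Project, then split the feature axis into heads: (S, d) -> (H, S, d_h).
    split = lambda M: M.reshape(S, H, d_h).transpose(1, 0, 2)
    Q, K, V = split(Z @ W_Q), split(Z @ W_K), split(Z @ W_V)
    A = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)  # scaled scores, (H, S, S)
    P = softmax(A, axis=-1)                      # row i: distribution emitted by token i
    return P, P @ V

rng = np.random.default_rng(0)
S, d, H = 5, 8, 2  # toy sizes, chosen only for illustration
Z = rng.normal(size=(S, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
P, heads = multi_head_attention_probs(Z, W_Q, W_K, W_V, H)
```

Each row of `P[j]` sums to one, which is the property the budget mechanism in Section 2.1.4 later relaxes.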

2.1.2. Cognitive Load as a Token-Level Computational Signal

In Cognitive Load Theory, effective information processing depends on managing limited cognitive resources. Translating this idea to transformer attention, we view each token as imposing a varying degree of computational demand on the model. Tokens that are ambiguous, rare, or difficult to integrate into the current context should consume more attentional resources, while less informative tokens should exert limited influence.

We formalize this intuition by introducing a token-level cognitive load scalar $L_{i}^{(l)} \in [0, 1]$, defined at each layer $l$. Higher values of $L_{i}^{(l)}$ indicate greater processing demand. This load is constructed from three complementary signals:

Attention entropy, capturing how uncertain the model is about where token $i$ should attend.
Margin-based uncertainty, capturing how confident the model is in its current prediction.
Lexical importance, capturing the informativeness of token $i$ via inverse document frequency (IDF).

Each component is normalized and combined into a single scalar load value.

In the terminology of Cognitive Load Theory, the signals used in this formulation primarily approximate intrinsic cognitive load, which arises from the inherent difficulty of processing the information associated with a token. Attention entropy reflects how uncertain the model is about contextual relationships, margin-based uncertainty captures ambiguity in the predicted class, and IDF-based importance reflects the informational density or rarity of lexical items. Together, these signals estimate how demanding a token may be for the model to process within the current context. While extraneous cognitive load in human learning typically arises from suboptimal presentation or task design, the present formulation focuses on intrinsic processing demands within the sequence representation itself.

2.1.3. Components of Cognitive Load

(a): Attention entropy

Let $P_{j}^{(l)}$ denote the attention probability matrix for head $j$ at layer $l$. For token $i$, the entropy of its outgoing attention distribution in head $j$ is

$$H_{j, i}^{(l)} = - \sum_{k = 1}^{S} P_{j, i k}^{(l)} \log \left(P_{j, i k}^{(l)} + \varepsilon\right)$$

where $\varepsilon > 0$ ensures numerical stability. We average across heads to obtain a single entropy value per token:

$$\bar{H}_{i}^{(l)} = \frac{1}{H} \sum_{j = 1}^{H} H_{j, i}^{(l)}$$

This quantity is large when token $i$ distributes attention diffusely and small when attention is concentrated. To ensure comparability across tokens, we normalize entropy within the sequence:

$$H_{i}^{(l)} = \frac{\bar{H}_{i}^{(l)} - \min_{r} \bar{H}_{r}^{(l)}}{\max_{r} \bar{H}_{r}^{(l)} - \min_{r} \bar{H}_{r}^{(l)} + \varepsilon}$$
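The three entropy steps (per-head entropy, head averaging, and within-sequence min-max normalization) can be sketched in NumPy as follows. The function name, the value of `eps`, and the hand-built attention matrix are assumptions for illustration, not values taken from the experiments.

```python
import numpy as np

def normalized_attention_entropy(P, eps=1e-9):
    """P: (H, S, S) attention probabilities. Returns (S,) entropies scaled to about [0, 1]."""
    H_ji = -(P * np.log(P + eps)).sum(axis=-1)  # (H, S): entropy per head and token
    H_bar = H_ji.mean(axis=0)                   # (S,): average over heads
    return (H_bar - H_bar.min()) / (H_bar.max() - H_bar.min() + eps)

# One head, three tokens: token 0 attends uniformly, token 2 attends to a single key.
P = np.array([[[1 / 3, 1 / 3, 1 / 3],
               [0.6, 0.3, 0.1],
               [1.0, 0.0, 0.0]]])
H_norm = normalized_attention_entropy(P)
```

In this toy case the diffuse token receives the highest normalized entropy and the fully concentrated token the lowest, matching the interpretation given above.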

(b): Margin-based uncertainty

Let $l^{(l)} \in \mathbb{R}^{C}$ denote the class logits produced from the pooled representation at layer $l$, and let $l_{1}^{(l)}$ and $l_{2}^{(l)}$ be the largest and second-largest logits, respectively. The classification margin is

$$m^{(l)} = l_{1}^{(l)} - l_{2}^{(l)}$$

Small margins indicate high uncertainty. We convert this to a normalized uncertainty score:

$$U^{(l)} = 1 - \frac{m^{(l)} - m_{\min}}{m_{\max} - m_{\min} + \varepsilon}$$

where $m_{\min}$ and $m_{\max}$ are computed within the mini-batch or tracked as running statistics. This scalar is broadcast to all tokens:

$$U_{i}^{(l)} = U^{(l)}$$

This reflects the assumption that when the model is uncertain about the instance as a whole, all tokens contribute to cognitive load.
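A minimal sketch of the margin-based score, assuming the within-mini-batch variant of $m_{\min}$ and $m_{\max}$ (the running-statistics variant is not shown). The function name, the example logits, and the sequence length are hypothetical.

```python
import numpy as np

def margin_uncertainty(logits, eps=1e-9):
    """logits: (B, C). Returns (B,) uncertainty, min-max normalized within the batch."""
    top2 = np.sort(logits, axis=-1)[:, -2:]  # two largest logits per instance, ascending
    m = top2[:, 1] - top2[:, 0]              # margin: largest minus second largest
    return 1.0 - (m - m.min()) / (m.max() - m.min() + eps)

logits = np.array([[4.0, 0.5, 0.1],   # confident instance: large margin
                   [1.1, 1.0, 0.2],   # ambiguous instance: small margin
                   [2.0, 0.5, 0.0]])
U = margin_uncertainty(logits)

S = 4                                          # hypothetical sequence length
U_tokens = np.repeat(U[:, None], S, axis=1)    # broadcast the instance score to every token
```

The ambiguous instance receives the highest uncertainty, and every token in an instance shares that instance's score.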

(c): IDF-based lexical importance

For each vocabulary token $v$, inverse document frequency is computed from the training corpus:

$$\mathrm{IDF}(v) = \log \frac{N}{1 + \mathrm{df}(v)}$$

where $N$ is the number of documents and $\mathrm{df}(v)$ is the document frequency. IDF values are normalized to $[0, 1]$ across the vocabulary. For token position $i$ with token index $x_{i}$, we define

$$I_{i} = \widetilde{\mathrm{IDF}}(x_{i})$$

Rare, information-dense tokens thus receive higher lexical importance.
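The IDF signal can be sketched in plain Python as below. The text states that IDF values are normalized to $[0, 1]$ without naming the method, so min-max scaling is an assumption here, as are the function name and the three toy documents.

```python
import math
from collections import Counter

def normalized_idf(docs):
    """docs: list of token lists (training split only). Returns {token: IDF scaled to [0, 1]}."""
    N = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency per token
    idf = {tok: math.log(N / (1 + n)) for tok, n in df.items()}
    lo, hi = min(idf.values()), max(idf.values())
    span = (hi - lo) or 1.0  # guard: every token has the same IDF
    # Min-max scaling is an assumption; the normalization method is not specified.
    return {tok: (v - lo) / span for tok, v in idf.items()}

train_docs = [["the", "film", "was", "great"],
              ["the", "plot", "was", "thin"],
              ["the", "cast", "was", "superb"]]
idf = normalized_idf(train_docs)
I = [idf[tok] for tok in train_docs[0]]  # lexical importance I_i for the first document
```

Tokens that occur in every document ("the", "was") map to 0, while tokens that occur in a single document map to 1.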

(d): Combined cognitive load

The final cognitive load for token $i$ at layer $l$ is

$$L_{i}^{(l)} = w_{E} H_{i}^{(l)} + w_{M} U_{i}^{(l)} + w_{I} I_{i}, \quad w_{E} + w_{M} + w_{I} = 1$$

where $w_{E}, w_{M}, w_{I} \geq 0$ are fixed weights. In our experiments, we explore entropy-only and mixed variants of this formulation.
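The convex combination can be written in a few lines. In the sketch below, the signal values and the mixed weights (0.5, 0.25, 0.25) are illustrative assumptions rather than the settings used in the experiments; the entropy-only variant corresponds to $w_{E} = 1$.

```python
import numpy as np

def cognitive_load(H_norm, U_tok, I_tok, w_E, w_M, w_I):
    """Convex combination of three normalized signals; each input has shape (S,)."""
    w = np.array([w_E, w_M, w_I], dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return w_E * H_norm + w_M * U_tok + w_I * I_tok

H_norm = np.array([1.0, 0.4, 0.0])   # normalized attention entropy per token
U_tok = np.full(3, 0.5)              # instance-level uncertainty, broadcast to tokens
I_tok = np.array([0.0, 1.0, 0.2])    # normalized IDF per token

L_entropy_only = cognitive_load(H_norm, U_tok, I_tok, 1.0, 0.0, 0.0)
L_mixed = cognitive_load(H_norm, U_tok, I_tok, 0.5, 0.25, 0.25)  # illustrative weights
```

Because each input lies in $[0, 1]$ and the weights form a convex combination, the resulting load also lies in $[0, 1]$.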

2.1.4. Attention Budgets and CLT-Constrained Attention

To translate cognitive load into a constraint on attention allocation, we introduce a token-wise attention budget $B_{i}^{(l)}$. Let $b_{\min}$ and $b_{\max}$ denote lower and upper bounds on allowable attention mass. We define

$$B_{i}^{(l)} = b_{\min} + (b_{\max} - b_{\min}) L_{i}^{(l)}$$

Tokens with higher cognitive load are granted larger budgets, allowing them to exert greater influence on the attention mechanism.

Let $P_{j}^{(l)}$ denote the original attention probabilities for head $j$. CLT rescales each row as

$$\hat{P}_{j, i k}^{(l)} = \gamma_{i}^{(l)} P_{j, i k}^{(l)}, \quad \gamma_{i}^{(l)} = \min \left(1, \frac{B_{i}^{(l)}}{\sum_{k} P_{j, i k}^{(l)} + \varepsilon}\right)$$

This ensures that the total outgoing attention mass from token $i$ satisfies

$$\sum_{k} \hat{P}_{j, i k}^{(l)} \leq B_{i}^{(l)}$$

while preserving the relative distribution over keys. The scaled probabilities $\hat{P}_{j}^{(l)}$ are then used to compute attended values. Unlike standard attention normalization, the scaled attention weights are not renormalized to sum to one. Instead, the outgoing attention mass is intentionally bounded by the token-specific budget $B_{i}^{(l)}$, allowing the model to regulate how much influence each token can exert on the aggregated representation.
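The budget mapping and row rescaling can be sketched together in NumPy. The function name, the bounds `b_min=0.4` and `b_max=1.0`, and the uniform attention matrix are assumptions chosen to make the effect easy to read; they are not the experimental settings.

```python
import numpy as np

def clt_rescale(P, L, b_min, b_max, eps=1e-9):
    """P: (H, S, S) softmax attention; L: (S,) load in [0, 1]. Returns rescaled P and budgets."""
    B = b_min + (b_max - b_min) * L                         # (S,) token-wise budgets
    row_mass = P.sum(axis=-1)                               # (H, S), about 1 after softmax
    gamma = np.minimum(1.0, B[None, :] / (row_mass + eps))  # per-row scale factor
    return gamma[..., None] * P, B                          # rows are not renormalized

P = np.full((1, 3, 3), 1 / 3)          # one head, uniform attention
L = np.array([0.0, 0.5, 1.0])          # low-, medium-, and high-load tokens
P_hat, B = clt_rescale(P, L, b_min=0.4, b_max=1.0)  # illustrative bounds
```

Since each softmax row already sums to approximately one, the scale factor in this sketch is effectively $\min(1, B_{i}^{(l)})$: the low-load token keeps about 0.4 of its outgoing mass, while the high-load token is left essentially unchanged.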

2.1.5. Training Integration of CLT-Constrained Attention

The CLT mechanism is fully differentiable and integrated directly into the forward pass. During training, cognitive load and budgets are recomputed at each layer and iteration, allowing attention patterns to adapt dynamically as representations evolve. Importantly, CLT modifies only attention probabilities, leaving the optimizer, loss function, and model architecture unchanged.

By constraining attention through cognitively motivated budgets, CLT encourages the model to allocate representational capacity selectively, stabilizing attention entropy and improving calibration without increasing computational complexity.

This formulation provides a principled bridge between Cognitive Load Theory and transformer attention. In contrast to architectural efficiency methods or post hoc interpretability techniques, CLT introduces an explicit, interpretable resource constraint that shapes attention behavior throughout training and inference.

2.2. Methodology

This section describes the experimental protocol used to evaluate the proposed Cognitive Load-Informed Transformer (CLT) attention mechanism. While the underlying training and evaluation pipeline follows standard practices for transformer-based text classification, it is modified to incorporate CLT-specific attention budgeting within the model’s forward pass. The methodology is designed to isolate the effects of cognitive load–based attention constraints while maintaining consistency across datasets, model configurations, and optimization settings.

To ensure rigorous and reproducible evaluation, all experiments are conducted using a unified workflow that includes controlled data preprocessing, a consistent model architecture, systematic hyperparameter tuning, and repeated trials with randomized data splits. CLT is evaluated against a baseline transformer model trained without cognitive load constraints, allowing direct assessment of its impact on performance, stability, and attention allocation behavior.

The following subsections describe the workflow, datasets, model architecture, optimization procedures, experimental setup, and computational complexity analysis in detail.

2.2.1. Workflow Overview

To provide a comprehensive view of the experimental pipeline, Figure 1 presents a structured overview of the training and evaluation workflow. The diagram summarizes the end-to-end process, including dataset preparation, model initialization, CLT-augmented training, validation-based early stopping, and final evaluation. While several components of the workflow follow standard transformer training practices, the figure highlights where cognitive load-based attention constraints are integrated into the model’s forward pass.

At the outer level, experiments are organized around a fixed set of predefined model variants, including a baseline transformer, a fixed-budget control, and multiple CLT configurations that differ in attention budget bounds and cognitive load weighting parameters. Each configuration is evaluated independently using repeated trials with different random seeds to account for variance due to initialization and data ordering.

Within each trial, standard preprocessing steps are applied prior to training, including dataset loading, tokenization, vocabulary construction, and computation of IDF statistics using the training split only. Datasets are partitioned into training, validation, and test subsets according to a consistent protocol to ensure comparability across experimental conditions.

Models are initialized using a unified transformer architecture and trained using the Adam optimizer with a cosine learning rate schedule and warmup. During training, each mini-batch undergoes a forward pass through the transformer encoder. When CLT is enabled, token-level cognitive load is computed dynamically within each attention layer using attention entropy, prediction margin uncertainty, and IDF-based lexical importance. These load estimates are mapped to token-specific attention budgets, which constrain the outgoing attention distributions prior to value aggregation. When CLT is disabled, the model reverts to standard multi-head self-attention.
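The learning rate schedule is described only as cosine with warmup, so the sketch below assumes one common shape: linear warmup to a peak rate followed by cosine decay to zero. The function name and all numeric values (`base_lr`, `warmup_steps`, `total_steps`) are hypothetical.

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Hypothetical settings, chosen only to show the shape of the curve.
schedule = [lr_at(s, base_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(101)]
```

The rate rises during the first ten steps, peaks at `base_lr`, and then decays smoothly to zero at the final step.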

After each training epoch, model performance is evaluated on a held-out validation set. Early stopping is applied based on validation loss to prevent overfitting, and the best-performing model state within each trial is retained. Once training concludes, the selected model is evaluated on the test set, where both task-level performance metrics (e.g., accuracy, loss, and runtime) and CLT-specific allocation metrics (e.g., attention entropy and budget utilization statistics) are computed.
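Validation-based early stopping with retention of the best checkpoint can be sketched as follows. The patience value and the loss sequence are hypothetical, and the function only tracks epoch indices; in a real training loop the model state would be saved where the comment indicates.

```python
def best_epoch_with_early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # A real loop would save the model state (checkpoint) here.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: training stops before the last value is ever seen.
best, stopped = best_epoch_with_early_stopping(
    [0.90, 0.70, 0.65, 0.66, 0.70, 0.64], patience=2
)
```

With a patience of two, training halts at epoch 4 and the state from epoch 2 is the one carried forward to test evaluation.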

This workflow ensures that all comparisons between baseline and CLT-augmented models are conducted under identical experimental conditions, with differences in performance attributable solely to the inclusion of cognitive load-informed attention constraints.

2.2.2. Data Preprocessing

To evaluate the proposed cognitive load–aware attention mechanism across heterogeneous natural language classification tasks, experiments are conducted on four benchmark datasets: IMDB, AG News, SST-2, and DBpedia. These datasets vary substantially in document length, number of classes, and semantic structure, enabling systematic evaluation of CLT-based attention under diverse linguistic and classification conditions.

To ensure methodological consistency across datasets, a unified partitioning strategy is adopted for IMDB, AG News, and DBpedia. For these datasets, the originally provided training and test splits are first merged into a single corpus and then reshuffled using a fixed random seed. The shuffled corpus is partitioned into 70% training, 15% validation, and 15% test subsets. The validation subset is used exclusively for early stopping during model training, while the test subset is reserved strictly for final performance evaluation. This reshuffling and repartitioning procedure ensures that all datasets are evaluated under a consistent training-to-validation-to-test ratio, eliminating discrepancies introduced by dataset-specific predefined splits and enabling direct comparability of results across tasks. All reshuffling operations are controlled by fixed seeds, and each experiment is repeated across ten seeds to ensure reproducibility and statistical robustness.
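The merge, reshuffle, and 70/15/15 partitioning step can be sketched in plain Python. The function name and the integer stand-in corpus are assumptions; in practice each item would be a (text, label) pair from the merged training and test splits.

```python
import random

def split_70_15_15(examples, seed):
    """Shuffle a merged corpus with a fixed seed and cut it into train/val/test."""
    items = list(examples)
    random.Random(seed).shuffle(items)   # local RNG: reproducible for a given seed
    n_train = len(items) * 70 // 100     # integer arithmetic avoids float rounding
    n_val = len(items) * 15 // 100
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

corpus = list(range(1000))               # stand-in for merged (text, label) pairs
train, val, test = split_70_15_15(corpus, seed=0)
```

Calling the function again with the same seed reproduces the same partition, while a different seed yields a different one, which is what repeating each experiment across ten seeds relies on.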

The SST-2 dataset requires a different handling procedure due to its distribution within the GLUE benchmark framework. The official GLUE test split does not provide publicly available labels, preventing direct evaluation under the unified reshuffling protocol used for the other datasets. Consequently, the official GLUE training split is used for model training, and the official GLUE validation split is used for final evaluation. No merging or reshuffling of predefined splits is performed for SST-2. This deviation from the unified partitioning strategy is necessary to preserve compatibility with the GLUE benchmark structure while maintaining strict separation between training and evaluation data.

Across all datasets, identical preprocessing steps are applied to ensure fair comparison between baseline and CLT-augmented models. Input text is lowercased and tokenized using a simple whitespace tokenizer. Vocabulary construction is performed using only the training subset of each dataset. Rare tokens are filtered according to a minimum frequency threshold, and an upper bound is imposed on vocabulary size to control memory usage. Sequences are truncated or padded to a fixed maximum length to enable batch processing, and tokens are converted to integer indices for embedding lookup. Inverse document frequency (IDF) statistics are computed exclusively from the training subset and normalized to the range [0, 1]. These IDF values are used only within CLT-enabled models as part of the cognitive load computation. Validation and test data are never used during vocabulary construction or IDF estimation, ensuring strict separation between training and evaluation information.
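The tokenization, vocabulary, and padding steps can be sketched as follows. The special tokens `<pad>` and `<unk>`, the threshold `min_freq`, the cap `max_size`, and the example sentences are all hypothetical; the text does not state the values used.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_vocab(train_texts, min_freq, max_size):
    """Build a token-to-index map from training text only."""
    counts = Counter(tok for text in train_texts for tok in text.lower().split())
    kept = [tok for tok, n in counts.most_common() if n >= min_freq][:max_size - 2]
    return {tok: idx for idx, tok in enumerate([PAD, UNK] + kept)}

def encode(text, vocab, max_len):
    """Lowercase, split on whitespace, map to indices, then truncate or pad to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in text.lower().split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

train_texts = ["A great film", "a dull film", "Great cast"]
vocab = build_vocab(train_texts, min_freq=2, max_size=10)
ids = encode("A great unseen film today", vocab, max_len=6)
```

Tokens below the frequency threshold, and tokens never seen in training, both map to the unknown index at encoding time.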

This unified preprocessing and evaluation protocol ensures that performance differences between models arise solely from differences in attention allocation mechanisms rather than from inconsistencies in data handling or partitioning.

2.2.3. Model Architecture

All experiments in this study employ a unified transformer-based text classification architecture, designed to isolate the effects of cognitive load–aware attention from confounding architectural variations. The same backbone architecture is used across all datasets and experimental conditions, with differences arising only from the attention mechanism configuration (baseline vs. CLT-enabled).

Input Representation

Each input document is represented as a sequence of token indices $x = (x_{1}, x_{2}, \dots, x_{S})$, where $S$ denotes the fixed maximum sequence length. Tokens are mapped to dense embeddings using a learnable embedding matrix, producing an embedded sequence $E \in \mathbb{R}^{S \times d}$, where $d$ is the model dimensionality.

To preserve token order information, sinusoidal positional encodings are added to the token embeddings prior to entering the transformer encoder stack. Positional encodings are implemented as fixed, non-trainable functions and are added in a gradient-safe manner to ensure compatibility with CLT-modified attention dynamics.
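A sketch of fixed sinusoidal positional encodings is given below. It assumes the standard formulation from the original Transformer paper [1] (sine on even feature indices, cosine on odd ones, base 10000) and an even model dimension; the text does not confirm these details, and the function name is hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(S, d):
    """Fixed (S, d) encodings: sine on even feature indices, cosine on odd. Assumes even d."""
    pos = np.arange(S)[:, None]                                # (S, 1) positions
    freq = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)   # (d/2,) inverse wavelengths
    pe = np.zeros((S, d))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

pe = sinusoidal_positional_encoding(S=16, d=8)  # toy sizes for illustration
```

Because the table is computed once from positions alone, it carries no trainable parameters and is simply added to the token embeddings.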

Transformer Encoder Stack

The encoder consists of a stack of $L$ identical transformer blocks. Each block comprises:

A multi-head self-attention sublayer, and
A position-wise feedforward sublayer.

Residual connections and layer normalization are applied around each sublayer in the standard transformer configuration. For an input sequence $H^{(l-1)}$ to layer $l$, the block computes:

$$H^{(l)} = \mathrm{LN}\left(H^{(l-1)} + \mathrm{FFN}\left(\mathrm{LN}\left(H^{(l-1)} + \mathrm{MHA}\left(H^{(l-1)}\right)\right)\right)\right)$$

where LN denotes layer normalization, FFN denotes the feedforward network, and MHA denotes the multi-head self-attention sublayer.

The feedforward sublayer consists of two linear transformations with a ReLU activation and dropout applied between them. Dropout is applied throughout the network to mitigate overfitting.

Baseline vs. CLT-Enabled Attention

The baseline model employs standard scaled dot-product multi-head self-attention, where attention weights are computed using softmax normalization and are unconstrained beyond probabilistic normalization.

In contrast, CLT-enabled models replace the baseline attention mechanism with a budgeted multi-head attention module, as described in Section 2.1. Importantly, this modification:

preserves the dimensionality and interface of standard attention,
operates entirely within the attention probability space, and
does not alter the surrounding transformer structure.

At each transformer layer in CLT mode, attention probabilities are dynamically reweighted using token-level budget constraints derived from cognitive load signals. These constraints are applied after softmax normalization and before value aggregation, ensuring compatibility with gradient-based optimization.

Aside from this attention modification, all other architectural components—including embeddings, feedforward layers, normalization, and residual connections—remain unchanged between baseline and CLT configurations.

Sequence Pooling and Classification Head

Following the final transformer layer, token representations are aggregated using mean pooling across the sequence dimension:

$$h_{\mathrm{pool}} = \frac{1}{S} \sum_{i = 1}^{S} H_{i}^{(L)}$$

The pooled representation is passed through a dropout layer and a linear classification head to produce output logits:

$$z = W h_{\mathrm{pool}} + b$$

where the dimensionality of $z$ corresponds to the number of target classes.

For multi-class datasets (AG News, DBpedia), logits are converted to probabilities using the softmax function. For binary classification tasks (IMDB, SST-2), softmax is applied over two output units to maintain a consistent probabilistic interface across experiments.
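The pooling and classification head can be sketched in NumPy as below. Dropout is omitted (as at inference time), and the function name, random weights, and toy shapes are assumptions for illustration; a two-unit softmax is used for the binary case, as described above.

```python
import numpy as np

def classify(H_final, W, b):
    """H_final: (S, d) final-layer token states; W: (C, d); b: (C,). Returns class probabilities."""
    h_pool = H_final.mean(axis=0)   # mean pooling over the sequence dimension
    z = W @ h_pool + b              # logits, one per class
    z = z - z.max()                 # shift for a numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
S, d, C = 6, 8, 2                   # toy sizes; C = 2 mirrors the binary tasks
probs = classify(rng.normal(size=(S, d)), rng.normal(size=(C, d)), np.zeros(C))
```

The same function covers the multi-class datasets by changing only the number of rows in `W` and entries in `b`.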

Architectural Consistency Across Experiments

All datasets share the same model depth, hidden dimensionality, number of attention heads, and feedforward configuration. The only dataset-dependent variations are:

vocabulary size,
number of output classes.

This architectural consistency ensures that observed performance differences arise from attention behavior and training dynamics, rather than from changes in model capacity or representational power.

All models are trained from randomly initialized parameters and do not leverage pretrained transformer weights or pretrained subword tokenization schemes. While pretrained language models typically achieve substantially higher absolute benchmark performance, the present architecture is intentionally trained from scratch to maintain strict control over representational capacity and to isolate the effects of attention allocation mechanisms. Consequently, absolute accuracy values are not intended to compete with state-of-the-art pretrained systems, but rather to provide a consistent basis for comparative analysis between baseline and CLT-enabled configurations.

2.2.4. Attention Mechanism Variants

All models evaluated in this study share an identical training protocol, model architecture, optimizer configuration, and learning rate schedule. The Adam optimizer is used uniformly across all experiments and is not treated as a variable of interest. Consequently, differences in performance can be attributed exclusively to how attention is allocated within the transformer encoder.

We evaluate a baseline transformer model and a family of Cognitive Load Theory (CLT)-based attention mechanisms. These variants differ only in the constraints applied to attention probabilities during the forward pass, enabling a controlled investigation of cognitively informed attention allocation.

1. Baseline Transformer

The baseline model employs standard scaled dot-product multi-head self-attention without any cognitive load constraints or attention budgeting. Attention probabilities are computed using softmax normalization over scaled query–key dot products, allowing each token to distribute its full attention mass freely across the sequence.

This configuration represents conventional transformer attention behavior and serves as the primary reference model for all comparisons reported in this study.

2. CLT-Based Adaptive Attention Models

The proposed CLT-based attention mechanisms introduce adaptive constraints on attention allocation by incorporating token-level cognitive load signals computed during the forward pass. For each token, a normalized cognitive load value is estimated and mapped to an allowable attention budget within a predefined range $[b_{\min}, b_{\max}]$. Attention probabilities are then softly constrained to respect this budget while preserving differentiability and compatibility with gradient-based training.

We evaluate several CLT configurations that differ in how cognitive load is computed and how restrictive the resulting attention budgets are:

a. CLT-E (Entropy-Based Load)

Cognitive load is computed exclusively from attention entropy, capturing uncertainty and dispersion in attention distributions. This variant represents the simplest form of adaptive CLT attention.

b. CLT-E-Tight (Entropy-Based Load with Tighter Budgets)

Identical to CLT-E, but with an increased minimum attention budget $b_{\min}$, producing tighter constraints and enabling analysis of budget sensitivity.

c. CLT-E/M/I (Full CLT Models)

Cognitive load is computed as a weighted combination of:

○ attention entropy (E)
○ prediction margin between the top two predicted class probabilities (M)
○ inverse document frequency-based lexical importance (I)

Multiple configurations of these weights and budget bounds are evaluated (e.g., CLT-B030-E40M40I20) to assess how different cognitive load components influence attention allocation and downstream performance.

Each configuration follows a naming convention of the form “CLT-Bxxx-ExxMxxIxx”, where $B$ denotes the base attention budget parameter, and $E$, $M$, and $I$ represent the relative weights assigned to entropy, margin-based uncertainty, and IDF-based lexical importance, respectively. The weights $E$, $M$, and $I$ correspond to normalized coefficients $w_{E}, w_{M}, w_{I}$ in the cognitive-load formulation, which are constrained to sum to one, whereas $B$ is an independent parameter that controls the allowable attention mass and is not part of the weighting distribution. For brevity, the “CLT-” prefix may be omitted in tables.
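As an illustration of the naming convention, the following sketch parses a configuration name into a budget parameter and three weights, then forms the weighted load combination. It rests on assumptions not stated in this section: that the numeric fields are read as hundredths (so B030 becomes 0.30 and E40 becomes 0.40), and that each input signal has already been normalized to $[0, 1]$. The function names are hypothetical.

```python
import re

def parse_clt_name(name):
    """Parse a name such as 'CLT-B030-E40M40I20' or 'B030-E40M40I20'.

    Assumption: all numeric fields are hundredths (B030 -> 0.30, E40 -> 0.40).
    """
    match = re.fullmatch(r"(?:CLT-)?B(\d{3})-E(\d{2})M(\d{2})I(\d{2})", name)
    if match is None:
        raise ValueError(f"unrecognized configuration name: {name!r}")
    b, e, m, i = (int(g) for g in match.groups())
    weights = {"w_E": e / 100, "w_M": m / 100, "w_I": i / 100}
    # The weights form a distribution; B is independent of them.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights w_E, w_M, w_I must sum to one")
    return {"B": b / 100, **weights}

def combined_load(entropy, margin, idf, cfg):
    """Weighted combination of three signals, each assumed to lie in [0, 1]."""
    return cfg["w_E"] * entropy + cfg["w_M"] * margin + cfg["w_I"] * idf

cfg = parse_clt_name("CLT-B030-E40M40I20")
load = combined_load(entropy=1.0, margin=0.5, idf=0.0, cfg=cfg)
```

The parser accepts names with or without the “CLT-” prefix, matching the shortened form used in tables.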

Across all CLT variants, the underlying transformer architecture, optimization procedure, and training schedule remain unchanged. CLT operates exclusively by modulating attention distributions within each layer, ensuring that performance differences arise from attention allocation behavior rather than changes in model capacity or optimization dynamics.

The evaluated attention mechanisms range from standard unconstrained transformer attention to adaptive, cognitively informed CLT-based attention budgeting. By holding all other experimental factors constant, this design enables a focused assessment of whether dynamic, token-level attention regulation grounded in cognitive load principles improves learning efficiency and generalization compared to conventional transformer attention.

2.2.5. Experimental Setup

All experiments follow a controlled and reproducible training protocol designed to isolate the effects of cognitive-load-based attention modulation. Unless otherwise stated, architectural configurations, optimization settings, and training procedures are held constant across all model variants.

In this study, models are trained from scratch using a lightweight transformer architecture and a simple whitespace tokenizer rather than relying on large pretrained language models. This design choice allows us to evaluate the proposed CLT-based attention mechanism in a simplified and controlled experimental setting where the behavior of the attention module can be examined more directly. While pretrained transformers typically achieve substantially higher absolute accuracy on these benchmarks, the objective of the present experiments is to analyze how cognitively motivated attention constraints influence learning dynamics and attention allocation patterns under comparable architectural and training conditions.

Each model is trained using the Adam optimizer with a base learning rate of $3 \times 10^{-4}$ and weight decay of $10^{-2}$. A cosine learning rate schedule with linear warm-up is applied, where the warm-up phase spans 5% of the total training steps. Gradient norms are clipped to a maximum value of 1.0 to stabilize optimization.
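A schedule of this kind can be sketched as a function of the training step. The text specifies the base rate and the 5% warm-up fraction; the remaining details here are assumptions: warm-up ramps linearly from near zero, the cosine phase decays to zero, and the function name `lr_at_step` is hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr=3e-4, warmup_frac=0.05):
    """Linear warm-up to base_lr, followed by cosine decay.

    Assumptions: decay ends at zero; warm-up length is floor(5% of steps).
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1000 total steps, for example, the first 50 steps form the warm-up phase and the rate then falls along a half cosine over the remaining 950 steps.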

For all datasets, data are randomly shuffled and partitioned into training (70%), validation (15%), and test (15%) splits using fixed random seeds to ensure reproducibility. Models are trained for up to 10 epochs, with early stopping applied based on validation loss using a patience of five epochs. The model state achieving the lowest validation loss is retained for final evaluation.
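A seeded 70/15/15 partition can be sketched as below. The function name is hypothetical, and the sketch assumes a plain (non-stratified) shuffle, since the text describes only random shuffling with fixed seeds.

```python
import random

def split_indices(n, seed, train_frac=0.70, val_frac=0.15):
    """Shuffle indices 0..n-1 with a fixed seed and cut into train/val/test."""
    rng = random.Random(seed)      # local generator, so the split is reproducible
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(train_frac * n)
    n_val = round(val_frac * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]   # remainder, roughly 15%
    return train, val, test
```

Calling the function twice with the same seed returns the same partition, which is the property the fixed seeds are meant to guarantee.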

To account for stochastic variability arising from random initialization and data shuffling, each model configuration is evaluated across 10 independent random seeds. Reported results are aggregated across these runs and summarized using the mean and standard deviation.

For completeness and full reproducibility, Table 1 summarizes all architectural, optimization, preprocessing, and training hyperparameters used across datasets.

Model performance is assessed exclusively on the held-out test set using the following primary metrics:

Test loss, measured using cross-entropy loss.
Classification accuracy, computed as the proportion of correctly classified samples.
Training time, measured as total wall-clock runtime per trial.
Epochs executed, indicating the number of training epochs completed prior to early stopping.
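The distinction between the epoch of the selected checkpoint and the number of epochs executed can be made concrete with a small sketch of the early-stopping rule described above (at most 10 epochs, patience of five). The function name is hypothetical, and the sketch assumes that only a strict decrease in validation loss counts as an improvement.

```python
def select_checkpoint(val_losses, max_epochs=10, patience=5):
    """Return (best_epoch, best_loss, epochs_run) under early stopping.

    val_losses: validation loss per epoch, in order (epochs are 1-indexed).
    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss, or after `max_epochs`.
    """
    best_epoch, best_loss = 0, float("inf")
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss   # the model state is saved here
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss, epochs_run
```

For a run whose validation loss bottoms out at epoch 2 and never improves afterwards, the selected checkpoint is epoch 2 while seven epochs are executed, so the two reported quantities can differ substantially.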

In addition to performance metrics, we collect a set of attention allocation diagnostics for models employing CLT-based attention mechanisms. These include mean attention entropy, average allocated attention budget, the proportion of tokens operating at minimum and maximum budget constraints, and a rank-based correlation between cognitive load estimates and outgoing attention mass. Entropy is computed as the Shannon entropy of post-softmax attention distributions and is averaged across heads, tokens, batches, and random seeds. All experiments use a fixed maximum sequence length of $S = 256$ tokens. Entropy values reported in tables are not normalized by $\ln(S)$. Under a uniform attention distribution over $S$ positions, entropy equals $\ln(S)$, which evaluates to $\ln(256) \approx 5.545$. The baseline model consistently exhibits near-uniform attention allocation, leading to entropy values close to this theoretical maximum. The extremely small variance observed in aggregated entropy statistics reflects averaging across large numbers of tokens and trials rather than a degenerate or constant diagnostic. These diagnostics are used solely for post hoc analysis of attention behavior and do not influence training or model selection. In particular, they provide observable indicators of how attention is allocated in the baseline transformer, where allocation is determined only by similarity-based interactions, and allow comparison with CLT-enabled models that introduce explicit token-level resource constraints.
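The entropy ceiling quoted above can be checked numerically. The sketch below computes unnormalized Shannon entropy in nats for a uniform distribution over 256 positions and for a hypothetical peaked distribution; the peaked example is illustrative only and is not taken from the experiments.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

S = 256
uniform = [1.0 / S] * S
# Hypothetical focused row: 90% of the mass on one position.
peaked = [0.9] + [0.1 / (S - 1)] * (S - 1)

h_uniform = shannon_entropy(uniform)   # equals ln(256), about 5.545
h_peaked = shannon_entropy(peaked)     # much lower than the uniform case
```

Because the reported values are not divided by $\ln(S)$, a mean entropy near 5.545 indicates near-uniform attention, and lower values indicate more concentrated rows.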

All models are implemented in PyTorch and trained using GPU acceleration when available. Random seeds are fixed across Python 3.12.7, NumPy 1.26.4, and PyTorch 2.5.1 to ensure deterministic behavior where possible.

2.2.6. Computational Complexity

The proposed Cognitive Load Theory (CLT)-based attention mechanisms are designed to operate entirely within the standard transformer attention pipeline, without introducing additional forward passes, auxiliary models, or post hoc explanation procedures. As a result, the overall computational complexity of training and inference remains dominated by the cost of conventional multi-head self-attention.

For a transformer layer with $H$ attention heads, sequence length $S$, and key/query dimensionality $d_{k}$, the standard scaled dot-product self-attention mechanism incurs a computational cost of

O (H \cdot S^{2} \cdot d_{k})

which dominates both training and inference in transformer-based models.

The CLT mechanism augments attention by computing a token-level cognitive load signal during the forward pass and using this signal to softly constrain attention allocation. Importantly, all quantities required for cognitive load estimation are derived from intermediate values already computed as part of standard model execution. Specifically:

Attention entropy is computed directly from normalized attention weights, requiring $O (S)$ operations per token.
Prediction margin, defined as the difference between the top two class probabilities, is computed from model logits with constant-time overhead.
Lexical importance, when used, is obtained via constant-time lookup from a precomputed inverse document frequency (IDF) table.
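Of the three signals listed above, the prediction margin is the simplest to illustrate. The sketch below computes it from class logits as the difference between the two largest softmax probabilities; the function names are hypothetical, and how the margin is subsequently normalized or converted into an uncertainty term is not shown here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_margin(logits):
    """Difference between the two largest class probabilities."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]

confident = prediction_margin([10.0, 0.0, 0.0])   # close to 1
uncertain = prediction_margin([0.0, 0.0, 0.0])    # exactly 0 for tied logits
```

The cost depends only on the number of classes, not on the sequence length, which is why the margin adds constant overhead per prediction.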

The subsequent mapping from cognitive load to an allowable attention budget and the rescaling of attention probabilities involve only elementwise operations and normalization, incurring linear complexity in the sequence length. Consequently, the additional overhead introduced by CLT per attention layer is $O(S)$, which is asymptotically negligible relative to the $O(S^{2})$ cost of attention computation.

Crucially, CLT does not modify the structure of the attention matrix, alter the number of attention heads, or introduce auxiliary optimization steps. Therefore, both training and inference retain the same asymptotic complexity as the baseline transformer model. This contrasts with prior feature-attribution-guided optimization methods, such as SHAP-based approaches, which introduce additional computational stages or surrogate models to estimate feature importance.

Empirically, the computational efficiency of CLT-based models is reflected in the reported training time and early stopping behavior across experimental runs. These runtime measurements, reported alongside accuracy and loss metrics, provide practical confirmation that CLT attention constraints do not impose substantial computational overhead beyond standard transformer training.

3. Results

3.1. IMDB

The IMDB dataset is a binary sentiment classification task involving movie reviews labeled as positive or negative. The dataset presents a challenging natural language setting characterized by long input sequences, semantic ambiguity, and varying degrees of contextual relevance across tokens. These properties make IMDB well suited for evaluating attention mechanisms that adaptively regulate information flow, as cognitively salient tokens may be sparsely distributed throughout the input.

3.1.1. Core Performance Results

Table 2 summarizes the core performance metrics for the baseline transformer model and the evaluated CLT-based attention variants on the IMDB dataset. Reported results include test accuracy, test loss, the epoch index corresponding to the best validation-loss checkpoint, and total training time, each averaged over 10 independent runs. The reported epoch index reflects the point during training at which the minimum validation loss was achieved and the corresponding model state was selected for final evaluation on the test set.

Across all models, CLT-based attention mechanisms demonstrate competitive or improved predictive performance relative to the baseline. In particular, the entropy-based CLT variant (CLT-E) achieves the highest mean accuracy (0.6135 ± 0.0861), exceeding the baseline transformer (0.5569 ± 0.0468). Several CLT configurations also attain lower test loss than the baseline, indicating improved generalization.

3.1.2. Key Performance Observations

Several trends emerge from the IMDB results:

Accuracy and loss trade-offs: CLT-E provides the strongest accuracy improvement over the baseline while maintaining a lower test loss, suggesting that entropy-based cognitive load alone is sufficient to guide effective attention modulation for sentiment classification on IMDB. CLT variants incorporating additional cognitive signals (margin and IDF) exhibit more variable performance, indicating that these components do not uniformly benefit long-form text classification.
Validation-selected checkpoint behavior: The baseline model achieves its minimum validation loss at an earlier epoch on average than most CLT variants, whereas CLT-based models tend to reach their best validation performance at later epochs. This indicates that adaptive attention constraints may require additional training iterations to fully stabilize, even though the final selected model state may occur well before training termination under early stopping.
Runtime efficiency: Despite differences in the epoch at which the best validation checkpoint is obtained, total training times for CLT variants remain comparable to the baseline. Variations in runtime are primarily driven by early stopping behavior rather than additional computational overhead, consistent with the complexity analysis in Section 2.2.6.

3.1.3. Attention Allocation Analysis

To assess whether CLT mechanisms operate as intended, Table 3 reports attention allocation diagnostics computed during evaluation. These metrics provide insight into how cognitive load estimates influence attention behavior.

The baseline transformer exhibits the highest mean attention entropy (5.55), consistent with unconstrained attention distributions, and shows no measurable correlation between cognitive load and attention mass allocation. In contrast, all CLT-based models substantially reduce attention entropy, indicating more focused attention distributions.

Entropy-based CLT variants display particularly strong alignment between cognitive load and attention allocation. CLT-E achieves a high mean Spearman-like correlation between load and attention mass (ρ ≈ 0.90), demonstrating that tokens with higher estimated cognitive load consistently receive greater attention. CLT configurations with tighter or more complex budget formulations exhibit increased saturation at maximum budget levels, reflecting stronger constraints on attention allocation.

These diagnostics confirm that CLT-based mechanisms not only improve predictive performance but also meaningfully alter attention behavior in accordance with cognitive load estimates, providing mechanistic support for the observed performance trends.

3.2. AG News

The AG News dataset is a multi-class news topic classification task, requiring discrimination across short text segments with overlapping vocabulary and domain-specific cues. Compared to IMDB, AG News places greater emphasis on capturing class-defining keywords and local context while remaining robust to stylistic variability across sources. These properties make AG News a useful setting for assessing whether CLT-based attention variants improve classification performance while producing measurable, non-degenerate attention allocation behavior under budget modulation.

3.2.1. Core Performance Results

Table 4 summarizes the core performance metrics for the baseline transformer model and the evaluated CLT-based attention variants on the AG News dataset. Reported results include test accuracy, test loss, the epoch index corresponding to the best validation-loss checkpoint, and total training time, each averaged over 10 independent runs. The reported epoch index reflects the point during training at which the minimum validation loss was achieved, and the corresponding checkpoint was used for final evaluation on the test set.

Across models, CLT-based attention mechanisms yield improved mean performance relative to the baseline. In particular, the entropy-based CLT variant (CLT-E) achieves the highest mean accuracy (0.506 ± 0.063), exceeding the baseline mean (0.389 ± 0.165), while also reducing test loss (1.041 ± 0.061 vs. 1.238 ± 0.191). Although seed-level variability is substantial in this setting, several budget-scheduled CLT configurations demonstrate higher average accuracy and lower test loss than the baseline. The magnitude of improvement varies across budget splits, indicating that allocation strategy can meaningfully influence downstream classification behavior in this higher-entropy, multi-class task.

3.2.2. Key Performance Observations

Accuracy and loss improvements: CLT-E provides the strongest overall accuracy improvement, while the largest reduction in test loss relative to the baseline is obtained with B030-E50M50I00. This suggests that entropy-guided allocation is particularly effective for multi-class topic classification, where class evidence can be concentrated in short spans.
Checkpoint behavior and training dynamics: The baseline tends to terminate earlier (lower mean epochs executed) yet exhibits high variability in outcomes, whereas CLT variants typically require more epochs before convergence to their best validation checkpoint, consistent with the added constraints introduced by CLT-based allocation.
Runtime trade-offs: CLT variants generally increase training time relative to the baseline due to longer training trajectories and later checkpoint selection. However, the additional cost is accompanied by improved mean performance relative to the baseline in several configurations, indicating that the runtime increase reflects substantive learning differences rather than purely incidental overhead.

3.2.3. Attention Allocation Analysis

To assess whether CLT mechanisms operate as intended, Table 5 reports attention allocation diagnostics computed during evaluation. These metrics quantify how cognitive load estimates are reflected in attention behavior and include mean attention entropy, mean budget utilization, budget saturation indicators (percentage of time at maximum and minimum budget), and a Spearman-like correlation between cognitive load estimates and allocated attention mass. These diagnostics are aggregated across 10 trials.
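The load-to-attention alignment diagnostic is described only as a Spearman-like correlation. As a point of reference, the sketch below implements the standard Spearman rank correlation with average ranks for ties; treating the diagnostic as standard Spearman, and returning 0.0 when either input is constant, are assumptions made here, and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of rank vectors; 0.0 if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0   # assumed convention for a constant input
    return cov / (var_x * var_y) ** 0.5

# Toy per-token values: attention mass rises monotonically with load.
rho = spearman([0.1, 0.4, 0.9], [0.3, 0.5, 1.0])
```

A value near 1 means that tokens ranked higher by load are also ranked higher by attention mass, regardless of whether the relationship is linear.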

As expected, the baseline transformer exhibits trivial allocation statistics under the reported diagnostics, including high attention entropy and no measurable association between cognitive load estimates and attention mass, reflecting the absence of any load-informed allocation mechanism. In contrast, CLT-based variants consistently reduce attention entropy and increase budget utilization, indicating more structured allocation behavior. Among all variants, CLT-E demonstrates the strongest alignment between cognitive load estimates and attention allocation (ρ ≈ 0.996 ± 0.009), providing evidence that the model’s allocation mechanism tightly couples load estimates with attention mass in the intended direction.

Notably, configurations that exhibit stronger budget saturation tendencies show weaker and more variable load–allocation alignment, suggesting that overly aggressive budget behavior can reduce the monotonic relationship between load and attention mass. Overall, these diagnostics support that CLT-based attention variants not only improve predictive performance on AG News but also meaningfully alter attention distributions in accordance with the cognitive load-guided allocation objective.

3.3. SST-2

The SST-2 dataset is a binary sentiment classification benchmark composed of short movie-review snippets labeled as positive or negative. Compared to IMDB, sequences are substantially shorter, making SST-2 a useful setting for evaluating whether CLT-based attention constraints remain beneficial when discriminative evidence is typically concentrated within a smaller portion of the input.

3.3.1. Core Performance Results

Table 6 summarizes the core performance metrics for the baseline transformer model and the evaluated CLT-based attention variants on SST-2. Reported results include test accuracy, test loss, the epoch index corresponding to the best validation-loss checkpoint, and total training time, each aggregated over 10 independent trials. The “epoch” value reflects the training epoch at which the minimum validation loss was achieved and the corresponding model state was selected for final evaluation.

Across methods, the entropy-based CLT variant (CLT-E) achieves the strongest predictive performance, producing the highest mean accuracy (0.580 ± 0.075) and the lowest mean test loss (0.663 ± 0.037) relative to the baseline (0.508 ± 0.007 accuracy; 0.695 ± 0.001 loss). Several budgeted configurations yield modest improvements over baseline accuracy (e.g., B030-E50M50I00 at 0.533 ± 0.055), while others are comparable to baseline (e.g., B050-E40M40I20 at 0.508 ± 0.006). CLT-E-Tight improves over baseline accuracy (0.546 ± 0.061) but does not match CLT-E, suggesting that tighter constraint settings may reduce flexibility needed for optimal sentiment discrimination on short sequences.

3.3.2. Key Performance Observations

Predictive performance: CLT-E provides the clearest benefit, improving both accuracy and loss versus the baseline. In contrast, budgeted variants show mixed outcomes, with some configurations improving accuracy modestly and others remaining near baseline levels.
Checkpoint behavior (best validation epoch): The best validation-loss checkpoint is typically reached early (≈1.9–2.6 epochs on average) across both baseline and CLT variants, indicating that SST-2 converges quickly under the shared training protocol. Variability in the selected epoch is larger for some CLT settings (e.g., CLT-E and CLT-E-Tight), consistent with increased sensitivity to stochasticity across trials.
Runtime differences: Training time is consistently higher for CLT variants (≈507–537 s) than for the baseline (342 ± 49 s). Given that the selected checkpoint epoch is similar across several methods (e.g., baseline and some CLT variants near 1.9), this runtime gap is consistent with additional per-step computation introduced by CLT attention regulation, including cognitive-load estimation and budget mapping operations, in addition to any early-stopping variation. Importantly, this overhead reflects constant-factor scalar computations applied to existing attention tensors rather than architectural expansion, additional projection layers, or increased parameter count.

3.3.3. Attention Allocation Analysis

To assess whether CLT mechanisms alter attention behavior in the intended manner, Table 7 reports attention allocation diagnostics computed during evaluation. These metrics include mean attention entropy, mean budget utilization, budget saturation indicators (percentage of time at maximum/minimum budget), and a Spearman-like correlation between cognitive-load estimates and allocated attention mass, aggregated across trials.

All CLT variants exhibit substantially lower attention entropy than the baseline (baseline entropy 5.545, versus ≈ 2.285–3.803 for CLT variants), indicating that CLT attention mechanisms consistently induce more concentrated attention distributions. Alignment between cognitive-load estimates and attention mass is strongest for B050-E40M40I20 (ρ = 1.000 ± 0.001) and B030-E40M40I20 (ρ = 0.990 ± 0.030), while configurations with heavier budget utilization show weaker and more variable alignment (e.g., B030-E50M50I00 at ρ = 0.397 ± 0.513). CLT-E and CLT-E-Tight exhibit moderate-to-strong alignment on average (ρ = 0.672 ± 0.466 and ρ = 0.594 ± 0.512, respectively), though with substantial variability across trials.

Overall, the SST-2 diagnostics indicate that CLT mechanisms reduce attention entropy and, in several configurations, strengthen the coupling between cognitive-load estimates and attention allocation. While end-task accuracy differences are modest, CLT-E achieves the strongest mean performance among the evaluated configurations, suggesting that entropy-driven load signals may provide a stable inductive bias in this binary setting.

3.4. DBpedia

DBpedia is a large-scale topic classification task with many label categories and substantial lexical diversity. Compared to binary sentiment datasets, DBpedia requires the model to allocate attention across broader semantic cues that may be distributed throughout the input, making it a useful setting for evaluating whether CLT-based mechanisms can regulate attention without degrading predictive performance.

3.4.1. Core Performance Results

Table 8 summarizes the core performance metrics for the baseline transformer and CLT-based attention variants on DBpedia. Relative to the baseline, all CLT variants achieve higher mean accuracy and substantially lower test loss, indicating improved generalization in this multi-class setting. The strongest mean accuracy is obtained by B030-E40M40I20 (0.266 ± 0.069), closely followed by B030-E50M50I00 (0.261 ± 0.040). These gains coincide with lower losses for the CLT variants (≈1.88–1.99) compared to the baseline (2.460 ± 0.067), suggesting that CLT-informed attention yields more calibrated predictions across the larger label space.

3.4.2. Key Performance Observations

Improvement over the baseline: All evaluated CLT configurations achieve substantially higher mean accuracy than the baseline (baseline 0.131 ± 0.012 vs. CLT range ≈ 0.216–0.266), while simultaneously reducing test loss. The magnitude of this gap across all runs suggests that the observed gains are not attributable to a loss/accuracy trade-off but reflect a systematic shift in model behavior under CLT-constrained attention.
Convergence and checkpoint timing: CLT variants again tend to reach their best validation-loss checkpoint later than the baseline (≈5–9 epochs vs. 2.9 epochs). This is consistent with attention regulation acting as an additional constraint that may delay convergence to the best-generalizing solution under early stopping, even as final test performance improves.
Runtime driven by training dynamics rather than mechanism overhead: Training time is high for DBpedia across all methods (≈4700–5700 s), but differences across CLT variants track differences in the epoch index of the best checkpoint rather than any structural computational burden introduced by CLT attention. The baseline is faster primarily because it terminates earlier and reaches its selected checkpoint sooner.

3.4.3. Attention Allocation Analysis

To assess whether CLT mechanisms operate as intended, Table 9 reports attention allocation diagnostics computed during evaluation. Across all CLT variants, mean attention entropy is lower than the baseline (5.545), indicating more concentrated attention distributions. In parallel, budget utilization increases substantially above the baseline level (0.300), with most CLT variants operating in the ~0.53–0.75 range, reflecting active use of the allocation constraints.

Alignment between cognitive-load estimates and attention allocation mass is also strong for most CLT configurations. In particular, B030-E50M50I00 and CLT-E-Tight show Spearman-like correlations near 1.0, indicating that tokens estimated as more cognitively salient systematically receive greater attention mass. Budget saturation behavior further differentiates variants: some configurations rarely saturate at the maximum budget (e.g., B030-E40M40I20), while others exhibit higher time at maximum allocation (e.g., B050-E50M50I00), reflecting stricter constraint engagement. Overall, these diagnostics indicate that CLT-based attention improves predictive performance while producing measurable and coherent shifts in attention concentration, budget usage, and cognitive-load/attention alignment.

4. Discussion

This study investigated the role of cognitive-load-informed attention mechanisms in transformer models, evaluating whether explicitly constraining attention allocation according to principles from Cognitive Load Theory (CLT) can improve learning efficiency, generalization, and mechanistic transparency across diverse natural language processing tasks. By systematically varying entropy-based and budget-based attention formulations, we examined how different cognitive constraints shape both predictive performance and internal attention allocation behavior.

Across all evaluated datasets, CLT-based attention mechanisms matched or outperformed the baseline transformer in mean accuracy, with the best CLT variant exceeding the baseline on every dataset, indicating that attention regulation informed by cognitive principles provides measurable benefits beyond unconstrained self-attention. These gains were particularly pronounced on datasets characterized by long input sequences or heterogeneous semantic content (e.g., IMDB, AG News, DBpedia), where unconstrained attention is more likely to diffuse across irrelevant tokens. In such settings, CLT-informed models achieved improved accuracy and reduced test loss, suggesting that limiting attention dispersion encourages more effective use of task-relevant evidence under capacity constraints, consistent with classic accounts of limited attentional resources and effort allocation [16,17].

A key finding of this work is that entropy-based CLT variants (CLT-E) yielded the most robust and consistent improvements across tasks. By penalizing high-entropy attention distributions, these models promoted sharper and more selective attention patterns without imposing rigid token-level constraints. This behavior aligns closely with CLT’s central claim that learning benefits when extraneous cognitive load is reduced while maintaining resources for task-relevant processing [3,18]. The strong performance of CLT-E across binary and multi-class classification tasks suggests that entropy-driven attention regularization provides a flexible, architecture-compatible mechanism for controlling attention dispersion in a manner consistent with cognitive theory.

Budget-based CLT variants further revealed important trade-offs between attention expressiveness and constraint strength. Configurations with tighter budget enforcement exhibited lower attention entropy and higher budget saturation, indicating stronger concentration of attention mass. While these models often demonstrated stable generalization, excessively tight constraints occasionally reduced performance, particularly on tasks requiring distributed semantic reasoning. This observation reflects a familiar CLT design principle: constraints must be calibrated so that regulation reduces extraneous load without suppressing information integration needed for comprehension [19]. Similar regulation–expressiveness trade-offs have also been observed in attention mechanisms that explicitly control sparsity or span, where stronger structural constraints can improve efficiency but may limit long-range contextual aggregation when applied too aggressively [11,12].

Beyond predictive metrics, attention allocation diagnostics provided strong evidence that CLT-based mechanisms altered internal model behavior in systematic ways. In contrast to the baseline transformer, which exhibited high attention entropy and no alignment between attention mass and cognitive load estimates, CLT-based models showed substantial correlation between estimated cognitive load and allocated attention across datasets. This result is important because it provides behavioral validation that the imposed constraints actually shaped allocation in the intended direction, rather than only changing accuracy as a side effect. At the same time, we emphasize that attention weights should not be treated as standalone explanations of model decisions [20]. Instead, in this work, attention statistics are used as a mechanistic probe whose interpretation is strengthened by explicit constraints and by quantitative alignment with a theory-grounded signal (cognitive-load estimates), consistent with the view that attention can be informative when evaluated under appropriate conditions [21].
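To make the alignment diagnostic concrete, the sketch below computes a standard Spearman rank correlation between per-token load estimates and allocated attention mass. The tables describe the reported statistic only as "Spearman-like", so treating it as an ordinary Spearman coefficient is an assumption, as is the convention of returning 0.0 when one input is constant. The example values are invented for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 by convention if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


load = [0.10, 0.40, 0.35, 0.80, 0.60]  # illustrative per-token load estimates
mass = [0.30, 0.52, 0.49, 0.91, 0.70]  # illustrative allocated attention mass

print(spearman(load, mass))       # 1.0 (mass rises monotonically with load)
print(spearman(load, [0.3] * 5))  # 0.0 (constant mass carries no rank signal)
```

Under this convention, a model that assigns every token the same attention mass yields a correlation of zero, while mass that rises monotonically with load yields a value of one.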

Dataset-dependent effects further clarify when CLT-based attention is most beneficial. On shorter or more locally determined tasks (e.g., SST-2), performance gains over the baseline were more modest, reflecting a reduced need for strong allocation control when semantic structure is relatively uniform. In contrast, complex datasets with many classes and longer documents (e.g., DBpedia) benefited substantially from attention constraint mechanisms, reinforcing the idea that selective allocation becomes increasingly valuable as informational load rises. This pattern is consistent with cognitive findings that selective attention and control demands increase under high information load and competition [4,17].

From a computational perspective, CLT-based attention mechanisms introduced only modest per-step overhead relative to standard transformer training, since they modify attention scoring and allocation rather than relying on approximate attention kernels or sparse-pattern search procedures. Differences in runtime were largely attributable to early-stopping dynamics rather than per-step complexity, consistent with the complexity analysis in Section 2.2.6. This efficiency contrasts with many efficient-transformer variants that alter attention computation through approximation or architectural restructuring to reduce asymptotic cost [5,6,7,8,9].

The observation that CLT-enabled models sometimes require more epochs to reach the optimal validation checkpoint reflects the effect of introducing structured constraints on attention allocation rather than instability in the optimization process. In the baseline transformer, attention distributions are unconstrained and may quickly converge to locally effective but diffuse allocation patterns. In contrast, CLT-based models regulate attention mass through token-level budgets derived from cognitive-load estimates, restricting overly entropic distributions and encouraging more selective evidence aggregation. Because the model must learn to operate within these allocation constraints, early training may proceed more gradually before stable attention patterns emerge. Similar behavior is commonly observed in regularized learning systems (e.g., dropout or entropy penalties), where additional constraints can slow early convergence while improving the stability and generalization of the learned representations. Thus, the slightly longer training trajectories observed in some configurations reflect the regularizing influence of the constraints rather than increased optimization noise.
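The relationship between the best validation checkpoint and the number of epochs actually trained can be sketched as follows. The sketch uses the patience of 5 and the maximum of 10 epochs listed in Table 1, but it replaces real training with a precomputed list of validation losses. The function name, the loss values, and the strict-improvement rule are assumptions made for illustration.

```python
def run_with_early_stopping(val_losses, patience=5, max_epochs=10):
    """Return (best_epoch, best_loss, epochs_trained) for per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_trained = 0
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_trained = epoch
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_loss, epochs_trained


# Illustrative run that plateaus early: best checkpoint at epoch 2, stop at epoch 7.
plateau = [0.690, 0.688, 0.689, 0.691, 0.690, 0.692, 0.693, 0.700]
print(run_with_early_stopping(plateau))  # (2, 0.688, 7)

# Illustrative run that keeps improving: best checkpoint at the epoch cap.
gradual = [1.20, 1.10, 1.00, 0.95, 0.90, 0.88, 0.87, 0.86, 0.85, 0.84, 0.50]
print(run_with_early_stopping(gradual))  # (10, 0.84, 10)
```

A run that keeps improving slowly selects a later checkpoint and trains for more epochs than one that plateaus immediately, without implying any instability in optimization.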

Several limitations warrant discussion. First, cognitive load estimates in this study were derived from proxy measures rather than direct neurophysiological signals. Prior work links workload to neural activation patterns measurable via EEG or fNIRS [13,14], so integrating physiological workload signals could provide stronger grounding for the load estimates used to drive or evaluate constrained attention. Second, the evaluated models focus on transformer-based text classification. Extending CLT-guided attention to sequence-to-sequence tasks, multimodal architectures, or large-scale pretraining regimes remains an important direction for future research.

Future work should explore adaptive CLT mechanisms, where attention budgets or entropy penalties evolve across layers or training stages. Such strategies may reduce unnecessary restriction during early representation learning while enforcing stronger regulation once attention patterns stabilize. In addition, integrating CLT-based attention with deployment-oriented transformer pipelines (e.g., efficient long-context processing or controllable span mechanisms) may offer a practical path toward cognitively informed and resource-efficient NLP systems [5,6,7,8,9,11]. Another direction is to study how CLT-constrained attention interacts with efficient or sparse attention architectures (e.g., Linformer, Performer, or Longformer), which modify attention connectivity to improve computational scalability. Finally, extending the approach to pretrained transformer models and incorporating refined padding-aware masking strategies may help evaluate whether CLT-based attention constraints remain beneficial in larger, pretrained representation regimes and whether they lead to stronger absolute predictive performance.

This study demonstrated that incorporating principles from Cognitive Load Theory into transformer attention mechanisms yields measurable benefits in predictive performance and produces attention-allocation patterns that more closely align with a theory-grounded notion of cognitive salience. By regulating how attention is distributed under constrained capacity, CLT-based models achieve more focused information routing, particularly in high-complexity regimes. These findings motivate further work on theory-driven attention designs that more tightly connect cognitive constraints, empirical allocation diagnostics, and modern deep learning practice [3,4,18,19].

Author Contributions

Conceptualization, J.G.; Formal analysis, J.G.; Investigation, J.G.; Methodology, J.G. and V.S.S.; Project administration, V.S.S.; Software, J.G.; Writing—original draft, J.G.; Writing—review & editing, J.G. and V.S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available at https://ai.stanford.edu/~amaas/data/sentiment/, https://huggingface.co/datasets/SetFit/ag_news, https://huggingface.co/datasets/fancyzhx/dbpedia_14, and https://huggingface.co/datasets/glue/viewer/sst2 (all accessed on 16 November 2025). The implementation code used to generate the experimental results, including model definitions, training procedures, and CLT-based attention variants, is publicly available at https://github.com/Jarrod828/clt_attention (accessed on 20 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
2. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
3. Sweller, J. Cognitive Load Theory, Learning Difficulty, and Instructional Design. Learn. Instr. 1994, 4, 295–312.
4. Lavie, N.; Hirst, A.; de Fockert, J.W.; Viding, E. Load Theory of Selective Attention and Cognitive Control. J. Exp. Psychol. Gen. 2004, 133, 339–354.
5. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768.
6. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, Ł.; et al. Rethinking Attention with Performers. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021.
7. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The Efficient Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
8. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. In Proceedings of the Findings of the Association for Computational Linguistics (ACL), Seattle, WA, USA, 5–10 July 2020.
9. Zaheer, M.; Guruganesh, G.; Dubey, A.; Ainslie, J.; Alberti, C.; Ontañón, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big Bird: Transformers for Longer Sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297.
10. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating Long Sequences with Sparse Transformers. arXiv 2019, arXiv:1904.10509.
11. Sukhbaatar, S.; Grave, E.; Bojanowski, P.; Joulin, A. Adaptive Attention Span in Transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019.
12. Martins, A.F.T.; Astudillo, R.F. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016.
13. Gupta, A.; Siddhad, G.; Pandey, V.; Roy, P.P.; Kim, B.-G. Subject-Specific Cognitive Workload Classification Using EEG-Based Functional Connectivity and Deep Learning. Sensors 2021, 21, 6710.
14. Herff, C.; Heger, D.; Fortmann, O.; Hennrich, J.; Putze, F.; Schultz, T. Mental Workload during N-Back Tasks Quantified in the Prefrontal Cortex Using fNIRS. Front. Hum. Neurosci. 2013, 7, 935.
15. Wu, F.; Ren, Z.; Sha, Z.; Fu, Z.; Xu, Z.; Huang, Z.; Zhang, Z.; Xie, Z.; Ma, Z.; Zhang, Z.; et al. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437.
16. Kahneman, D. Attention and Effort; Prentice-Hall: Englewood Cliffs, NJ, USA, 1973.
17. Dehaene, S.; Changeux, J.-P. Experimental and Theoretical Approaches to Conscious Processing. Neuron 2011, 70, 200–227.
18. Sweller, J.; van Merriënboer, J.J.G.; Paas, F.G.W.C. Cognitive Architecture and Instructional Design. Educ. Psychol. Rev. 1998, 10, 251–296.
19. Paas, F.; Renkl, A.; Sweller, J. Cognitive Load Theory and Instructional Design: Recent Developments. Educ. Psychol. 2003, 38, 1–4.
20. Jain, S.; Wallace, B.C. Attention Is Not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 3543–3556.
21. Wiegreffe, S.; Pinter, Y. Attention Is Not Not Explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019; pp. 11–20.

Figure 1. Training and evaluation workflow for CLT-based attention models. The figure highlights how cognitive load-informed attention constraints are incorporated into the transformer training loop. When enabled, token-level cognitive load estimates derived from entropy, uncertainty, and lexical importance are mapped to attention budgets that constrain attention distributions during the forward pass. Standard optimization, validation-based early stopping, and final evaluation are applied uniformly across all model variants.

Table 1. Experimental Configuration Summary.

Category	Value
Embedding dimension (d_model)	256
Maximum sequence length (S, max_len)	256
Transformer layers (n_layers)	2
Attention heads (n_heads)	4
Feedforward dimension	1024 (=4 × d_model)
Dropout	0.1
Pooling method	Mean pooling over sequence tokens
Classification head	Linear layer (d_model → #classes)
Optimizer	Adam
Learning rate	3 × 10⁻⁴
Weight decay	1 × 10⁻⁴
LR scheduler	Cosine decay with 5% warmup
Max epochs	10
Early stopping patience	5 (based on val loss)
Gradient clipping	Global norm 1.0
Batch size	32
Seeds (trials)	10 (42–51)
Tokenizer	Lowercased whitespace tokenizer
Vocabulary construction	Training split only
Min token frequency	2
Max vocab size	40,000
Padding token ID	0
Unknown token ID	1
IDF computation	Training split only; min-max normalized to [0, 1]

Complete experimental configuration. Unless otherwise specified, all values are identical across datasets and model variants.
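The learning-rate entries in Table 1 (peak rate 3 × 10⁻⁴, cosine decay, 5% warmup) can be sketched as a per-step schedule. The table does not specify the warmup shape or the final learning rate, so the linear warmup, the decay to zero, and the per-step (rather than per-epoch) update below are assumptions. `lr_at_step` is a hypothetical helper name.

```python
import math


def lr_at_step(step, total_steps, base_lr=3e-4, warmup_frac=0.05):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


total = 1000  # illustrative number of optimizer steps
for s in (0, 49, 525, 1000):
    print(s, lr_at_step(s, total))
```

With 1000 steps, the first 50 form the warmup phase, the rate peaks at 3 × 10⁻⁴ on step 49, passes through half of that value midway through the decay phase, and reaches zero at the final step.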

Table 2. Core Metrics for IMDB.

Method	Accuracy	Test Loss	Best Val Loss	Epochs Run	Training Time
Baseline	0.557 ± 0.047	0.689 ± 0.001	0.688 ± 0.001	1.500 ± 0.527	226.759 ± 18.431
B030-E40M40I20	0.562 ± 0.056	0.683 ± 0.019	0.683 ± 0.018	2.500 ± 2.677	181.525 ± 30.274
B030-E50M50I00	0.612 ± 0.063	0.652 ± 0.050	0.654 ± 0.049	4.800 ± 4.104	207.648 ± 45.833
B050-E40M40I20	0.550 ± 0.049	0.689 ± 0.001	0.688 ± 0.001	1.500 ± 0.527	169.219 ± 13.893
B050-E50M50I00	0.593 ± 0.058	0.668 ± 0.037	0.668 ± 0.036	4.200 ± 4.022	200.355 ± 42.362
CLT-E	0.613 ± 0.086	0.645 ± 0.070	0.644 ± 0.071	3.600 ± 3.307	200.577 ± 42.269
CLT-E-Tight	0.578 ± 0.060	0.676 ± 0.031	0.677 ± 0.029	3.200 ± 3.615	187.362 ± 40.330

Summary of average core performance metrics across attention mechanism variants on the IMDB dataset. Each method reports the mean and standard deviation of test accuracy, test loss, best validation loss, epochs executed (until early stopping), and training time across 10 trials. The comparison includes multiple CLT-based attention configurations (e.g., budget–encoding–memory–inhibition weight settings and CLT-E variants). The baseline model is used as the reference condition. Bold values indicate the best-performing method for each metric, while italic values indicate the second-best performance.

Table 3. Attention Allocation for IMDB.

Method	Entropy	Budget	Budget Max	Budget Min	SL-Correlation
Baseline	5.545 ± 0.000	0.300 ± 0.000	1.000 ± 0.000	1.000 ± 0.000	0.000 ± 0.000
B030-E40M40I20	3.291 ± 0.797	0.517 ± 0.113	0.005 ± 0.000	0.250 ± 0.143	0.988 ± 0.022
B030-E50M50I00	3.853 ± 1.292	0.663 ± 0.257	0.355 ± 0.455	0.307 ± 0.479	0.700 ± 0.483
B050-E40M40I20	4.296 ± 0.516	0.699 ± 0.075	0.005 ± 0.000	0.150 ± 0.152	0.992 ± 0.007
B050-E50M50I00	4.413 ± 0.744	0.767 ± 0.153	0.327 ± 0.391	0.215 ± 0.414	0.799 ± 0.421
CLT-E	4.375 ± 0.891	0.811 ± 0.071	0.263 ± 0.300	0.053 ± 0.115	0.898 ± 0.316
CLT-E-Tight	4.751 ± 0.638	0.880 ± 0.061	0.260 ± 0.297	0.026 ± 0.035	0.899 ± 0.316

Summary of attention allocation and cognitive-load alignment metrics across attention mechanism variants on the IMDB dataset. Each method reports the mean and standard deviation of attention entropy, mean attention budget utilization, budget saturation indicators (fraction of time at maximum and minimum budget), and a Spearman-like correlation between cognitive load estimates and allocated attention mass, aggregated across 10 trials. These allocation metrics characterize how each CLT variant distributes attention under budget constraints, relative to the baseline model.

Table 4. Core Metrics for AG News.

Method	Accuracy	Test Loss	Best Val Loss	Epochs Run	Training Time
Baseline	0.389 ± 0.165	1.238 ± 0.191	1.238 ± 0.191	1.400 ± 0.966	519.999 ± 78.684
B030-E40M40I20	0.422 ± 0.063	1.127 ± 0.111	1.126 ± 0.112	5.700 ± 4.138	965.490 ± 196.826
B030-E50M50I00	0.464 ± 0.020	1.038 ± 0.085	1.039 ± 0.086	6.700 ± 4.296	1000.132 ± 201.619
B050-E40M40I20	0.395 ± 0.083	1.198 ± 0.135	1.199 ± 0.134	5.800 ± 4.442	953.044 ± 209.982
B050-E50M50I00	0.425 ± 0.065	1.122 ± 0.121	1.122 ± 0.120	3.200 ± 3.645	826.733 ± 186.004
CLT-E	0.506 ± 0.063	1.041 ± 0.061	1.042 ± 0.063	3.000 ± 3.712	811.980 ± 193.104
CLT-E-Tight	0.441 ± 0.085	1.146 ± 0.135	1.145 ± 0.136	3.000 ± 3.712	829.317 ± 192.997

Summary of average core performance metrics across attention mechanism variants on the AG News dataset. Each method reports the mean and standard deviation of test accuracy, test loss, best validation loss, epochs executed (until early stopping), and training time across 10 trials. The comparison includes multiple CLT-based attention configurations (e.g., budget–encoding–memory–inhibition weight settings and CLT-E variants). The baseline model is used as the reference condition. Bold values indicate the best-performing method for each metric, while italic values indicate the second-best performance.

Table 5. Attention Allocation for AG News.

Method	Entropy	Budget	Budget Max	Budget Min	SL-Correlation
Baseline	5.545 ± 0.000	0.300 ± 0.000	1.000 ± 0.000	1.000 ± 0.000	0.000 ± 0.000
B030-E40M40I20	2.791 ± 0.557	0.413 ± 0.072	0.005 ± 0.000	0.522 ± 0.362	0.941 ± 0.063
B030-E50M50I00	3.552 ± 0.602	0.611 ± 0.100	0.067 ± 0.042	0.220 ± 0.285	0.891 ± 0.313
B050-E40M40I20	3.651 ± 0.395	0.582 ± 0.054	0.005 ± 0.003	0.490 ± 0.396	0.943 ± 0.065
B050-E50M50I00	4.289 ± 0.780	0.793 ± 0.133	0.261 ± 0.360	0.120 ± 0.257	0.803 ± 0.405
CLT-E	3.990 ± 0.506	0.691 ± 0.101	0.105 ± 0.261	0.052 ± 0.069	0.996 ± 0.009
CLT-E-Tight	4.464 ± 0.537	0.777 ± 0.111	0.192 ± 0.374	0.075 ± 0.140	0.975 ± 0.060

Summary of attention allocation and cognitive-load alignment metrics across attention mechanism variants on the AG News dataset. Each method reports the mean and standard deviation of attention entropy, mean attention budget utilization, budget saturation indicators (fraction of time at maximum and minimum budget), and a Spearman-like correlation between cognitive load estimates and allocated attention mass, aggregated across 10 trials. These allocation metrics characterize how each CLT variant distributes attention under budget constraints, relative to the baseline model.

Table 6. Core Metrics for SST-2.

Method	Accuracy	Test Loss	Best Val Loss	Epochs Run	Training Time
Baseline	0.508 ± 0.007	0.695 ± 0.001	0.695 ± 0.001	1.900 ± 0.994	341.865 ± 49.377
B030-E40M40I20	0.527 ± 0.044	0.688 ± 0.016	0.688 ± 0.016	1.900 ± 0.876	507.641 ± 64.395
B030-E50M50I00	0.533 ± 0.055	0.684 ± 0.023	0.684 ± 0.023	2.000 ± 1.054	515.586 ± 77.773
B050-E40M40I20	0.508 ± 0.006	0.695 ± 0.001	0.695 ± 0.001	1.900 ± 0.994	507.878 ± 72.652
B050-E50M50I00	0.521 ± 0.038	0.691 ± 0.012	0.691 ± 0.012	2.000 ± 1.414	517.756 ± 104.309
CLT-E	0.580 ± 0.075	0.663 ± 0.037	0.663 ± 0.037	2.600 ± 1.955	537.484 ± 77.764
CLT-E-Tight	0.546 ± 0.061	0.680 ± 0.025	0.680 ± 0.025	2.400 ± 2.757	508.804 ± 94.483

Summary of average core performance metrics across attention mechanism variants on the SST-2 dataset. Each method reports the mean and standard deviation of test accuracy, test loss, best validation loss (checkpoint criterion), the epoch index at which the minimum validation loss occurs (Epochs Run), and total training time across 10 independent trials. The baseline transformer is used as the reference condition. Bold values indicate the best-performing method for each metric, while italic values indicate the second-best performance.

Table 7. Attention Allocation for SST-2.

Method	Entropy	Budget	Budget Max	Budget Min	SL-Correlation
Baseline	5.545 ± 0.000	0.300 ± 0.000	1.000 ± 0.000	1.000 ± 0.000	0.000 ± 0.000
B030-E40M40I20	2.285 ± 0.299	0.344 ± 0.042	0.004 ± 0.000	0.808 ± 0.295	0.990 ± 0.030
B030-E50M50I00	2.635 ± 0.877	0.556 ± 0.189	0.461 ± 0.369	0.455 ± 0.383	0.397 ± 0.513
B050-E40M40I20	3.230 ± 0.072	0.521 ± 0.011	0.004 ± 0.000	0.840 ± 0.280	1.000 ± 0.001
B050-E50M50I00	3.700 ± 0.661	0.714 ± 0.140	0.412 ± 0.424	0.331 ± 0.401	0.498 ± 0.525
CLT-E	3.697 ± 1.326	0.718 ± 0.186	0.403 ± 0.374	0.240 ± 0.304	0.672 ± 0.466
CLT-E-Tight	3.803 ± 0.602	0.713 ± 0.079	0.276 ± 0.325	0.293 ± 0.340	0.594 ± 0.512

Summary of attention allocation and cognitive-load alignment metrics across attention mechanism variants on the SST-2 dataset. Reported diagnostics include mean attention entropy, mean budget utilization, budget saturation indicators (fraction of time spent at maximum and minimum budget levels), and a Spearman-like correlation between cognitive-load estimates and allocated attention mass, aggregated across 10 trials. These diagnostics characterize how each CLT variant modulates attention relative to the baseline.

Table 8. Core Metrics for DBpedia.

Method	Accuracy	Test Loss	Best Val Loss	Epochs Run	Training Time
Baseline	0.131 ± 0.012	2.460 ± 0.067	2.460 ± 0.067	2.900 ± 2.183	3095.749 ± 613.678
B030-E40M40I20	0.266 ± 0.069	1.928 ± 0.097	1.929 ± 0.097	7.500 ± 3.894	5202.697 ± 953.352
B030-E50M50I00	0.261 ± 0.040	1.882 ± 0.073	1.883 ± 0.072	7.200 ± 3.521	5268.715 ± 797.974
B050-E40M40I20	0.252 ± 0.068	1.990 ± 0.203	1.991 ± 0.203	5.200 ± 4.185	4693.188 ± 967.765
B050-E50M50I00	0.241 ± 0.069	1.945 ± 0.178	1.945 ± 0.178	8.000 ± 3.496	5319.781 ± 853.481
CLT-E	0.246 ± 0.084	1.943 ± 0.225	1.943 ± 0.225	7.700 ± 3.773	5261.385 ± 844.108
CLT-E-Tight	0.216 ± 0.060	1.928 ± 0.088	1.928 ± 0.088	9.100 ± 1.449	5721.016 ± 6.764

Summary of average core performance metrics across attention mechanism variants on the DBpedia dataset. Each method reports the mean and standard deviation of test accuracy, test loss, best validation loss, the selected checkpoint’s epoch index (until early stopping), and total training time across 10 trials. The comparison includes the baseline transformer and multiple CLT-based attention configurations (budget–encoding–memory–inhibition weight settings, plus CLT-E variants). The baseline model is used as the reference condition. Bold values indicate the best-performing method for each metric, while italic values indicate the second-best performance.

Table 9. Attention Allocation for DBpedia.

Method	Entropy	Budget	Budget Max	Budget Min	SL-Correlation
Baseline	5.545 ± 0.000	0.300 ± 0.000	1.000 ± 0.000	1.000 ± 0.000	0.000 ± 0.000
B030-E40M40I20	3.531 ± 0.203	0.532 ± 0.038	0.005 ± 0.000	0.134 ± 0.119	0.924 ± 0.019
B030-E50M50I00	3.912 ± 0.203	0.675 ± 0.040	0.052 ± 0.032	0.032 ± 0.021	0.999 ± 0.000
B050-E40M40I20	4.068 ± 0.313	0.656 ± 0.053	0.006 ± 0.002	0.191 ± 0.247	0.939 ± 0.026
B050-E50M50I00	4.243 ± 0.410	0.732 ± 0.085	0.157 ± 0.300	0.157 ± 0.298	0.899 ± 0.316
CLT-E	3.673 ± 0.583	0.668 ± 0.018	0.093 ± 0.171	0.084 ± 0.131	0.899 ± 0.316
CLT-E-Tight	4.348 ± 0.103	0.753 ± 0.021	0.070 ± 0.044	0.109 ± 0.104	0.997 ± 0.005

Summary of attention allocation diagnostics computed during evaluation on the DBpedia dataset. Reported values are the mean and standard deviation across 10 trials for: mean attention entropy, mean budget utilization, budget saturation indicators (fraction of time at maximum and minimum budget), and a Spearman-like correlation between cognitive-load estimates and allocated attention mass. These diagnostics characterize how each CLT variant redistributes attention under budget constraints relative to the baseline model.
