1. Introduction
The field of machine learning has been fundamentally reshaped by the emergence of Large Language Models (LLMs) [1], which have demonstrated unprecedented capabilities across a vast spectrum of tasks, from fluent text generation and translation to complex question answering and software development [2,3,4]. The architectural linchpin of these models is the transformer [5,6], specifically its self-attention mechanism. For autoregressive generation, which forms the basis of models like GPT-4 and LLaMA, this mechanism is adapted into what is known as causal attention. This is achieved by applying a static triangular mask to the attention matrix, thereby enforcing a strict and inviolable form of informational asymmetry: at any given step, a token can only attend to itself and to all preceding tokens in the sequence.
This principle of static unidirectional causality has been astonishingly effective, serving as the foundational design choice that enables models to learn sequential dependencies and generate coherent text one token at a time. The unidirectional causality it imposes—information flows only from the past to the future—is the very engine of autoregressive modeling, creating a fundamental hard-coded asymmetry in information flow. However, while simple and computationally efficient, the rigidity of this approach represents a significant and often-overlooked conceptual limitation: it implicitly assumes that all valid reasoning and dependency structures are strictly temporal and linear. This assumption is not inherent to the model’s parameters but is hard-coded into the architecture by the causal mask itself. The mask allows a token to attend only to a contiguous, unbroken sequence of its predecessors, treating the context as a simple timeline. This approach falters when confronted with the intricate, nonlinear nature of human language and logic, in which dependencies are based on semantics, syntax, and context rather than mere proximity. Consider the following examples where this linear, temporal assumption breaks down:
Logical Reasoning: In an argument such as “Premise 1: All humans are mortal. Premise 2: Socrates is a human”, the conclusion depends directly and strongly on Premise 1 and Premise 2, regardless of how many sentences might separate them. The dependency graph is semantic, not sequential.
Anaphoric Reference in Discourse: In a text such as “The main server, which handles all primary requests, began to overheat. After several restarts, it finally stabilized”, the pronoun “it” has a powerful non-local dependency on its antecedent “The main server”, skipping over the entire intermediate clause.
Code Generation: When completing a function, a line such as return MAX_RETRIES * backoff_factor; may depend on constants or global variables that are defined many lines, or even many files, away.
In all of these cases, the standard attention mechanism is forced to learn these complex nonlinear dependency arcs through the indirect and inefficient means of its value and output projections. It lacks a direct mechanism to express that “token A is more important than token B, even though B is closer.” This highlights the need for a more flexible mechanism that can learn and apply a content-aware, dynamic, and nonlinear information flow policy. Our work introduces such a mechanism. Recent work has highlighted that LLMs do not leverage even their allowed context uniformly, often struggling with information located in the middle of long sequences [7]. This suggests that the simple “all-past” attention policy is suboptimal. These challenges become particularly acute in tasks requiring multi-step reasoning, where the model must synthesize and dynamically prioritize information from a complex context in order to build a coherent line of thought [8,9].
In this paper, we posit that the future of capable and transparent LLMs lies in moving beyond this static conception of causality. We introduce Dynamic Asymmetric Attention (DAA), a novel mechanism designed to augment the standard attention block. Rather than replacing the causal mask entirely, which remains essential for autoregressive stability, DAA works in concert with it to introduce a learnable, context-sensitive guidance system. Specifically, DAA employs a lightweight and parameter-efficient neural network—typically a small Multi-Layer Perceptron (MLP)—that takes a query vector ($q_i$) and a key vector ($k_j$) as input, then outputs a single scalar value. This scalar is added as a bias to the pre-softmax attention score for that specific query–key pair. This mechanism facilitates a form of content-aware attention modulation: a positive bias dynamically strengthens the informational pathway from a specific past token, while a negative bias weakens it. In essence, the model learns a sophisticated, task-specific policy for asymmetric information flow, deciding not just whether it can see the past, but how intensely it should focus on each part of it. Our empirical results robustly validate this approach. By integrating DAA into LLMs, we not only reduce perplexity on standard language modeling benchmarks but also achieve significant accuracy gains of +3.1% on the HumanEval code generation task and +4.5% on the GSM8k mathematical reasoning benchmark, surpassing the baseline model with a negligible increase in parameters. Crucially, this learned asymmetry provides an unprecedented window into the model’s inner workings. Visualizing the DAA biases elucidates the emergent directed acyclic graphs of the model’s reasoning, revealing precisely how it prioritizes premises, follows logical steps, and resolves dependencies on the fly. This transition from a static, hard-coded rule to a learned, dynamic policy thus unlocks both superior performance and a profound new level of model interpretability.
Our work makes the following key contributions:
We propose Dynamic Asymmetric Attention (DAA), a new attention mechanism that learns a soft and context-dependent asymmetric bias, offering a more flexible and powerful alternative to the static causal mask.
We demonstrate empirically that integrating DAA into LLMs leads to significant performance improvements on both foundational language modeling benchmarks and complex reasoning-intensive tasks such as code generation (HumanEval) and mathematical problem-solving (GSM8k).
We introduce a novel approach to model interpretability by visualizing the learned DAA biases. These visualizations reveal the implicit “information flow graphs” that the model constructs during inference, providing unprecedented insight into internal reasoning processes such as how the model follows a chain of thought or resolves dependencies.
The rest of this paper is structured as follows: Section 2 discusses related work; Section 3 presents the design and implementation of the DAA mechanism; Section 4 describes the experimental setup; Section 5 presents our experiments and subsequent analysis; Section 6 examines the impact of DAA on fine-tuning; Section 7 extends the applicability of DAA to bidirectional models; finally, Section 8 concludes the paper.
2. Related Work
Our research on DAA intersects with three primary areas of active investigation in large language models: the architectural evolution of attention mechanisms, techniques for improving complex reasoning, and methods for model interpretability. We review each in turn.
2.1. Evolution of Attention Mechanisms
The self-attention mechanism [5] remains the cornerstone of modern LLMs. Its core formulation, however, presents a quadratic complexity in sequence length ($O(n^2)$), making it computationally prohibitive for very long contexts. A significant body of research has consequently focused on improving its efficiency. Methods such as Sparse Transformers [10], Longformer [11], and Big Bird [12] have introduced sparse attention patterns (e.g., combining local, global, and random attention) to achieve near-linear complexity. While these methods successfully extend the context length, their primary focus is on computational efficiency rather than enhancing the expressive power of the attention mechanism itself. Their imposed patterns are largely static and data-agnostic.
Another line of work modifies the core attention calculation to better incorporate positional information or other inductive biases. For example, Rotary Positional Embedding (RoPE) [13] cleverly rotates query and key vectors based on their absolute position, injecting relative positional information directly into the attention scores. A particularly relevant precursor is Attention with Linear Biases (ALiBi) [14], which adds a static position-dependent bias to attention scores, eliminating the need for explicit positional embeddings. The bias in ALiBi is a simple linear penalty based on the distance between the query and key [15]. This establishes the principle that biasing attention can be effective. However, DAA fundamentally differs from ALiBi: while ALiBi’s bias is static and purely a function of token distance, DAA’s bias is dynamic and content-aware, learned as a function of the query and key vectors themselves. This allows DAA to model dependencies that are semantic and logical, not merely positional [16,17].
2.2. Enhancing Reasoning in Large Language Models
A key frontier for LLMs is improving their capacity for complex multi-step reasoning. The dominant paradigm for this has been advanced prompting techniques [18]. The seminal work on Chain-of-Thought (CoT) prompting [8,19] showed that instructing a model to “think step-by-step” dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks. This has been extended by methods such as Self-Consistency [20,21,22], which samples multiple reasoning paths and selects the most consistent answer, and Tree of Thoughts (ToT) [9], which allows models to deliberately explore and self-evaluate different branches in a thought process.
While these methods are powerful, they operate at the prompting or inference level, treating the underlying model architecture as a fixed black box [23]. They guide the model’s “generation” process without altering its internal “information processing” mechanism. Our work proposes a complementary and architectural approach. DAA endows the model with an intrinsic ability to learn directed information flow, which we hypothesize creates a stronger inductive bias for logical and causal reasoning. An architecture that can internally learn to prioritize premises and connect logical steps should be more sample-efficient and robust, potentially amplifying the benefits of advanced prompting strategies such as CoT.
2.3. Interpretability of Attention and Model Decisions
As LLMs become more powerful, understanding their decision-making process becomes more critical [24]. The role of attention in interpretability has been the subject of extensive debate. While initially appealing, the notion that “attention is explanation” has been challenged by studies showing that attention weights can be uncorrelated with feature importance measures [25]. Subsequent work has provided a more nuanced view, suggesting that while not a direct explanation, attention patterns can still be a useful signal when analyzed carefully [26,27]. To move beyond raw attention, more sophisticated interpretability methods have been developed [28]. Probing techniques analyze a model’s internal representations to determine whether they encode specific linguistic or factual knowledge [29]. More recently, causal mediation analysis techniques such as locating and editing factual associations [30,31] have offered powerful ways to trace a specific model behavior (e.g., recalling a fact) back to the specific model components responsible for it. These methods are highly insightful but are often post hoc, computationally intensive, and focused on explaining localized and specific model outputs.
DAA offers a new built-in modality for interpretability. The learned attention bias is not a noisy byproduct of the attention calculation but a direct and trained signal representing the model’s learned policy for information flow. By design, it captures the directional importance that the model assigns between tokens. Visualizing these DAA biases provides a more direct and global view of the model’s internal “reasoning graph” than raw attention weights, offering a scalable and intuitive way to observe how the model constructs its arguments and dependencies without requiring complex post hoc analysis.
3. Methodology
This section provides a formal definition of DAA, details its architectural rationale, and analyzes its properties and integration within a standard transformer block.
3.1. From Static to Dynamic Asymmetry: Core Concept
The foundation of autoregressive LLMs is the causal self-attention mechanism, where the attention weights $A$ are computed as

$$A = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right),$$

where $Q, K \in \mathbb{R}^{n \times d_k}$ are the query and key matrices and $M$ is a binary mask that assigns $-\infty$ to all elements $M_{ij}$ with $j > i$ and assigns 0 otherwise. This enforces a hard, uniform constraint on information flow.
We propose to augment this formulation with a learned, dynamic bias. The core idea is to introduce a bias matrix $B^{\mathrm{DAA}}$ that is a function of the queries and keys themselves. The new attention score matrix $S$ is computed as

$$S = \frac{QK^{\top}}{\sqrt{d_k}} + M + B^{\mathrm{DAA}}.$$

The key innovation lies in the computation of $B^{\mathrm{DAA}}$, where each element $B^{\mathrm{DAA}}_{ij}$ is generated by a dedicated network, $g_{\theta}$:

$$B^{\mathrm{DAA}}_{ij} = g_{\theta}(q_i, k_j).$$

This formulation transforms the attention mechanism. The model is no longer limited to asking “Can I attend to this token?” but can now learn “Given my current query and this past token’s key, how much should I amplify or suppress the information from it?”
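To make this concrete, the following minimal PyTorch sketch (our illustration rather than the authors’ released implementation; tensor shapes and the guidance_net argument are assumptions) adds the learned bias to the masked pre-softmax scores for a single head:

import math
import torch

def daa_attention_scores(Q, K, guidance_net):
    """Compute pre-softmax scores S = QK^T/sqrt(d_k) + M + B_DAA for one head.

    Q, K: (n, d_k) query/key matrices; guidance_net is any module mapping a
    concatenated [q_i; k_j] vector of size 2*d_k to a single scalar bias.
    """
    n, d_k = Q.shape
    scores = Q @ K.T / math.sqrt(d_k)                      # standard dot-product scores

    # Causal mask M: -inf strictly above the diagonal (future tokens), 0 elsewhere.
    M = torch.full((n, n), float("-inf")).triu(diagonal=1)

    # Dynamic asymmetric bias B_DAA: one scalar per (query, key) pair.
    q_exp = Q.unsqueeze(1).expand(n, n, d_k)               # q_i broadcast over j
    k_exp = K.unsqueeze(0).expand(n, n, d_k)               # k_j broadcast over i
    B = guidance_net(torch.cat([q_exp, k_exp], dim=-1)).squeeze(-1)  # (n, n)

    return scores + M + B                                  # softmax is applied afterwards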
3.2. The Guidance Network: Architecture and Rationale
The Guidance Network, denoted , is the engine of DAA. Its design is guided by two principles: expressive power and computational efficiency.
3.2.1. Input Representation and Network Architecture
To model the directional relationship from a key at position $j$ to a query at position $i$, the Guidance Network must process both vectors. We first concatenate them to form the input $z_{ij} = [q_i; k_j] \in \mathbb{R}^{2d_k}$, where $d_k$ is the key/query dimension. Concatenation is a general-purpose choice that allows a subsequent neural network to learn arbitrary interactions between the two vectors. We implement the Guidance Network as a shallow but effective Multi-Layer Perceptron (MLP) with one hidden layer. Its design is intended to be computationally efficient while also being sufficiently expressive to learn complex nonlinear relationships. The forward pass of the network for a single query–key pair is defined by the following sequence of operations:
Input Layer: Takes the concatenated vector $z_{ij} = [q_i; k_j]$ as input.
Hidden Layer: Consists of three sequential operations:
- (a)
A Linear projection from $\mathbb{R}^{2d_k}$ to the hidden dimension $d_h$.
- (b)
Layer Normalization applied to the hidden representation for training stability.
- (c)
A GELU (Gaussian Error Linear Unit) nonlinear activation function.
Output Layer: A final Linear layer that projects the activated hidden representation from $\mathbb{R}^{d_h}$ down to a single scalar value.
This sequence can be expressed formally as

$$g_{\theta}(q_i, k_j) = W_2\,\mathrm{GELU}\!\big(\mathrm{LayerNorm}(W_1 [q_i; k_j] + b_1)\big) + b_2.$$
The key architectural parameters for our implementation are summarized in
Table 1.
The choice of a single hidden layer with a small dimension ($d_h = 64$) ensures that the network remains parameter-efficient and computationally lightweight, which is a core design goal of the DAA mechanism.
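A minimal PyTorch sketch of such a Guidance Network, mirroring the Linear, LayerNorm, GELU, Linear sequence described above (class and argument names are ours, not the authors’ code):

import torch.nn as nn

class GuidanceNetwork(nn.Module):
    """Shallow MLP g_theta: concatenated [q_i; k_j] -> scalar attention bias."""

    def __init__(self, d_k: int, d_hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d_k, d_hidden),  # project concatenated pair to hidden dim
            nn.LayerNorm(d_hidden),        # stabilize training
            nn.GELU(),                     # nonlinear activation
            nn.Linear(d_hidden, 1),        # single scalar bias per (q, k) pair
        )

    def forward(self, qk_pair):
        # qk_pair: (..., 2 * d_k); returns (..., 1)
        return self.net(qk_pair)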
3.2.2. Rationale for the MLP Architecture and Alternatives Considered
The choice of a shallow Multi-Layer Perceptron (MLP) for the Guidance Network is a deliberate decision aimed at balancing expressivity with efficiency. While an MLP is a strong general-purpose choice, it is instructive to consider alternative architectures in order to understand the design tradeoffs.
Bilinear Forms: A potential alternative is a bilinear interaction, such as $q_i^{\top} W_b\, k_j$, where $W_b \in \mathbb{R}^{d_k \times d_k}$ is a learnable matrix. This is more expressive than the standard dot product, as $W_b$ can learn weighted interactions between different dimensions of the query and key. However, this form is fundamentally limited to learning second-order multiplicative interactions. It lacks the capacity to model more complex, conditional, or additive logic that can be captured by the nonlinear activation function in an MLP. Furthermore, its parameter cost of $O(d_k^2)$ can be substantial.
Attention-Based Gating: One could envision a more complex “meta-attention” mechanism in which, for instance, the query vector “attends” to the key vector to produce a scalar gate. While theoretically powerful, this approach would introduce a prohibitive level of computational complexity. It would involve executing an attention-like calculation for each of the $n^2$ query–key pairs in the main attention matrix, defeating the purpose of designing a lightweight architectural enhancement.
Simpler Interactions: The simplest interaction is the standard dot product, $q_i \cdot k_j$. Using this to model the bias would be redundant, as it merely replicates the core computation already present in the standard attention mechanism. It provides no capacity to learn a new task-specific biasing policy that is decoupled from simple semantic alignment.
MLP as a Balanced Choice:
Given these alternatives, a shallow MLP emerges as the most practical and effective solution. As a universal function approximator, an MLP with even a single hidden layer and a nonlinear activation function (such as GELU) can, in principle, learn any continuous function mapping the query–key pair to a bias score. This provides the flexibility to learn the complex nonlinear dependencies required for sophisticated reasoning without the prohibitive computational cost of a nested attention mechanism or the representational limitations of a purely bilinear form. Thus, it represents a “sweet spot” in the tradeoff between expressive power and computational/parametric cost.
3.2.3. Integration Within the Transformer Block
DAA is designed as a drop-in enhancement to the Multi-Head Self-Attention (MHSA) block.
Synergy with Causal Mask: Crucially, DAA augments rather than replaces the causal mask. The causal mask provides the non-negotiable temporal boundary essential for stable autoregressive training. DAA operates within this allowed space, providing fine-grained and content-aware modulation.
Multi-Head Strategy: To balance expressive power and parameter efficiency, we employ a parameter sharing strategy. A single Guidance Network is instantiated per DAA-enabled layer, and its parameters are shared across all attention heads in that layer. The same bias matrix is added to the pre-softmax scores of each head. This encourages the layer to learn a more general and transferable policy for information routing while drastically reducing the parameter footprint compared to instantiating a separate network per head.
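As a brief illustration of this sharing (shapes and names are assumptions), the single per-layer bias matrix is simply broadcast over the head dimension before the softmax:

import torch

def add_shared_daa_bias(scores: torch.Tensor, causal_mask: torch.Tensor,
                        daa_bias: torch.Tensor) -> torch.Tensor:
    """Broadcast one shared (n, n) DAA bias over per-head (H, n, n) scores."""
    # causal_mask: (n, n) with -inf above the diagonal; daa_bias: (n, n).
    return (scores + causal_mask + daa_bias.unsqueeze(0)).softmax(dim=-1)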
3.2.4. A Shared Multi-Head Strategy: Rationale, Expressivity, and Inductive Bias
A critical design decision in integrating DAA with the Multi-Head Self-Attention (MHSA) block is whether the Guidance Network ($g_{\theta}$) should have its parameters shared across all attention heads within a layer or whether each head $h$ should learn its own specific network, i.e., $g_{\theta_h}$. We opted for a shared strategy, a choice that carefully trades raw expressive power for a more beneficial inductive bias, improved parameter efficiency, and enhanced generalization.
Head-Specific vs. Shared Networks: A Tradeoff
A head-specific architecture in which each attention head has its own dedicated Guidance Network would theoretically offer maximum expressivity. Each head could learn a completely unique and specialized policy for information routing. For instance, one head might specialize in tracking syntactic dependencies, while another could learn to identify semantic relationships between entities.
However, this approach comes with significant drawbacks:
Parameter Inefficiency: Instantiating a separate MLP for each head in every DAA-enabled layer would lead to a substantial increase in model parameters, negating the “lightweight” advantage of our approach.
Risk of Overfitting: The high number of parameters and immense flexibility could encourage individual heads to learn spurious context-specific patterns from the training data rather than generalizable reasoning policies. Recent research has already shown that many heads in a standard transformer can be redundant.
Knowledge Fragmentation: Learning dozens of independent policies could lead to a fragmented and potentially contradictory understanding of information flow within the model, making the overall reasoning process less coherent and robust.
The Inductive Bias of a Shared Guidance Network
Our choice of a shared Guidance Network deliberately introduces a strong, and as we argue, beneficial inductive bias. An inductive bias is the set of assumptions a model uses to generalize from training data to unseen inputs. The core hypothesis behind our choice is that within a given layer of a transformer, the fundamental principles of logical and semantic information flow are largely universal and transferable across the different representational subspaces captured by the attention heads.
By forcing all heads in a layer to use the same function to calculate the attention bias, we compel the model to discover a single general policy for information routing. This policy must be effective across all heads, meaning that it cannot rely on the idiosyncratic properties of a single head’s query/key space. Instead, it must learn more fundamental patterns. For example, a learned rule such as “amplify the informational pathway from numerical tokens when a mathematical operator is part of the query” is a general principle of arithmetic reasoning that should be applied universally rather than as a specialized skill of a single head.
This inductive bias provides several key advantages:
Promotes Generalization: By learning a more abstract and transferable policy, the model is less likely to overfit and more likely to generalize to unseen problems and reasoning chains. The strong zero-shot performance of our DAA-augmented models on benchmarks such as GSM8k and HumanEval provides empirical support for this claim. The model learns a robust method for reasoning, not just memorized patterns.
Improves Parameter and Data Efficiency: Sharing parameters is highly efficient, keeping the DAA module’s overhead minimal. As shown in our ablation studies, DAA already achieves significant performance gains with a modest parameter budget (8.4 M for LLaMA-2-7B). This efficiency also implies better data efficiency, as a single shared network has a more concentrated learning signal from all heads, rather than dividing the learning signal among many separate networks.
Enhances Coherence and Interpretability: A single unified policy per layer offers a clearer window into the model’s reasoning strategy. Instead of trying to decipher dozens of potentially conflicting head-specific graphs, we can visualize one coherent “information flow graph”, making the model’s decision-making process more transparent and easier to analyze.
In summary, the decision to share the Guidance Network is a deliberate architectural choice that trades the brute-force expressivity of head-specific networks for a powerful inductive bias. This bias guides the model to learn information routing policies that are more general, robust, and efficient, which we believe is a key driver of the enhanced reasoning capabilities we demonstrate.
3.2.5. The Nature of the Causal Constraint: Hard and Uniform
The statement that the binary causal mask enforces a “hard and uniform constraint” warrants a more detailed explanation, as these two properties are the primary limitations that DAA aims to address.
The term “hard” refers to the absolute and non-negotiable nature of the rule imposed by the mask. For any given query token, a connection to a future key token is not merely discouraged or penalized; it is programmatically forbidden.
Binary Logic: The information flow is governed by a strict binary switch. A token is either 100% visible (if it is in the past) or 0% visible (if it is in the future). There is no mechanism for partial or attenuated visibility.
Mechanism: This is achieved by setting the attention scores of masked tokens to $-\infty$. After the softmax operation, the attention weight assigned to these tokens becomes exactly zero.
This can be contrasted with a “soft” constraint, which might apply a learnable penalty to certain connections but would not necessarily sever them completely. The causal mask’s hard constraint removes all architectural flexibility for information to flow from the future.
The term “uniform” refers to the fact that the mask grants the same permission to attend to all allowed past tokens regardless of their content or relative importance.
Undifferentiated Access: For a query predicting the next token in the sentence "The cat sat on the mat", the mask allows the model to look back at all preceding words. However, the permission to attend to the crucial noun "cat" is identical to the permission to attend to the less informative determiner "The".
Content-Agnosticism: The mask itself does not provide any guidance about which past tokens are more semantically or logically relevant. The burden of differentiating between important premises and irrelevant filler words falls entirely on the query–key similarity computation ($q_i \cdot k_j$).
This uniform access policy is suboptimal; in reality, dependencies are non-uniform, with certain tokens being vastly more important than others. The inability of the mask to provide any content-aware guidance is a core inefficiency that DAA directly addresses by learning a non-uniform dynamic bias.
3.3. End-to-End Learning
The parameters $\theta$ of the Guidance Network are not trained with a separate objective; they are learned end-to-end, jointly with all other model parameters (e.g., $W_Q$, $W_K$, $W_V$, $W_O$, and FFN weights), by minimizing the standard cross-entropy loss for next-token prediction. The gradients flow from the loss function through the softmax and the DAA bias term directly back to $g_{\theta}$.
3.4. Initialization for Stable Training
To ensure training stability, particularly at the outset, we employ a specific and principled initialization strategy for $g_{\theta}$. While the weights and biases of the hidden layer ($W_1, b_1$) are initialized using a standard method such as Kaiming initialization, the parameters of the final linear output layer ($W_2, b_2$) are initialized to zero.
This is a critical design choice, not a minor detail. Its purpose is to prevent the destabilization of the large pretrained base model at the beginning of the continued training phase. A standard random initialization would cause the DAA module to output a random unstructured bias matrix, injecting significant noise into the attention mechanism from the very first forward pass. This would lead to a large initial loss spike and potentially corrupt the valuable pre-existing weights of the base LLM.
By zero-initializing the output layer, we ensure that at step 0, the bias matrix is a zero matrix. Consequently, the DAA-augmented model is functionally identical to the original model, preserving its initial performance and stability. This creates a smooth learning trajectory in which the DAA module only begins to influence the attention scores as the backpropagated gradients start to indicate that doing so is beneficial for minimizing the loss. This strategy allows the model to gracefully “fade in” the new architectural component, which is essential for successfully augmenting pretrained models. While we did not include a formal ablation on this point, our preliminary experiments confirmed that failing to use this zero-initialization resulted in unstable training and worse final performance. The overall DAA forward pass is summarized in Algorithm 1.
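A sketch of this initialization, assuming the GuidanceNetwork sketch given earlier (the use of an nn.Sequential attribute named net is an assumption of that sketch, not a detail from the paper):

import torch.nn as nn

def zero_init_output(guidance_net: nn.Module) -> None:
    """Zero the final linear layer so B_DAA is exactly 0 at step 0."""
    out_layer = guidance_net.net[-1]          # final nn.Linear(d_hidden, 1)
    nn.init.zeros_(out_layer.weight)
    nn.init.zeros_(out_layer.bias)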
The derivation of the time and space complexity for the DAA-augmented MHSA layer, as detailed in Algorithm 1, is provided below.
Algorithm 1 Detailed Forward Pass of a DAA-Augmented MHSA Layer
1: Input: Input embeddings $X$, layer parameters $\{W_Q^h, W_K^h, W_V^h\}_{h=1}^{H}, W_O$, shared Guidance Network $g_{\theta}$.
2: Output: Layer output $Y$.
3: // Compute shared Dynamic Asymmetric Bias
4: Project $X$ to obtain reference queries $\tilde{Q}$ and keys $\tilde{K}$ for the bias calculation.
5: Initialize $B^{\mathrm{DAA}} \leftarrow \mathbf{0}^{n \times n}$.
6: for $i = 1$ to $n$ do
7:  for $j = 1$ to $i$ do
8:   $B^{\mathrm{DAA}}_{ij} \leftarrow g_{\theta}(\tilde{q}_i, \tilde{k}_j)$ ▹ Content-aware bias
9:  end for
10: end for
11: // Multi-Head Attention Calculation
12: Initialize the list of head outputs $\mathcal{O} \leftarrow [\,]$.
13: for head $h = 1$ to $H$ do
14:  $Q_h \leftarrow X W_Q^h$; $K_h \leftarrow X W_K^h$; $V_h \leftarrow X W_V^h$
15:  $S_h \leftarrow Q_h K_h^{\top} / \sqrt{d_k} + M$
16:  $S_h \leftarrow S_h + B^{\mathrm{DAA}}$ ▹ Add shared bias
17:  $A_h \leftarrow \mathrm{softmax}(S_h)$
18:  $O_h \leftarrow A_h V_h$
19:  Append $O_h$ to $\mathcal{O}$.
20: end for
21: // Concatenate and project
22: $O \leftarrow \mathrm{Concat}(O_1, \ldots, O_H)$
23: $Y \leftarrow O W_O$
24: return $Y$
Time Complexity Derivation: The algorithm’s time complexity is primarily determined by two key computational steps. First, the calculation of the DAA bias matrix ($B^{\mathrm{DAA}}$) involves nested loops with a complexity of $O(n^2 \cdot d_k \cdot d_h)$, where $n$ is the sequence length, $d_k$ is the key/query dimension, and $d_h$ is the hidden dimension of the guidance network. Second, the core calculation within the Multi-Head Self-Attention (MHSA), i.e., the $QK^{\top}$ matrix multiplication, has a complexity of $O(n^2 \cdot d_k)$ per head. As $H \cdot d_k = d_{\mathrm{model}}$, the total complexity for the MHSA is $O(n^2 \cdot d_{\mathrm{model}})$. Because $d_h$ is designed to be a small constant, the introduction of DAA does not change the asymptotic time complexity of the attention layer, which remains quadratic in the sequence length.
Space Complexity Derivation: The space complexity is dominated by the largest intermediate matrices that must be stored during the forward pass. The DAA bias matrix ($B^{\mathrm{DAA}}$), the causal mask ($M$), and the per-head attention score matrices ($S_h$) all have dimensions of $n \times n$. Therefore, the space complexity is dominated by these matrices, resulting in an order of $O(n^2)$.
3.5. Synergy with the Causal Mask and Limits of Augmentation
A critical aspect of DAA’s design is its interaction with the standard causal mask, which provides the “non-negotiable temporal boundary” essential for stable autoregressive training. While the DAA mechanism is designed to be a powerful augmentation, its power is explicitly limited to operate within the confines set by the causal mask. It refines the model’s focus on the past, but cannot reveal the future.
This safeguard is guaranteed by the order of operations in the final attention score calculation. The computation of the final pre-softmax attention scores can be seen as a three-step process:
The standard attention scores are computed: $S^{\mathrm{raw}} = QK^{\top}/\sqrt{d_k}$.
The causal mask $M$ is applied; this is a matrix in which elements corresponding to future tokens are $-\infty$ and all others are 0.
The DAA bias matrix $B^{\mathrm{DAA}}$, containing learned finite-valued biases, is added.
Therefore, the final score matrix is

$$S = \frac{QK^{\top}}{\sqrt{d_k}} + M + B^{\mathrm{DAA}}.$$

The crucial insight lies in the interaction between $M$ and $B^{\mathrm{DAA}}$. For any token pair $(i, j)$ with $j > i$, i.e., a future token, the corresponding element in $M$ is $-\infty$. When we add the corresponding DAA bias $B^{\mathrm{DAA}}_{ij}$, which is a finite scalar, the result remains negative infinity:

$$-\infty + B^{\mathrm{DAA}}_{ij} = -\infty.$$

This ensures that DAA can never “rescue” or grant access to a token that the causal mask has already forbidden. After the softmax operation, the attention weight for any such token will be exactly zero.
Therefore, the limit of the augmentation is clear: DAA provides a sophisticated, learned, and non-uniform policy for how the model should prioritize information from the allowed causal past. By design, it cannot alter the fundamental boundary between past and future that is essential for autoregressive modeling.
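A tiny PyTorch check of this property (illustrative only): adding any finite bias to a score that the mask has set to negative infinity still yields an attention weight of exactly zero after the softmax.

import torch

scores = torch.tensor([1.0, 2.0, float("-inf")])   # third position is a masked future token
bias = torch.tensor([0.5, -0.3, 4.2])               # finite learned DAA biases
weights = torch.softmax(scores + bias, dim=-1)
print(weights)  # third weight is exactly 0.0: the causal boundary is preserved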
4. Experimental Setup
To rigorously evaluate the performance and utility of DAA, we designed a comprehensive suite of experiments. We detail our datasets, models, baselines, and implementation specifics below.
4.1. Datasets and Tasks
4.1.1. Pretraining Corpus
To effectively train the parameters of the Guidance Network ($\theta$), following standard practice for architectural experiments, we perform continued pretraining rather than training from scratch, which is computationally prohibitive. We use a high-quality 10-billion-token subset of the RedPajama-1T dataset [32]. This corpus is a diverse mixture of web text, books, code, and academic papers, providing a rich signal for learning general-purpose information flow policies.
4.1.2. Downstream Evaluation Tasks
We evaluate our models on a carefully selected set of four downstream tasks, covering language modeling, code generation, mathematical reasoning, and long-context understanding.
Language Modeling: We measure perplexity on the validation set of the C4 dataset (Cleaned Common Crawl) [33]. Lower perplexity indicates a better fundamental grasp of language structure and predictability.
Code Generation: We use HumanEval [34], a standard benchmark consisting of 164 handwritten programming problems. The metric is pass@1, which measures the percentage of problems for which at least one correct solution is generated in a single attempt.
Mathematical Reasoning: We use GSM8k [35], a dataset of 8.5k grade-school math word problems. This task requires multi-step logical reasoning to arrive at a final numerical answer. We report the final accuracy.
Long-Context Question Answering: We use the QuALITY dataset [36], which features long narrative articles and requires models to answer multiple-choice questions whose answers are often not stated explicitly and require integrating information from across the document. We report accuracy on this challenging QA task.
4.1.3. Data Preprocessing and Prompting
To ensure reproducibility, we detail our data preprocessing pipeline and the specific prompt structures used for downstream evaluation below.
All text data for both pretraining and downstream tasks were processed using the official LLaMA-2 tokenizer. This is a SentencePiece-based tokenizer that utilizes a Byte-Pair Encoding (BPE) algorithm with a vocabulary of 32,000 tokens. Using a consistent tokenizer across all stages ensures that the model’s learned representations are stable and meaningful.
The sequence length was managed as follows:
Continued Pretraining: For continued pretraining on the RedPajama subset, we used a maximum sequence length of 2048 tokens. Documents longer than this were truncated by removing tokens from the end. Shorter documents were processed as-is without padding.
Downstream Evaluation: For most downstream tasks, the inputs fit comfortably within this window. For specific long-context tasks such as QuALITY and our “lost in the middle” experiment, we utilized a sequence length of 4096 tokens to evaluate the model’s long-range capabilities.
Our DAA-enhanced models were evaluated in a zero-shot setting, meaning that no examples were provided in the prompt. The prompts were structured to directly query the model for the answer.
GSM8k (Mathematical Reasoning): The prompt consisted of the question prefixed by Q: and followed by a newline and A:, which serves as the trigger for the model to generate the step-by-step solution and final answer.
Structure:
Q: [Question Text]
A:
HumanEval (Code Generation): The prompt was the unaltered content of the benchmark’s prompt field, which includes the function signature, docstring, and sometimes preceding helper functions or imports. The model’s task is to generate the Python (version 3.9.0) code that completes the function body.
Structure:
[Function Signature]
"""
[Docstring containing function description and examples]
"""
QuALITY (Long-Context QA): The prompt was constructed by concatenating the full article text, followed by the question and the multiple-choice options, and finally a trigger for the answer.
Structure:
[Article Text]
Question: [Question Text]
Options:
(A) [Option A Text]
(B) [Option B Text]
...
Answer:
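For concreteness, a sketch of how these prompt templates can be assembled (helper names are ours; the templates follow the structures shown above):

def gsm8k_prompt(question: str) -> str:
    return f"Q: {question}\nA:"

def humaneval_prompt(problem: dict) -> str:
    # The benchmark's 'prompt' field already contains the signature, docstring, and imports.
    return problem["prompt"]

def quality_prompt(article: str, question: str, options: list[str]) -> str:
    letters = "ABCDEFGH"
    opts = "\n".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return f"{article}\nQuestion: {question}\nOptions:\n{opts}\nAnswer:"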
4.2. Baselines
The comparison is performed against the following six baselines:
- (1)
LLaMA-2 7B (Base): The original unmodified LLaMA-2 7B model. This serves as our primary point of comparison to measure the direct impact of DAA.
- (2)
Mistral-7B (Base): The original unmodified Mistral-7B model.
- (3)
LLaMA-2 7B + ALiBi: A strong architectural baseline. We replace the RoPE positional embeddings of LLaMA-2 with ALiBi [14], which also uses an attention bias, but one that is static and content-agnostic. This directly contrasts our dynamic approach with a static one.
- (4)
Few-Shot Prompting: We evaluate the base models using standard few-shot prompting (5 shots) on reasoning tasks to establish a strong non-architectural baseline performance.
- (5)
Few-Shot Chain-of-Thought (CoT): A powerful inference-time technique. We provide the base models with 5-shot examples that include step-by-step reasoning, following the method of [8].
- (6)
Self-Consistency with CoT: A state-of-the-art inference baseline. We sample multiple reasoning paths using CoT and select the most frequent answer, as described in [20]. This represents the upper bound of what can be achieved with advanced prompting on the base models.
4.3. Implementation and Training Details
4.3.1. Model Configuration and DAA Integration
To demonstrate the general applicability of our method, we integrate DAA into two different state-of-the-art open-source LLMs: LLaMA-2-7B [37] and Mistral-7B [38]. The decision to integrate DAA blocks into every other transformer layer, starting from the first, was guided by both theoretical considerations and direct empirical optimization. Theoretically, reasoning in LLMs is a hierarchical process, and we hypothesized that the ability to dynamically route information would be beneficial at multiple stages of representation—from lower-level semantic grouping to higher-level abstract reasoning.
However, applying DAA to every single layer could introduce significant computational overhead and parameter redundancy for potentially diminishing returns. To determine the most effective and efficient integration strategy, we performed a rigorous ablation study on the density of DAA layer placement. As we detail in our analysis, the results demonstrated that the “every other layer” approach provides a substantial performance improvement over a sparser application (i.e., every fourth layer), achieving nearly the same performance as applying DAA to every layer but with half the parameter and computational cost. This empirical validation confirms that our chosen configuration represents the optimal tradeoff, endowing the model with sufficient dynamic routing capacity throughout its depth without unnecessary overhead. The DAA Guidance Network uses a hidden dimension of $d_h = 64$, and its output layer is initialized to zero to ensure training stability.
4.3.2. Training Procedure
We take the official pretrained weights for LLaMA-2 7B and Mistral-7B as our starting point. We then perform continued pretraining on the 10B-token RedPajama subset for 50,000 steps with a global batch size of 1024 and a sequence length of 2048. We use the AdamW optimizer with a cosine learning rate schedule and a 1000-step warmup. All experiments were conducted on 32 NVIDIA A100 (80 GB) GPUs using PyTorch (version 2.3.0) and the Hugging Face Transformers library.
4.3.3. Evaluation Framework and Metrics
For all downstream tasks, we employed a consistent and rigorous evaluation framework to ensure that comparisons between models are fair and reproducible. The framework consists of specific protocols for generation, scoring, and experimental controls.
Generative Reasoning Tasks (GSM8k & HumanEval): For tasks requiring free-form generation, we used a standardized decoding strategy across all models.
- -
Generation Parameters: We used nucleus sampling with a low temperature and a fixed top-p threshold. This low-temperature setting ensures that the model’s output is mostly deterministic and reflects its most confident reasoning path, while allowing for minor variations. A single output was generated for each problem (‘pass@1’ for HumanEval).
- -
Stopping Criteria: Generation was terminated upon producing a predefined stop token or a natural end-of-logic marker (e.g., a newline followed by “Q:” for GSM8k, or an end-of-function sequence for HumanEval).
- -
Scoring: For GSM8k, a parsing script was used to extract the final numerical answer from the generated chain-of-thought text before comparing it to the gold answer. For HumanEval, the generated Python code was executed against a standard suite of hidden unit tests to determine correctness.
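A minimal sketch of such a parsing step (the authors’ exact script is not specified; this regex-based version is an illustrative stand-in):

import re
from typing import Optional

def extract_final_number(generation: str) -> Optional[str]:
    """Return the last number appearing in the generated chain of thought."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", generation.replace("$", ""))
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(generation: str, gold: str) -> bool:
    pred = extract_final_number(generation)
    return pred is not None and float(pred) == float(gold)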
Multiple-Choice Tasks (QuALITY): For multiple-choice question answering, we did not use free-form generation. Instead, we employed a likelihood-based evaluation method. For each question, we constructed a separate prompt for each possible choice by appending the choice text to the context and question. We then computed the perplexity of each complete prompt under the model and selected the choice corresponding to the prompt with the lowest perplexity (i.e., the highest likelihood).
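A sketch of this likelihood-based selection using the Hugging Face Transformers interface (simplified; it mirrors the described procedure rather than reproducing the exact evaluation script):

import torch

@torch.no_grad()
def pick_choice(model, tokenizer, context_and_question: str, choices: list) -> int:
    """Return the index of the choice whose full prompt has the lowest perplexity."""
    losses = []
    for choice in choices:
        text = f"{context_and_question} {choice}"
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        out = model(ids, labels=ids)            # mean cross-entropy over tokens
        losses.append(out.loss.item())          # lower loss implies lower perplexity
    return int(torch.tensor(losses).argmin())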
Experimental Rigor: The exact same generation settings, prompting functions, and scoring scripts were used for all evaluated models (DAA-enhanced, baselines, and ablations) to ensure a fair and direct comparison. All reported metrics are the average of three independent runs using different random seeds for the model initialization and data shuffling where applicable.
This consistent framework ensures that any observed performance differences are attributable to the architectural modifications of the models themselves rather than to variations in the evaluation procedure.
5. Results and Analysis
We now present and analyze the empirical results of our experiments. Our analysis moves beyond surface-level metrics to dissect the mechanisms through which DAA improves model performance. We investigate its impact on downstream tasks, justify our design through ablations, offer a qualitative look into its emergent reasoning capabilities, and analyze its training dynamics.
5.1. Main Results
The primary evaluation results summarized in Table 2 reveal a consistent and significant advantage for DAA-augmented models. Perplexity (PPL) is reported on C4 (lower is better). Accuracy (%) is reported for GSM8k and QuALITY. Pass@1 (%) is reported for HumanEval. Best performance is in bold, second best is underlined. Our DAA models are evaluated zero-shot.
The superior performance of DAA models, particularly in a zero-shot setting, strongly suggests that we have successfully embedded a crucial inductive bias for reasoning directly into the model’s architecture. While inference-time methods such as CoT and Self-Consistency are powerful, they are essentially “scaffolding” erected around a fixed architecture. They guide the model’s output generation, but do not alter its core information processing capabilities. In contrast, DAA modifies this core. The model does not just learn “what” to say, but fundamentally learns “how” to route and prioritize information to arrive at a conclusion.
The stark outperformance of DAA over ALiBi is particularly illuminating. ALiBi’s static distance-based penalty improves performance by providing a simple heuristic: “nearby is more important.” DAA’s performance demonstrates that this heuristic is insufficient. True reasoning requires a semantic and logical heuristic: “this specific premise is more important than that one, regardless of distance.” DAA learns this complex content-aware policy, explaining its significant lead on tasks such as GSM8k and HumanEval, where logical dependencies are non-local and semantic.
Furthermore, the fact that Mistral-7B-DAA surpasses even the sophisticated Self-Consistency baseline is remarkable. Self-Consistency relies on generating multiple reasoning paths and using a voting mechanism to filter out errors, a process that implicitly averages over the model’s internal inconsistencies. DAA appears to reduce this internal inconsistency from the outset by equipping the model with a more robust directed mechanism for following a single coherent line of thought. This suggests a potential synergistic relationship in which a DAA-enhanced model might provide even higher-quality candidate paths for Self-Consistency to work with.
5.2. Ablation Studies
The ablation studies in Table 3 provide critical insights into the nature of the learned information policies.
Guidance Network Capacity: The performance plateau observed as $d_h$ increases from 64 to 128 suggests that the complexity of the learned pairwise relationships is substantial but not infinite. A hidden dimension of 64 appears to be a “sweet spot”, providing enough capacity to model the crucial query–key interactions without overfitting or introducing excessive parameter overhead. This implies that the learned rules are generalizable patterns (e.g., “prioritize numbers when a mathematical operator is queried”) rather than memorized instance-specific connections.
Layer-wise Distribution of Reasoning: The ablation experiment on layer density reveals that the ability to dynamically route information is beneficial throughout the model’s representational hierarchy. Applying DAA only in early layers (e.g., every fourth layer) yields suboptimal results, suggesting that this mechanism is not merely for low-level feature grouping. The significant gain from applying it in every second layer indicates its utility in both intermediate semantic processing and higher-level abstract reasoning. The marginal gain from applying it to every layer suggests a degree of redundancy; a model with DAA in half of its layers can already propagate this refined information flow effectively. This points to an efficient implementation strategy for future work.
5.3. Emergence of a Dynamic Reasoning Graph
Figure 1 provides a qualitative glimpse into the internal mechanisms fostered by DAA, showing a visualization of the DAA bias matrix ($B^{\mathrm{DAA}}$) for a specific attention head when generating the final answer “11”. The query token is “11”, and the key tokens are the preceding context. Blue cells indicate a strong positive bias (attraction), while red cells indicate a negative bias (repulsion). The DAA mechanism has learned to dynamically amplify the information flow from the crucial operands (“5” and “6”) while suppressing irrelevant tokens. The visualization is generated using the following procedure:
Input and Context: The model was fed the input prompt: "Q: John has 5 apples. He buys 6 more. A: 5 + 6 =". The model autoregressively generates tokens up to this point.
Capture at Generation Step: The visualization captures the model’s state at the precise moment it is generating the final token, "11". In the autoregressive framework, the query is derived from the last token in the sequence (here, "=") to predict the next token.
Extract Query and Key Vectors: From a specific layer (e.g., layer 15) and a specific attention head chosen for its clarity in illustrating the reasoning process, we extracted the query vector for the current generation step and the key vectors for all preceding context tokens.
Compute DAA Biases: Each query–key pair was passed through the trained Guidance Network $g_{\theta}$ for that layer to compute the corresponding scalar bias value. This results in a vector of bias values, one for each token in the context relative to the query.
Render Visualization: This vector of scalar biases was then plotted. A diverging colormap (specifically, coolwarm in Matplotlib) was used to map the values to colors:
Strongly positive values are mapped to bright blue (attraction).
Strongly negative values are mapped to bright red (repulsion).
Values near zero are mapped to white.
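A sketch of this rendering step in Matplotlib (variable names are ours; we use the reversed coolwarm variant so that positive biases appear blue, matching the description above):

import numpy as np
import matplotlib.pyplot as plt

def plot_daa_biases(tokens, biases, out_path="daa_bias.png"):
    """Render per-token DAA bias values for a single query as a one-row heatmap."""
    biases = np.asarray(biases)[None, :]                 # shape (1, n_context)
    vmax = np.abs(biases).max() or 1.0                   # symmetric limits map zero to white
    fig, ax = plt.subplots(figsize=(len(tokens) * 0.6, 1.5))
    # Reversed coolwarm: positive (attraction) renders blue, negative (repulsion) red.
    ax.imshow(biases, cmap="coolwarm_r", vmin=-vmax, vmax=vmax, aspect="auto")
    ax.set_xticks(range(len(tokens)), labels=tokens, rotation=45, ha="right")
    ax.set_yticks([])
    fig.tight_layout()
    fig.savefig(out_path)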
This visualization offers profound insight into the core theme of this special issue, namely the interplay of symmetry and asymmetry. The standard attention mechanism is naively symmetric in its treatment of the past (within the causal mask, all positions are equally accessible). Our visualization shows DAA learning a highly specific task-driven asymmetry. The model does not merely attend to the past; it constructs a directed acyclic graph (DAG) of computation on-the-fly. The nodes are tokens, and the DAA biases are the learned weights of the directed edges.
In this example, when the query is the final result of “11”, the model has learned to assign high-weight edges originating from the numerical inputs “5” and “6” as well as from the intermediate step “5 + 6”. This is not a simple pattern, but a learned computational graph. The negative biases are equally important; they represent a learned pruning of the graph in which the model actively decides to ignore irrelevant information (e.g., “John”, “apples”), thereby reducing noise and focusing its “cognitive” resources. This learned asymmetry directly mirrors the structure of logical dependency, providing a far more faithful and interpretable view of the model’s “thought process” than raw attention scores, which are often diffuse and difficult to interpret [25].
5.4. The Effect of Pretraining Dynamics and Sample Efficiency
Our zero-initialization strategy ensures that DAA starts identically to the baseline. The DAA model’s curve then diverges downward, indicating that it learns a more effective representation, leading to faster convergence and a lower final perplexity. As shown in Figure 2, the learning curve provides strong evidence for improved sample efficiency. The DAA model not only reaches a better final state but also learns faster. This suggests that the DAA architecture provides a more suitable inductive bias for language modeling from the start. A standard transformer must expend a significant portion of its capacity and training data to implicitly learn how to filter and prioritize its context. By providing an explicit, specialized mechanism for this task in the form of the Guidance Network, DAA frees up the model’s main parameters to focus on learning semantic and syntactic representations. The stable divergence from the baseline enabled by our zero-initialization strategy confirms that the DAA mechanism is not merely a regularizer but a genuine architectural improvement that the model actively learns to leverage in order to better predict the next token. This improved efficiency is a highly desirable property, potentially reducing the computational cost of training highly capable models.
5.5. Probing the Mechanisms of DAA
To further validate that the performance gains stem from DAA’s hypothesized mechanisms, we conducted three additional targeted experiments.
5.5.1. Robustness to Informational Distractors
A key claim is that DAA learns to suppress irrelevant information. To test this directly, we designed a “distractor” experiment using the SQuAD 1.1 development set. For each question–context pair, we synthetically inserted three to five irrelevant sentences randomly sampled from other Wikipedia articles into the middle of the context paragraph. A robust model should be able to ignore these distractors and maintain its F1 score.
As shown in Figure 3, the baseline models suffer significant declines in performance when faced with distractors. In contrast, the DAA-enhanced models exhibit much greater resilience, with their F1 scores degrading by less than a third of the drop shown by the baselines. This provides strong direct evidence that the learned negative biases in DAA are not an artifact but a functional mechanism for actively pruning irrelevant information from the context, leading to more robust and focused reasoning.
5.5.2. Mitigating the “Lost in the Middle” Challenge
We investigated DAA’s impact on long-context processing, specifically the “lost in the middle” phenomenon, in which models struggle to utilize information located in the center of their context window [7]. We used a synthetic key–value retrieval task. A random key–value pair (e.g., “The special code is: 7B-D4A”) was inserted into documents of 4096 tokens at varying depths (from 0% to 100% of the document length). The model was then prompted to retrieve the value for the given key.
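A sketch of how such a probe can be constructed (the filler source, key phrasing, and question wording are illustrative assumptions rather than the paper's exact setup):

def build_kv_probe(filler_sentences, depth, key="The special code is:", value="7B-D4A"):
    """Insert the key-value fact at a relative depth (0.0 to 1.0) of a filler document."""
    insert_at = int(len(filler_sentences) * depth)
    doc = filler_sentences[:insert_at] + [f"{key} {value}"] + filler_sentences[insert_at:]
    prompt = " ".join(doc) + "\nWhat is the special code?\nAnswer:"
    return prompt, value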
Figure 4 starkly illustrates the issue and our solution. The baseline Mistral model’s performance forms a “U” shape, with accuracy plummeting to below 50% when the key is in the middle of the context. The Mistral-DAA model, on the other hand, shows a much flatter and higher curve, maintaining close to 90% accuracy even at the most challenging central position. This indicates that DAA’s ability to learn content-based importance allows it to “reach” into the context and amplify a crucial piece of information regardless of its position. This overcomes the implicit bias of standard attention mechanisms, which often over-privilege the beginning and end of the context, demonstrating a fundamental improvement in long-range information integration.
5.5.3. Architectural Efficiency vs. Naive Scaling
A critical question is whether DAA’s benefits are truly architectural or could be replicated by simply adding a similar number of parameters to the baseline model. We created a control model, LLaMA-2-7B-Dense+, by increasing the hidden dimension of the FFN layers in the baseline LLaMA-2 7B to match the total parameter count of our LLaMA-2-7B-DAA model (≈8.4 M additional parameters).
Table 4 shows a clear and compelling result. Simply making the baseline model “denser” provides only a marginal improvement on GSM8k. In stark contrast, using the same parameter budget to implement the DAA mechanism yields a massive performance leap. This result decisively demonstrates that the gains from DAA are not due to the mere addition of parameters but to the structured nature of the modification. DAA’s structured addition of parameters is far more effective than naively increasing model density. DAA provides a specialized and efficient mechanism for learning information flows, an architectural improvement that naive scaling cannot replicate. This confirms that the proposed “asymmetric dynamic design” is the key driver of the observed success.
5.6. Computational Cost Analysis
While the previous sections have demonstrated the performance benefits of DAA, a practical evaluation also requires a clear analysis of its computational overhead. The introduction of the Guidance Network ($g_{\theta}$) adds parameters and computations to the standard transformer block. This section provides a detailed breakdown of this cost.
5.6.1. Parameter Overhead
The additional parameters from DAA stem from the lightweight MLP used for the Guidance Network. The number of parameters for a single shared Guidance Network is provided by

$$P_{g_{\theta}} = (d_{\mathrm{in}} \cdot d_h + d_h) + (d_h \cdot 1 + 1),$$

where $d_{\mathrm{in}} = 2 d_k$ (from concatenating the query and key vectors) and $d_h$ is the hidden dimension. For our LLaMA-2-7B models, we apply DAA to half of the transformer layers (16 out of 32). With our chosen configuration ($d_h = 64$), the total parameter overhead is approximately 8.4 million. As shown in Table 5, this constitutes a negligible fraction of the total model size.
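As a worked check of this figure (assuming the Guidance Network consumes the full 4096-dimensional query and key vectors of LLaMA-2-7B; this is our reading rather than an explicitly stated detail):

$$
P_{g_{\theta}} = \underbrace{(2 \cdot 4096) \cdot 64 + 64}_{W_1,\,b_1} + \underbrace{64 \cdot 1 + 1}_{W_2,\,b_2} = 524{,}417,
\qquad 16 \times 524{,}417 \approx 8.39\ \text{M}.
$$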
5.6.2. FLOPs and Asymptotic Complexity
As noted in Section 3, the DAA module does not alter the asymptotic time complexity of the attention layer, which remains dominated by the $O(n^2 \cdot d_k)$ attention score computation. The additional FLOPs come from the forward pass of the Guidance Network for each of the $n^2$ query–key pairs. The cost of this is proportional to $n^2 \cdot d_k \cdot d_h$. Because we intentionally keep $d_h$ small relative to the model’s hidden size ($d_h = 64$), the increase in the constant factor of the overall computation is modest. The core computational bottleneck remains the main attention matrix calculation.
5.6.3. Empirical Runtime Overhead
To measure the real-world impact on speed, we benchmarked the end-to-end throughput of the LLaMA-2-7B base model against our LLaMA-2-7B-DAA model on our evaluation hardware (32 NVIDIA A100 80GB GPUs).
Training: We observed a ~9% decrease in training throughput (measured in tokens per second).
Inference: We measured a ~12% increase in latency for autoregressively generating a single sequence of 2048 tokens.
These empirical results confirm that DAA introduces a manageable overhead.
5.6.4. Summary of the Tradeoff
Table 5 summarizes the tradeoff between the computational cost and the performance benefit on the GSM8k reasoning benchmark.
The analysis clearly shows that DAA offers a highly favorable cost–benefit ratio. The substantial +23.6 absolute point improvement on a complex reasoning task is achieved with a negligible increase in model size and a modest (single-digit to low-double-digit) percentage impact on training and inference speed. This positions DAA as a computationally viable and effective architectural enhancement.
5.7. The Impact of Initialization on Training Stability
In this section, we present an experiment comparing the training stability of two models: our proposed LLaMA-2-7B-DAA model and an identical model in which the DAA output layer was initialized using a standard Kaiming (He) initialization instead of zeros. Both models were trained on the RedPajama-1T subset for the first 5000 steps using the same hyperparameters.
Figure 5 plots the validation loss for both models during this initial training phase.
The results are unambiguous.
DAA with Kaiming Initialization (Unstable): The model begins with a validation loss far exceeding that of the base model. This initial loss spike confirms that injecting a random, untrained bias into the attention scores catastrophically disrupts the model’s pretrained knowledge. While the optimizer eventually brings the gradients under control, the model’s learning trajectory is clearly compromised.
DAA with Zero-Initialization (Stable): The model’s initial validation loss is identical to that of the base LLaMA-2 7B model. The loss curve shows a smooth monotonic decrease from the very first step. This demonstrates that our initialization strategy successfully preserves the model’s initial state and allows for a stable and effective learning process.
This comparison provides strong empirical evidence that our zero-initialization strategy is a necessary precondition for the successful and stable application of architectural augmentations such as DAA to large pretrained language models.
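The strategy itself amounts to zeroing only the Guidance Network’s output layer, so that the added attention bias is exactly zero at step 0 and the augmented model is functionally identical to the pretrained one before any updates. A minimal PyTorch sketch (the module structure and names are illustrative, not taken verbatim from our implementation) is:

import torch.nn as nn

def zero_init_output(guidance_net: nn.Sequential) -> None:
    """Zero the final linear layer so the DAA bias starts at exactly 0.

    With a zero output head, softmax(scores + 0) == softmax(scores), so the
    pretrained model's behavior is preserved at initialization and the loss
    spike seen with Kaiming-initialized output weights is avoided.
    """
    last = guidance_net[-1]
    assert isinstance(last, nn.Linear)
    nn.init.zeros_(last.weight)
    if last.bias is not None:
        nn.init.zeros_(last.bias)

# Illustrative guidance network: hidden layers keep their default initialization;
# only the scalar output head is zeroed.
guidance_net = nn.Sequential(nn.Linear(256, 64), nn.Tanh(), nn.Linear(64, 1))
zero_init_output(guidance_net)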
5.8. Comparative Visualization of Attention Mechanisms
While the DAA bias visualization in the previous section reveals the learned policy, its true value is best understood when compared directly against the final attention patterns of other mechanisms. To this end,
Figure 6 provides a side-by-side comparison of the final post-softmax attention weights from the baseline model, an ALiBi-enhanced model, and our DAA-enhanced model for the exact same reasoning task. The visualization shows the attention distribution from the final query token (“=”) to all preceding context tokens.
The differences in interpretability are stark:
Baseline Attention: The attention distribution is diffuse and unfocused. While there are minor peaks on some numbers, the model struggles to distinguish relevant operands from irrelevant context. The pattern lacks a clear logical structure, making it difficult to interpret the model’s reasoning path.
ALiBi: This visualization perfectly illustrates the limitation of a static positional heuristic. The attention is overwhelmingly concentrated on the tokens closest to the query (+, 6, =). It completely ignores the distant but crucial operand “5”. This demonstrates that ALiBi is not performing logical reasoning but is simply applying a fixed “nearby is important” rule, which fails in this non-local reasoning context.
DAA: The DAA-enhanced model’s attention pattern is a clear and interpretable reasoning graph. It places sharp high-confidence attention weights on exactly the tokens required for the final computation: “5”, “+”, and “6” (and the first “5”). It has learned to suppress attention to all other tokens, including the nearby but irrelevant “A:” and the distant and irrelevant “John” and “apples”.
This direct comparison provides powerful visual evidence that DAA transforms the attention mechanism from a simple similarity or proximity measure into a true task-aware reasoning engine. It learns not just what to attend to, but why, creating a far more transparent and interpretable model.
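Comparable visualizations can be produced for any model that exposes its attention weights. The sketch below extracts and plots the final query token’s post-softmax attention row, assuming a Hugging Face-style interface; the layer/head selection and all names are placeholders:

import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_last_query_attention(model, tokenizer, prompt, layer=-1, head=0):
    """Plot how the final query token attends to all preceding context tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs, output_attentions=True)
    # out.attentions: tuple of (batch, heads, q_len, k_len); take the last query row
    row = out.attentions[layer][0, head, -1, :].float().cpu().numpy()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    plt.bar(range(len(tokens)), row)
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.ylabel("post-softmax attention weight")
    plt.tight_layout()
    plt.show()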
5.9. Our Investigative Strategy: From Foundational Impact to Mechanistic Insight
Our approach to evaluating DAA is designed to build a comprehensive case, moving from broad validation to specific mechanistic understanding. The investigation proceeds in three main stages:
First, we must establish that DAA is a valid architectural enhancement for fundamental language modeling. Before assessing complex reasoning, we verify that our modification does not harm, and ideally improves, the model’s core ability to process and predict natural language. This is achieved through continued pretraining on a large general-purpose corpus (RedPajama) and measuring its impact on a standard language modeling metric, validation perplexity on the C4 dataset. A lower perplexity indicates that DAA improves the model’s core predictive capabilities and sample efficiency.
The core claim of our paper is that DAA enhances reasoning. The most direct test of this hypothesis is to evaluate the intrinsic zero-shot performance of our DAA-enhanced models on tasks that explicitly require complex non-local logic. Tasks such as GSM8k (mathematical reasoning) and HumanEval (code generation) were chosen specifically because their solutions require the model to identify and connect multiple, often distant, pieces of information to construct a coherent multi-step argument. Strong performance on these tasks provides direct evidence that DAA has improved the model’s intrinsic reasoning architecture, which is the primary downstream impact we aim to demonstrate.
Finally, to understand why and how DAA is effective, we dissect the mechanism through a series of targeted analyses. This stage answers the crucial questions about the source of the performance gains.
Ablation Studies: These experiments rule out confounding factors, such as proving that the gains are not simply due to adding more parameters but rather to DAA’s specific structure.
Targeted Probing Experiments: These experiments directly test DAA’s hypothesized capabilities, such as its robustness to informational distractors and its ability to mitigate the “lost in the middle” problem.
Qualitative Visualization: These visualizations provide direct visual proof of the learned reasoning process, showing how DAA enables a sharp surgical focus on logically relevant tokens.
This multi-faceted approach allows us to build a comprehensive case connecting the architectural change of DAA to its direct impact on downstream reasoning tasks and its underlying mechanisms.
6. Implications for Fine-Tuning and Future Work
While this work has focused on demonstrating the benefits of DAA through continued pretraining and zero-shot evaluation, the architectural nature of our enhancement has profound and promising implications for fine-tuning and transfer learning settings. We view this as a critical avenue for future research.
Enhanced Sample Efficiency and a Superior Starting Point: A core challenge in fine-tuning is adapting a general-purpose model to a specific task with a limited number of examples. We hypothesize that a DAA-enhanced model provides a significantly better starting point for this process. By already possessing a more sophisticated learned mechanism for information routing and logical dependency tracking, the model has a stronger inductive bias for reasoning. This should translate directly into improved sample efficiency, requiring fewer labeled examples to reach high performance on downstream tasks.
Adaptable Task-Specific Reasoning Policies: Perhaps the most exciting prospect is the ability of the Guidance Network itself to be fine-tuned. During task-specific fine-tuning, the Guidance Network’s parameters would be updated along with the rest of the model (a minimal sketch of such a setup follows the list below). This means that DAA could allow the model to learn a specialized reasoning policy tailored to the new task. For instance:
On a summarization task, the Guidance Network could learn to assign a strong positive bias to information from the introductory and concluding paragraphs of a source document.
On a code generation task from docstrings, it could learn to dynamically link function parameters to their usage within the code body.
On a multi-hop question answering task, the learned biases could explicitly trace the path of evidence from one fact to the next.
This goes beyond simply adapting the model’s knowledge to additionally adapt the model’s reasoning process.
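To make this concrete, the sketch below shows one way the Guidance Network’s parameters could be included in a standard fine-tuning loop, optionally with their own learning rate. The “guidance” naming convention and the hyperparameter values are assumptions for illustration, not an experiment we have run:

import torch

def build_finetune_optimizer(model, lr=2e-5, guidance_lr=None):
    """Fine-tune the full model while grouping Guidance Network parameters separately."""
    guidance, backbone = [], []
    for name, param in model.named_parameters():
        (guidance if "guidance" in name else backbone).append(param)
    param_groups = [
        {"params": backbone, "lr": lr},
        {"params": guidance, "lr": guidance_lr if guidance_lr is not None else lr},
    ]
    return torch.optim.AdamW(param_groups)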
Future Experimental Directions: Validating these hypotheses requires a comprehensive empirical study, which we propose as future work. This would involve fine-tuning DAA-enhanced models against standard baselines on a diverse suite of tasks from benchmarks such as GLUE and SuperGLUE. Key metrics would include not only final performance but also the learning rate and sample efficiency. Furthermore, analyzing how the learned DAA bias matrices change from their pretrained state to their fine-tuned state could provide unprecedented insight into how LLMs adapt their internal reasoning strategies for specific tasks.
7. Applicability to Bidirectional Models (e.g., BERT)
While our experiments have focused on autoregressive LLMs, the core principle of DAA—learning a dynamic, content-aware asymmetry—is a general one that can be adapted to bidirectional encoder models such as BERT. The role and benefit of DAA in this context would be conceptually distinct but equally powerful.
In autoregressive models, a static causal mask already imposes a hard asymmetry, and DAA’s role is to refine it by learning which of the available past tokens are most important. In contrast, the self-attention in a bidirectional model such as BERT imposes no such structural asymmetry: attention between tokens i and j is computed in the same unmasked, all-to-all fashion in both directions.
When applied to a bidirectional model, DAA would introduce a learned asymmetry. It would transform the standard attention mechanism, which effectively models an undirected graph of token relationships, into one that models a learned directed graph. The modification would be straightforward: the DAA bias matrix would be computed for all query–key pairs and added to the attention scores without a causal mask,
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V,$
where $B$ is now a dense and potentially asymmetric matrix in which $B_{ij} \neq B_{ji}$ in general.
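A minimal sketch of this bidirectional variant is given below; it reuses the same kind of lightweight pairwise Guidance Network as in the autoregressive setting, with illustrative shapes and an assumed two-layer MLP (a conceptual sketch, not an implementation we have evaluated):

import torch
import torch.nn as nn
import torch.nn.functional as F

def bidirectional_daa_attention(q, k, v, bias_net):
    """Bidirectional attention with a dense, learned asymmetric bias (no causal mask).

    q, k, v: (batch, n, d). bias_net maps each concatenated (q_i, k_j) pair to a
    scalar, giving a dense (batch, n, n) matrix B with B[i, j] != B[j, i] in general.
    """
    n, d = q.shape[1], q.shape[-1]
    scores = q @ k.transpose(-1, -2) / d ** 0.5                      # standard unmasked scores
    pairs = torch.cat([q[:, :, None, :].expand(-1, n, n, -1),
                       k[:, None, :, :].expand(-1, n, n, -1)], dim=-1)
    bias = bias_net(pairs).squeeze(-1)                               # dense, asymmetric bias
    return F.softmax(scores + bias, dim=-1) @ v

# Illustrative usage with a tiny two-layer MLP as the guidance network
net = nn.Sequential(nn.Linear(2 * 16, 8), nn.Tanh(), nn.Linear(8, 1))
out = bidirectional_daa_attention(torch.randn(2, 5, 16), torch.randn(2, 5, 16),
                                  torch.randn(2, 5, 16), net)
print(out.shape)  # torch.Size([2, 5, 16])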
This learned directionality could be highly beneficial for a range of NLP tasks that rely on understanding structured directional relationships:
Syntactic Structure: DAA could learn to model the directed dependencies of a sentence’s parse tree, for instance by creating a strong bias from a verb to its subject but not vice-versa.
Causal and Logical Inference: For tasks such as relation extraction, DAA could learn to represent cause-and-effect relationships directionally. In the sentence “Heavy rain caused the flood,” it might learn a strong positive bias from rain to flood, but a neutral or negative bias in the opposite direction.
Coreference Resolution: The mechanism could learn a directed link from a pronoun (e.g., “it”) back to its antecedent (e.g., “the ball”), explicitly modeling the anaphoric relationship.
Applying DAA in this context would involve computing a dense bias matrix, which carries a higher computational cost than the triangular matrix used in the autoregressive setting. However, for tasks where understanding relational structure and directed information flow is paramount, this could be a worthwhile tradeoff. Investigating the impact of DAA on bidirectional models for tasks in natural language understanding represents a promising direction for future work.
8. Conclusions
In this work, we have challenged the foundational paradigm of static unidirectional causality that underpins modern autoregressive large language models. By introducing DAA, we have demonstrated that replacing this rigid content-agnostic asymmetry with a flexible and learnable policy for information flow yields substantial benefits. Our comprehensive experiments show that this architectural shift not only improves performance on both foundational language modeling and complex reasoning benchmarks but also provides a powerful defense against known failure modes such as the “lost in the middle” problem and vulnerability to informational distractors. Crucially, DAA offers a novel and intuitive lens for interpretability, transforming the opaque attention mechanism into a transparent and visualizable reasoning graph that reveals the model’s emergent computational strategies. This research provides compelling evidence that the future of more capable, robust, and trustworthy AI lies in moving beyond fixed constraints and empowering models to learn their own dynamic asymmetric processing pathways, opening up new avenues for scientific discovery and advanced reasoning in machines.
Although Dynamic Asymmetric Attention significantly enhances the capabilities of large language models, several considerations in its design are worth noting. First, DAA is intended to augment rather than replace the existing causal mask, which remains critical to the generative stability of autoregressive models. Second, DAA introduces additional parameters and computational overhead, although these increases are generally manageable and parameter-efficient. Finally, optimal performance relies on careful tuning of specific hyperparameters, such as the hidden dimension of the Guidance Network and the fraction of layers to which DAA is applied; improper configurations may lead to inefficiencies.
Author Contributions
Conceptualization, C.L.; methodology, F.W., X.L., and H.Y.; software, F.W. and H.L.; investigation, X.L. and C.L.; data curation, C.L. and X.S.; writing—original draft, F.W.; writing—review and editing, C.L., C.L., and X.S. All authors have read and agreed to the published version of the manuscript.
Funding
This work is partly supported by the Henan Province Philosophy and Social Sciences Planning Project: Research on Promoting People-Centered New Urbanization in Henan (No. 2022BJJ026); the Henan Provincial Science and Technology Research Program: Key Technologies for Crop Identification and Yield Estimation Using Remote Sensing Data (242102320345); the Foundation and Cutting-Edge Technologies Research Program of Henan Province (No. 252102211067, No. 252102210064, No. 252102210124); and the Research and Practice Project on Higher Education Teaching Reform in Henan Province (2024SJGLX0133).
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author. Restrictions apply to the availability of these data due to institutional policies and the protection of proprietary information.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Zhu, S.; Xu, S.; Sun, H.; Pan, L.; Cui, M.; Du, J.; Jin, R.; Branco, A.; Xiong, D. Multilingual Large Language Models: A Systematic Survey. arXiv 2024, arXiv:2411.11072. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar] [CrossRef]
- Zhu, S.; Cui, M.; Xiong, D. Towards robust in-context learning for machine translation with large language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 16619–16629. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
- Dong, T.; Li, B.; Liu, J.; Zhu, S.; Xiong, D. MLAS-LoRA: Language-Aware Parameters Detection and LoRA-Based Knowledge Transfer for Multilingual Machine Translation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 15645–15660. [Google Scholar]
- Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts. arXiv 2023, arXiv:2307.03172. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
- Yao, S.; Yu, D.; Zhao, J.; Sha, I.; Savarese, S.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822. [Google Scholar]
- Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar] [CrossRef]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2527–2540. [Google Scholar]
- Zaheer, M.; Guruganesh, G.; Dubey, A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 17283–17297. [Google Scholar]
- Su, J.; Lu, Y.; Pan, S.; Murtadha, A.; Wen, B.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. arXiv 2021, arXiv:2104.09864. [Google Scholar] [CrossRef]
- Press, O.; Smith, N.; Lewis, M. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv 2021, arXiv:2108.12409. [Google Scholar]
- Sun, Y.; Dong, L.; Patra, B.; Ma, S.; Huang, S.; Benhaim, A.; Chaudhary, V.; Song, X.; Wei, F. A Length-Extrapolatable Transformer. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 14590–14604. [Google Scholar]
- Duan, S.; Shi, Y.; Xu, W. From interpolation to extrapolation: Complete length generalization for arithmetic transformers. arXiv 2023, arXiv:2310.11984. [Google Scholar]
- Veisi, A.; Amirzadeh, H.; Mansourian, A. Context-aware Biases for Length Extrapolation. arXiv 2025, arXiv:2503.08067. [Google Scholar]
- Zhu, S.; Pan, L.; Xiong, D. FEDS-ICL: Enhancing translation ability and efficiency of large language model by optimizing demonstration selection. Inf. Process. Manag. 2024, 61, 103825. [Google Scholar] [CrossRef]
- Yao, Y.; Li, Z.; Zhao, H. Beyond chain-of-thought, effective graph-of-thought reasoning in language models. arXiv 2023, arXiv:2305.16582. [Google Scholar]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1 May–5 May 2023. [Google Scholar]
- Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic chain of thought prompting in large language models. arXiv 2022, arXiv:2210.03493. [Google Scholar] [CrossRef]
- Diao, S.; Wang, P.; Lin, Y.; Pan, R.; Liu, X.; Zhang, T. Active Prompting with Chain-of-Thought for Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 1330–1350. [Google Scholar]
- Zhu, S.; Pan, L.; Li, B.; Xiong, D. LANDeRMT: Dectecting and Routing Language-Aware Neurons for Selectively Finetuning LLMs to Machine Translation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 12135–12148. [Google Scholar]
- Zhu, S.; Pan, L.; Jian, D.; Xiong, D. Overcoming language barriers via machine translation with sparse Mixture-of-Experts fusion of large language models. Inf. Process. Manag. 2025, 62, 104078. [Google Scholar] [CrossRef]
- Jain, S.; Wallace, B.C. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 3543–3556. [Google Scholar]
- Wiegreffe, S.; Pinter, Y. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5983–5993. [Google Scholar]
- He, G.; Song, X.; Sun, A. Knowledge updating? no more model editing! just selective contextual reasoning. arXiv 2025, arXiv:2503.05212. [Google Scholar]
- Xu, D.; Zhang, Z.; Zhu, Z.; Lin, Z.; Liu, Q.; Wu, X.; Xu, T.; Wang, W.; Ye, Y.; Zhao, X.; et al. Editing factual knowledge and explanatory ability of medical large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 2660–2670. [Google Scholar]
- Belinkov, Y. Probing classifiers: Promises, pitfalls, and a better way. Trans. Assoc. Comput. Linguist. 2022, 10, 1113–1127. [Google Scholar]
- Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and editing factual associations in gpt. Adv. Neural Inf. Process. Syst. 2022, 35, 17359–17372. [Google Scholar]
- Sharma, A.S.; Atkinson, D.; Bau, D. Locating and editing factual associations in mamba. arXiv 2024, arXiv:2404.03646. [Google Scholar] [CrossRef]
- Computer, T. RedPajama-1T: An Open, Reproducible, 1.2 Trillion Token Dataset for Training Large Language Models. 2023. Available online: https://simonwillison.net/2023/Apr/17/redpajama-data/ (accessed on 11 March 2025).
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pires, H.P.d.O.; Le, Q.; Luan, Y.; Jiang, H.; Misra, I.; Krueger, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
- Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv 2021, arXiv:2110.14168. [Google Scholar] [CrossRef]
- Pang, R.Y.; Parrish, A.; Joshi, N.; Nangia, N.; Phang, J.; Chen, A.; Padmakumar, V.; Ma, J.; Thompson, J.; He, H.; et al. QuALITY: Question Answering with Long Input Texts, Yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 4026–4043. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]