ADR: Attention Head Detection and Reweighting Enhance RAG Performance in a Positional-Encoding-Free Paradigm
Abstract
1. Introduction
- Prior research has demonstrated that attention heads in large language models can be categorized into memory heads and context heads in counterfactual tasks. Moreover, suppressing memory heads significantly improves the model’s ability to perform such tasks [8].
- This suppression, however, has a negative impact on the ability of large language models (LLMs) to extract knowledge from MLP sublayers. When the stored knowledge is inaccurate or missing due to the limitations of pre-training, it becomes even more important to utilize contextual knowledge, which is crucial for ensuring robust performance on retrieval-augmented generation (RAG) tasks [9,10].
- Efficiency: The proposed ADR introduces zero inference-time overhead in both memory and computational cost, which is a clear advantage over baseline models. Despite this efficiency, ADR still outperforms previous methods designed to enhance contextual awareness in RAG-based question answering tasks.
- Generality: The attention head detection and reweighting mechanism is agnostic to the model’s positional embedding, making it broadly applicable across diverse architectures. As shown in Section 4, ADR improves RAG performance in models employing different positional embedding algorithms (e.g., RoPE [11] and Alibi [12]).
- Compatibility with fine-tuning: The ADR structure can be seamlessly applied to fine-tuned large models, enhancing their contextual awareness without undermining the knowledge obtained during the fine-tuning process.
2. Related Work
2.1. Context Awareness Enhancement
2.2. Prior Knowledge: Path Patching
2.3. LMs and RAG in the Oil and Gas Industry
3. Methodology
3.1. Discovery of RAG Suppressive Heads
- Successfully completing the recognition task requires the LM to demonstrate the core capabilities of the RAG framework, such as context-based retrieval and text generation. This makes the recognition task well suited for discovering RAG-suppression heads.
- The task minimizes prior-knowledge bias by leveraging the irrelevance and randomness between tokens in the input sequence. This setting aligns with real-world scenarios in which humans rely heavily on contextual knowledge, ensuring that the identified attention heads demonstrate general RAG functionality while remaining agnostic to any specific downstream task.
- During the normal run, the clean input sequence is fed into the model, and the outputs of all attention heads are cached.
- During the perturbed run, the same input sequence is processed through a forward pass, with certain intentional perturbations introduced. Specifically, the perturbation randomly replaces the object of either the first or the second subject within the input sequence, thereby affecting the hidden states in the residual stream. Subsequently, the outputs of all attention heads are recorded under the perturbed condition.
- A subsequent normal run is then performed, in which the attention head at the specified position is intervened upon while all other heads are frozen. The intervention entails replacing that head’s output with its perturbed output, whereas freezing means restoring the outputs of the attention heads following the intervened head to their original values from the normal run, thereby preventing the final output from being influenced indirectly by the intervention during inference. Finally, the model’s output is recomputed. If the output still assigns a high probability to the pre-intervention prediction, the attention head is identified as a RAG-suppression head.
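The three runs described above can be sketched end-to-end on a toy model. This is a minimal NumPy illustration of the patching logic only; the random linear "heads", the shapes, and names such as `HEAD_MAPS` and `patched_output` are our illustrative stand-ins, not the authors' implementation (a real run would hook a transformer's attention outputs, and freezing downstream heads matters once there is more than one layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one transformer layer: each "attention head" is a fixed
# random linear map, and the layer output is the sum of all head outputs
# (their contribution to the residual stream).
N_HEADS, D = 4, 8
HEAD_MAPS = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_HEADS)]

def run_heads(x):
    """Forward pass: return the per-head outputs for input state x."""
    return [W @ x for W in HEAD_MAPS]

def layer_output(head_outputs):
    return np.sum(head_outputs, axis=0)

# 1) Normal run: cache every head's output on the clean input.
x_clean = rng.standard_normal(D)
clean_heads = run_heads(x_clean)

# 2) Perturbed run: same pipeline on a corrupted input (a random replacement
#    standing in for swapping an object token); cache the head outputs again.
x_pert = x_clean.copy()
x_pert[0] = rng.standard_normal()
pert_heads = run_heads(x_pert)

# 3) Patched run: clean input, but head h is replaced by its perturbed output
#    while all other heads keep their frozen outputs from the normal run.
def patched_output(h):
    patched = [pert_heads[i] if i == h else clean_heads[i]
               for i in range(N_HEADS)]
    return layer_output(patched)

# How much patching head h moves the final output (the paper scores this
# with a logit difference rather than the vector norm used here).
influence = [np.linalg.norm(patched_output(h) - layer_output(clean_heads))
             for h in range(N_HEADS)]
```

With a single layer, freezing is implicit: only the patched head's contribution differs between the patched and normal runs, so the measured influence isolates that head's direct effect.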
- All proper tokens from the LLaMA vocabulary were selected during dataset construction. These were randomly combined to create contextually semantically unrelated sequences. Combinations that the model could accurately predict were identified and used as data for negative head detection. Partial examples of these data are shown in Figure 3b.
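The construction can be sketched as follows. The token pool and the "X is Y" template here are hypothetical examples chosen for illustration; the paper's actual filtering of the LLaMA vocabulary and its recognition-task format may differ:

```python
import random

random.seed(0)

# Hypothetical pool standing in for the proper tokens selected from the
# LLaMA vocabulary.
PROPER_TOKENS = ["Alice", "Paris", "Mercury", "Tolstoy", "Nile", "Kyoto",
                 "Darwin", "Saturn", "Vienna", "Tesla"]

def make_sequence(n_pairs=3):
    """Randomly pair unrelated proper tokens into a context, then query one
    subject; the model must answer from context, not from prior knowledge."""
    tokens = random.sample(PROPER_TOKENS, 2 * n_pairs)
    pairs = list(zip(tokens[::2], tokens[1::2]))
    context = ", ".join(f"{s} is {o}" for s, o in pairs)
    subject, answer = random.choice(pairs)
    prompt = f"{context}. {subject} is"
    return prompt, answer
```

Because the pairings are random, any sequence the model completes correctly must have been read from the context, which is what makes such combinations usable for negative-head detection.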
- Our intervention measurement is based on the logit difference between the model’s pre-intervention prediction and the perturbed alternative.
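The formula itself is not reproduced here; as a sketch of the standard definition used in path-patching analyses (the symbols are our labels, not necessarily the paper's), with $o_{\text{ctx}}$ the context-consistent pre-intervention prediction and $o_{\text{pert}}$ the perturbed replacement object:

```latex
\mathrm{LD} \;=\; \operatorname{logit}\bigl(o_{\text{ctx}}\bigr) \;-\; \operatorname{logit}\bigl(o_{\text{pert}}\bigr)
```

A head's score is then, presumably, the change in this logit difference between the patched run and the normal run: a head whose patching leaves the model still favoring the pre-intervention prediction is flagged as RAG-suppressive.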
- For any given LM, the recognition task is repeated multiple times; varying batch sizes and randomly inserted IVS are used to mitigate bias caused by context length. The final score for each head is the average over these repeated experiments. Detailed experimental settings are provided in Section 4.
- Based on the final scores, we identify the K heads with the most negative influence as a set S.
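One natural way to write this selection, with $\bar{s}_{l,h}$ denoting the final averaged score of the head at layer $l$, index $h$ (a sketch; the notation is ours):

```latex
S \;=\; \operatorname*{arg\,topK}_{(l,h)} \bigl( -\,\bar{s}_{l,h} \bigr)
```

that is, the K heads whose averaged scores are most negative.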
- We do not specify the exact mechanisms by which these heads operate within RAG tasks; analyzing the specific mechanism of each head is beyond the focus of this paper and is left for future work.
3.2. Re-Weight Coefficient
- In downstream RAG tasks, the reweighting coefficients are task-agnostic and remain fixed.
- RAG-suppression heads are optimized once for each LM via the recognition task. For a new RAG task, head discovery and coefficient learning do not need to be repeated.
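A minimal sketch of how fixed, task-agnostic coefficients could be applied at inference time. The names, shapes, and integration point are our assumptions, not ADR's actual code; the idea is simply that each discovered head's output is scaled by its learned coefficient before rejoining the residual stream:

```python
import numpy as np

# Hypothetical discovered set S: (layer, head) -> learned coefficient.
# Coefficients are learned once per model and then stay fixed across tasks.
S = {(2, 5): 0.4, (7, 1): 0.6}

def reweight_heads(layer_idx, head_outputs, coeffs=S):
    """Scale the outputs of RAG-suppression heads; leave all others untouched.

    head_outputs: array of shape (n_heads, seq_len, head_dim)
    """
    out = head_outputs.copy()
    for (l, h), alpha in coeffs.items():
        if l == layer_idx:
            out[h] *= alpha
    return out
```

In a real model this would typically be applied via a forward hook on each attention module before the output projection; because it is a fixed elementwise multiply, it adds no measurable inference-time memory or compute, consistent with the efficiency claim above.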
4. Experiments
4.1. Experiments Setup
4.2. Comparison with Baselines on RAG Task
4.3. Application to LMs Using Different Positional Embeddings
- Efficiency: ADR introduces no additional inference-time memory or computational cost, as confirmed by the results in Table 3.
- Effectiveness: As shown in Table 2, ADR achieves consistent performance improvements across three RAG datasets, with the highest average gains.
- Generality: ADR proves robust across different positional embedding schemes (RoPE in LLaMA2-7B and Alibi in Baichuan-13B), highlighting its applicability to diverse model architectures.
4.4. The Effect of K
4.5. ADR Maintains Effectiveness in Oil and Gas Scenarios
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
References
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
- Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173.
- Lin, H.; Lv, A.; Song, Y.; Zhu, H.; Yan, R. Mixture of in-context experts enhance LLMs’ long context awareness. Adv. Neural Inf. Process. Syst. 2024, 37, 79573–79596.
- Zhang, Z.; Chen, R.; Liu, S.; Yao, Z.; Ruwase, O.; Chen, B.; Wu, X.; Wang, Z. Found in the middle: How language models use long contexts better via plug-and-play positional encoding. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024.
- Nanda, N.; Chan, L.; Lieberum, T.; Smith, J.; Steinhardt, J. Progress measures for grokking via mechanistic interpretability. arXiv 2023, arXiv:2301.05217.
- Wang, K.; Variengien, A.; Conmy, A.; Shlegeris, B.; Steinhardt, J. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. arXiv 2022, arXiv:2211.00593.
- Yu, Q.; Merullo, J.; Pavlick, E. Characterizing mechanisms for factual recall in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 9924–9959.
- Geva, M.; Schuster, R.; Berant, J.; Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5484–5495.
- Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and editing factual associations in GPT. Adv. Neural Inf. Process. Syst. 2022, 35, 17359–17372.
- Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063.
- Press, O.; Smith, N.A.; Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv 2021, arXiv:2108.12409.
- Chen, Y.; Lv, A.; Lin, T.E.; Chen, C.; Wu, Y.; Huang, F.; Li, Y.B.; Yan, R. Fortify the shortest stave in attention: Enhancing context awareness of large language models for effective tool use. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 11160–11174.
- Zhang, Q.; Singh, C.; Liu, L.; Liu, X.; Yu, B.; Gao, J.; Zhao, T. Tell your model where to attend: Post-hoc attention steering for LLMs. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- Li, W.; Zhang, Y.; Luo, G.; Yu, D.; Ji, R. Training long-context LLMs efficiently via chunk-wise optimization. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 2691–2700.
- Li, M.; Xu, L.H.; Tan, Q.; Cao, T.; Liu, Y. Sculptor: Empowering LLMs with cognitive agency via active context management. arXiv 2025, arXiv:2508.04664.
- Behrouz, A.; Li, Z.; Kacham, P.; Daliri, M.; Deng, Y.; Zhong, P.; Razaviyayn, M.; Mirrokni, V. Atlas: Learning to optimally memorize the context at test time. arXiv 2025, arXiv:2505.23735.
- Yang, M.H.; Li, X.B.; Zeng, Q.; Li, X. The technical practice of large language models in the upstream business of oil and gas. China CIO News 2024, 61–65.
- Yang, M.H.; Li, X.B.; Liu, X.B.; Zeng, Q. The application and challenges of large artificial intelligence models in the field of oil and gas exploration and development. Pet. Sci. Technol. Forum 2024, 43, 107–113, 125.
- Wei, Q.; Sun, H.; Xu, Y.; Pang, Z.; Gao, F. Exploring the application of large language models based AI agents in leakage detection of natural gas valve chambers. Energies 2024, 17, 5633.
- Eckroth, J.; Gipson, M.; Boden, J.; Hough, L.; Elliott, J.; Quintana, J. Answering natural language questions with OpenAI’s GPT in the petroleum industry. In Proceedings of the SPE Annual Technical Conference and Exhibition, San Antonio, TX, USA, 16–18 October 2023; SPE: Richardson, TX, USA, 2023; p. D031S032R005.
- Gong, Z.; Lv, A.; Guan, J.; Yan, J.; Wu, W.; Zhang, H.; Huang, M.; Zhao, D.; Yan, R. Mixture-of-Modules: Reinventing transformers as dynamic assemblies of modules. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 20924–20938.
- Merullo, J.; Eickhoff, C.; Pavlick, E. Circuit component reuse across tasks in transformer language models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
- Olah, C.; Cammarata, N.; Schubert, L.; Goh, G.; Petrov, M.; Carter, S. Zoom in: An introduction to circuits. Distill 2020, 5, e00024.001.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288.
- Cao, H.; Wu, Y.; Cai, Y.; Zhao, X.; Ou, Z. Improving end-to-end training of retrieval-augmented generation models via joint stochastic approximation. arXiv 2025, arXiv:2508.18168.
- Shi, Z.; Yan, L.; Sun, W.; Feng, Y.; Ren, P.; Ma, X.; Wang, S.; Yin, D.; de Rijke, M.; Ren, Z. Direct retrieval-augmented optimization: Synergizing knowledge selection and language models. arXiv 2025, arXiv:2505.03075.
- Yang, P.; Li, X.; Hu, Z.; Wang, J.; Yin, J.; Wang, H.; He, L.; Yang, S.; Wang, S.; Huang, Y.; et al. HeteRAG: A heterogeneous retrieval-augmented generation framework with decoupled knowledge representations. arXiv 2025, arXiv:2504.10529.
- Cong, Y.; Akash, P.S.; Wang, C.; Chang, K.C.C. Query optimization for parametric knowledge refinement in retrieval-augmented large language models. arXiv 2024, arXiv:2411.07820.
- Wang, L.; Chen, H.; Yang, N.; Huang, X.; Dou, Z.; Wei, F. Chain-of-retrieval augmented generation. arXiv 2025, arXiv:2501.14342.
- Ho, X.; Nguyen, A.K.D.; Sugawara, S.; Aizawa, A. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 6609–6625.
- Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. ♫ MuSiQue: Multihop questions via single-hop question composition. Trans. Assoc. Comput. Linguist. 2022, 10, 539–554.
- Dasigi, P.; Lo, K.; Beltagy, I.; Cohan, A.; Smith, N.A.; Gardner, M. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 4599–4610.
Model Name | Discovered RAG-Suppression Heads (Layer, Head)
---|---
Llama2-7B-chat-4K | (26,28), (26,9), (31,19), (31,28), (31,17), (31,11), (15,10), (15,14), (15,25), (15,8), (15,29), (18,9), (17,0), (14,15), (16,31), (16,18), (19,27), (19,23), (19,18), (25,21), (25,3), (30,9), (30,10), (29,15), (29,20), (13,9), (13,31), (12,6), (28,21), (23,8)
Baichuan-13B-chat-4K | (26,22), (26,31), (25,24), (25,22), (25,10), (25,25), (28,26), (23,20), (23,16), (22,11), (22,21), (22,24), (22,16), (19,20), (32,13), (24,39), (24,15), (24,17), (24,31), (29,14), (29,22), (29,13), (30,29), (38,16), (27,17), (20,12), (20,20), (20,19), (21,27), (37,16)
Position Embedding | Method | 2WikiMultiHopQA | MuSiQue | Qasper | Avg
---|---|---|---|---|---
RoPE | ADR (Ours) ↑ | 44.0 | 17.5 | 19.0 | 26.83
 | MoICE | 44.5 | 16.5 | 16.5 | 25.83
 | AB | 42.0 | 16.5 | 17.0 | 25.17
 | Ms-PoE | 39.0 | 17.0 | 17.0 | 24.33
 | Llama2-7B-chat-4K | 33.5 | 7.0 | 15.0 | 18.5
Alibi | Baichuan-13B-chat-4K | 7.0 | 2.5 | 7.0 | 5.5
 | ADR (Ours) ↑ | 12.5 | 5.5 | 11.5 | 9.83
Method | 2WikiMultiHopQA | MuSiQue | Qasper
---|---|---|---
Llama2-7B-chat-4K | 14.987/14.866 | 15.648/14.866 | 29.182/14.866
MoICE | 34.984 (+19.999)/17.874 (+3.008) | 39.686 (+24.038)/17.874 (+3.008) | 82.725 (+53.543)/17.874 (+3.008)
AB | 35.639 (+20.652)/50.880 (+36.014) | 39.869 (+24.221)/50.880 (+36.014) | 42.538 (+13.356)/50.880 (+36.014)
Ms-PoE | 33.533 (+18.546)/22.423 (+7.557) | 37.498 (+21.850)/22.423 (+7.557) | 66.698 (+37.516)/22.423 (+7.557)
ADR (Ours) | 14.987 (+0.00)/14.866 (+0.00) | 15.648 (+0.00)/14.866 (+0.00) | 29.182 (+0.00)/14.866 (+0.00)
Method | 2WikiMultiHopQA | MuSiQue | Qasper | Avg
---|---|---|---|---
Llama2-7B-chat-4K (Baseline) | 35.5 | 9.0 | 16.0 | 20.17
Llama2-ablation | 38.0 | 15.0 | 16.5 | 23.17
Method | 2WikiMultiHopQA | MuSiQue | Qasper | Avg
---|---|---|---|---
Llama2-7B-chat-4K | 35.5 | 9.0 | 16.0 | 20.17
K = 10 | 39.5 | 16.5 | 16.5 | 24.17
K = 20 | 38.5 | 15.0 | 13.5 | 22.33
K = 30 | 38.0 | 18.5 | 15.0 | 23.83
K = 40 | 38.0 | 15.0 | 16.5 | 23.17
Model | Method | METEOR | ROUGE-L | CIDEr
---|---|---|---|---
PetroAI | – | 18.34 | 12.27 | 48.94
 | +RAG | 39.54 | 38.32 | 183.5
 | +RAG+ADR ↑ | 64.23 | 49.62 | 317.3
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, M.; Li, X.; Zeng, Q.; Liu, X.; Yang, M.; Jia, Z. ADR: Attention Head Detection and Reweighting Enhance RAG Performance in a Positional-Encoding-Free Paradigm. Information 2025, 16, 900. https://doi.org/10.3390/info16100900