Mitigating Hallucinations in Discipline Inspection QA: A Two-Stage RAG Framework with Late Interaction and Reranking
Abstract
1. Introduction
- A complete and reproducible integration of the original ColBERTv2 retrieval pipeline into a real-world Chinese legal QA system, addressing a notable absence in existing research.
- A systematic two-stage retrieval architecture that combines ColBERTv2’s late interaction with cross-encoder reranking, empirically validated on long legal document benchmarks.
- A reproducible, end-to-end system encompassing document preprocessing, indexing, hierarchical retrieval, and grounded generation, with full code and evaluation data released to the community.
2. Related Work
2.1. Legal Information Retrieval and Legal QA
2.2. ColBERT and Late Interaction
2.3. Two-Stage Retrieval and Reranking
3. Methodology
3.1. Formal Specification of Forensic-Grade Document Segmentation and Provenance Mapping
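Since Section 3.1 formalizes document segmentation with provenance mapping, a minimal sketch of overlapping chunking that tags each chunk with its exact source span may help orient the reader. All names and the window/overlap values here are illustrative assumptions, not the paper's actual parameters:

```python
# Minimal sketch: overlapping character-window chunking with provenance metadata.
# The window/overlap values are illustrative assumptions, not the paper's settings.

def chunk_with_provenance(doc_id: str, text: str, window: int = 512, overlap: int = 128):
    """Split `text` into overlapping chunks, each tagged with its source span."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    chunks, start = [], 0
    step = window - overlap
    while start < len(text):
        end = min(start + window, len(text))
        chunks.append({
            "doc_id": doc_id,      # which source document this chunk came from
            "char_start": start,   # offset of chunk start in the original text
            "char_end": end,       # offset of chunk end (exclusive)
            "text": text[start:end],
        })
        if end == len(text):
            break
        start += step
    return chunks

chunks = chunk_with_provenance("reg-001", "x" * 1000, window=400, overlap=100)
# Any retrieved chunk can then be traced back to (doc_id, char_start, char_end),
# which is what makes cited answers auditable.
```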

3.2. Formalized High-Recall Dense Retrieval via ColBERTv2 with PLAID Pruning
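Section 3.2's first stage relies on ColBERTv2's late interaction, which scores a query against a document as the sum, over query tokens, of each token's maximum cosine similarity to any document token (MaxSim). A self-contained NumPy sketch with random unit vectors illustrates the scoring rule; the real system uses contextualized BERT token embeddings and a PLAID-compressed index rather than raw matrices:

```python
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Late-interaction (MaxSim) score.
    Q: (num_query_tokens, dim) L2-normalized query token embeddings.
    D: (num_doc_tokens, dim) L2-normalized document token embeddings.
    Score = sum over query tokens of the max cosine similarity to any doc token."""
    sim = Q @ D.T                    # (q_tokens, d_tokens) cosine similarities
    return float(sim.max(axis=1).sum())

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = unit(rng.normal(size=(4, 8)))
# A "relevant" document that contains the query's own token vectors,
# versus a document of unrelated random tokens.
D_relevant = np.vstack([Q, unit(rng.normal(size=(6, 8)))])
D_random = unit(rng.normal(size=(10, 8)))

assert maxsim_score(Q, D_relevant) > maxsim_score(Q, D_random)
```

Because each query token matches independently, MaxSim rewards documents that cover all query terms, which is one reason late interaction is attractive for long statute passages.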
3.3. Precision-Oriented Reranking and Legally Compliant Answer Generation
| Algorithm 1 Traceable Retrieval-Augmented Generation for Disciplinary Inspection |
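The retrieve-then-rerank shape of Algorithm 1 can be sketched with toy scorers standing in for ColBERTv2 late interaction (stage 1, recall-oriented) and the cross-encoder (stage 2, precision-oriented); the scorer definitions and the k/n values below are illustrative only, not the paper's components:

```python
# Sketch of the two-stage retrieve-then-rerank flow. The toy scorers stand in
# for ColBERTv2 (stage 1) and a cross-encoder reranker (stage 2).

def two_stage_retrieve(query, corpus, stage1_score, stage2_score, k=100, n=5):
    """corpus: list of (chunk_id, text). Returns top-n (chunk_id, score) pairs."""
    # Stage 1: cheap, high-recall scoring over the whole corpus.
    candidates = sorted(corpus, key=lambda c: stage1_score(query, c[1]), reverse=True)[:k]
    # Stage 2: expensive, high-precision rescoring of the k candidates only.
    reranked = sorted(
        ((cid, stage2_score(query, text)) for cid, text in candidates),
        key=lambda pair: pair[1], reverse=True,
    )
    return reranked[:n]

# Toy scorers: raw token overlap (recall-oriented) vs. overlap ratio (precision-oriented).
def overlap(q, d):
    return len(set(q.split()) & set(d.split()))

def overlap_ratio(q, d):
    dt = set(d.split())
    return len(set(q.split()) & dt) / (len(dt) or 1)

corpus = [("c1", "party discipline inspection rules"),
          ("c2", "discipline inspection procedures and rules of evidence handling"),
          ("c3", "unrelated road traffic regulations")]
top = two_stage_retrieve("discipline inspection rules", corpus, overlap, overlap_ratio, k=2, n=1)
# c1 and c2 tie on overlap, but the denser c1 wins after reranking.
```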
4. Experiments
4.1. Experimental Setup
4.2. Comprehensive Performance Analysis
4.2.1. Quantitative Benchmarking
4.2.2. Human Expert Preference Study
4.2.3. Discussion and Implications
4.3. Ablation Studies
4.3.1. Effect of Reranking Intensity
4.3.2. Effect of First-Stage Recall Size
4.3.3. Impact of Overlapping Chunking and Provenance Metadata
4.3.4. Influence of Generation Temperature and Prompt Strictness
5. Discussion: System Efficiency and Deployment Considerations
5.1. GPU Memory Footprint vs. Retrieval Depth
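As a rough illustration of the trade-off Section 5.1 analyzes, a residual-compressed index footprint can be estimated with bits-per-dimension arithmetic. The corpus size, embedding dimension, and id width below are assumptions for illustration, not the paper's measured numbers:

```python
# Back-of-envelope memory estimate for a 2-bit late-interaction index.
# All constants here are illustrative assumptions.

def index_bytes(num_tokens: int, dim: int = 128, nbits: int = 2,
                centroid_id_bytes: int = 4) -> float:
    residual = num_tokens * dim * nbits / 8   # nbits per dimension per token
    ids = num_tokens * centroid_id_bytes      # one nearest-centroid id per token
    return residual + ids

# e.g. 10M document tokens at dim=128 with 2-bit residuals:
gb = index_bytes(10_000_000) / 1e9            # ~0.36 GB
```

The same corpus stored as uncompressed float16 token vectors (128 dims × 2 bytes) would take roughly 2.56 GB, which is the gap the 2-bit scheme is meant to close.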
5.2. Deployment on Edge or On-Premise Systems
5.3. Comparison with Long-Context LLMs from a Hardware-Efficiency Perspective
6. Materials
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Surden, H. Artificial intelligence and law: An overview. Ga. State Univ. Law Rev. 2018, 35, 1305–1337. [Google Scholar]
- Locke, D.; Zuccon, G. Case law retrieval: Problems, methods, challenges and evaluations in the last 20 years. arXiv 2022, arXiv:2202.07209. [Google Scholar] [CrossRef]
- Supreme People’s Court of China. China Judgments Online Statistics (as of 2024). Available online: https://wenshu.court.gov.cn (accessed on 15 December 2024).
- Robertson, S.; Zaragoza, H. The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retr. 2009, 3, 333–389. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
- Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 3214–3252. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Cuconasu, F.; Trappolini, G.; Siciliano, F.; Filice, S.; Campagnano, C.; Maarek, Y.; Tonellotto, N.; Silvestri, F. The power of noise: Redefining retrieval for RAG systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 719–729. [Google Scholar]
- Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.; Li, Q. A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 6491–6501. [Google Scholar]
- Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv 2024, arXiv:2402.03216. [Google Scholar]
- Wang, L.; Yang, N.; Huang, X.; Jiao, B.; Yang, L.; Jiang, D.; Majumder, R.; Wei, F. Text embeddings by weakly-supervised contrastive pre-training. arXiv 2022, arXiv:2212.03533. [Google Scholar]
- Khattab, O.; Zaharia, M. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 39–48. [Google Scholar]
- MacAvaney, S.; Nardini, F.M.; Perego, R.; Tonellotto, N.; Goharian, N.; Frieder, O. Efficient document re-ranking for transformers by precomputing term representations. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 49–58. [Google Scholar]
- Santhanam, K.; Khattab, O.; Saad-Falcon, J.; Potts, C.; Zaharia, M. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 3715–3734. [Google Scholar]
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.S.H.; Wu, L.; Edunov, S.; Chen, D.; Yih, W. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Virtual Event, 16–20 November 2020; pp. 6769–6781. [Google Scholar]
- Xiong, L.; Xiong, C.; Li, Y.; Tang, K.; Liu, J.; Bennett, P.; Ahmed, J.; Overwijk, A. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv 2020, arXiv:2007.00808. [Google Scholar] [CrossRef]
- Guha, N.; Nyarko, J.; Ho, D.; Ré, C.; Chilton, A.; Chohlas-Wood, A.; Peters, A.; Waldon, B.; Rockmore, D.; Zambrano, D.; et al. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 44123–44279. [Google Scholar] [CrossRef]
- Xiao, C.; Zhong, H.; Guo, Z.; Tu, C.; Liu, Z.; Sun, M.; Feng, Y.; Han, X.; Hu, Z.; Wang, H.; et al. CAIL2018: A large-scale legal dataset for judgment prediction. arXiv 2018, arXiv:1807.02478. [Google Scholar] [CrossRef]
- Li, H.; Shao, Y.; Wu, Y.; Ai, Q.; Ma, Y.; Liu, Y. LeCaRDv2: A large-scale Chinese legal case retrieval dataset. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 2251–2260. [Google Scholar]
- Zhong, H.; Xiao, C.; Tu, C.; Zhang, T.; Liu, Z.; Sun, M. JEC-QA: A legal-domain question answering dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9701–9708. [Google Scholar]
- Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv 2021, arXiv:2104.08663. [Google Scholar] [CrossRef]
- Saad-Falcon, J.; Fu, D.Y.; Arora, S.; Guha, N.; Ré, C. Benchmarking and building long-context retrieval models with LoCo and M2-BERT. arXiv 2024, arXiv:2402.07440. [Google Scholar]
- Santhanam, K.; Khattab, O.; Potts, C.; Zaharia, M. PLAID: An efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 1747–1756. [Google Scholar]
- Nair, S.; Yang, E.; Lawrie, D.; Duh, K.; McNamee, P.; Murray, K.; Mayfield, J.; Oard, D.W. Transfer learning approaches for building cross-language dense retrieval models. In Proceedings of the European Conference on Information Retrieval, Stavanger, Norway, 10–14 April 2022; pp. 382–396. [Google Scholar]
- Khattab, O.; Potts, C.; Zaharia, M. Relevance-guided supervision for OpenQA with ColBERT. Trans. Assoc. Comput. Linguist. 2021, 9, 929–944. [Google Scholar] [CrossRef]
- Zhang, Y.; Long, D.; Xu, G.; Xie, P. HLATR: Enhance multi-stage text retrieval with hybrid list aware transformer reranking. arXiv 2022, arXiv:2205.10569. [Google Scholar] [CrossRef]
- Es, S.; James, J.; Anke, L.E.; Schockaert, S. RAGAS: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julian’s, Malta, 17–22 March 2024; pp. 150–158. [Google Scholar]
- Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 technical report. arXiv 2024, arXiv:2407.10671. [Google Scholar]
- Zirnstein, B. Extended Context for InstructGPT with LlamaIndex; Technical Report; Hochschule für Wirtschaft und Recht Berlin: Berlin, Germany, 2023. [Google Scholar]
- Zhao, C.; Deng, C.; Ruan, C.; Dai, D.; Gao, H.; Li, J.; Zhang, L.; Huang, P.; Zhou, S.; Ma, S.; et al. Insights into DeepSeek-V3: Scaling challenges and reflections on hardware for AI architectures. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, Vancouver, BC, Canada, 28 June–2 July 2025; pp. 1731–1745. [Google Scholar]
- Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
- Chen, W.; Chen, J.; Zou, F.; Li, Y.; Lu, P.; Wang, Q.; Zhao, W. Vector and line quantization for billion-scale similarity search on GPUs. Future Gener. Comput. Syst. 2019, 99, 295–307. [Google Scholar]



| Parameter | Value |
|---|---|
| Model Checkpoint | colbert-ir/colbertv2.0 |
| Quantization Scheme | 2-bit PLAID |
| Batch Size | 8 |
| Index Path | ./colbert_index |
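The "2-bit PLAID" setting in the table above refers to residual compression: each token embedding is stored as a nearest-centroid id plus a 2-bit-per-dimension quantized residual. A simplified NumPy sketch of this idea (uniform global bucket edges here for brevity, whereas the real engine learns per-dimension buckets from data):

```python
import numpy as np

def compress(vecs, centroids, nbits=2):
    """Store each vector as (nearest-centroid id, 2-bit-per-dim residual codes)."""
    dists = ((vecs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    residual = vecs - centroids[ids]
    # Uniform quantization of residuals into 2**nbits buckets per dimension.
    lo, hi = residual.min(), residual.max()
    levels = 2 ** nbits
    codes = np.clip(((residual - lo) / (hi - lo + 1e-12) * levels).astype(int),
                    0, levels - 1)
    return ids, codes, (lo, hi)

def decompress(ids, codes, bounds, centroids, nbits=2):
    """Reconstruct residuals at bucket centers and add the centroid back."""
    lo, hi = bounds
    levels = 2 ** nbits
    residual = lo + (codes + 0.5) * (hi - lo) / levels
    return centroids[ids] + residual

rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 16))
centroids = rng.normal(size=(8, 16))
ids, codes, bounds = compress(vecs, centroids)
approx = decompress(ids, codes, bounds, centroids)
# Per-element reconstruction error is bounded by half a bucket width.
```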
| Method | Faith. LLM Score | Compl. LLM Score | Ans-Rel. LLM Score | Clarity LLM Score | Con-Rel. LLM Score | Overall LLM Score |
|---|---|---|---|---|---|---|
| Set 1 (10 K–50 K Tokens) | ||||||
| Long-context [29] | 78.3 ± 1.9 | 75.6 ± 2.1 | 67.5 ± 2.4 | 81.3 ± 1.7 | 87.1 ± 1.8 | 76.0 ± 1.6 |
| RAG [8] | 76.7 ± 2.0 | 72.8 ± 2.3 | 67.8 ± 2.5 | 74.6 ± 2.2 | 83.3 ± 2.0 | 73.3 ± 1.8 |
| LlamaIndex + FAISS [30] | 77.4 ± 1.8 | 75.4 ± 2.0 | 70.3 ± 2.2 | 75.7 ± 2.1 | 80.0 ± 2.3 | 75.0 ± 1.7 |
| LawRAG (ours) | 78.4 ± 1.8 | 77.5 ± 1.9 | 78.5 ± 2.1 | 77.1 ± 2.0 | 76.0 ± 2.2 | 79.8 ± 1.6 |
| Set 2 (50 K–150 K Tokens) | ||||||
| Long-context [29] | 70.7 ± 2.4 | 73.9 ± 2.2 | 62.6 ± 2.7 | 85.2 ± 1.9 | 38.5 ± 3.1 | 68.9 ± 2.0 |
| RAG [8] | 75.3 ± 2.1 | 70.3 ± 2.5 | 66.0 ± 2.6 | 60.1 ± 2.8 | 80.2 ± 2.2 | 70.5 ± 1.9 |
| LlamaIndex + FAISS [30] | 78.4 ± 1.9 | 75.4 ± 2.1 | 76.3 ± 2.2 | 75.9 ± 2.3 | 72.3 ± 2.4 | 76.1 ± 1.8 |
| LawRAG (ours) | 78.8 ± 1.9 | 77.1 ± 2.0 | 77.9 ± 2.1 | 78.4 ± 2.1 | 73.1 ± 2.3 | 76.4 ± 1.7 |
| Set 3 (150 K–300 K Tokens) | ||||||
| Long-context [29] | 64.0 ± 2.8 | 65.2 ± 2.7 | 58.3 ± 3.0 | 60.8 ± 2.9 | 73.0 ± 2.6 | 56.3 ± 2.5 |
| RAG [8] | 65.3 ± 2.7 | 59.8 ± 3.1 | 60.0 ± 2.9 | 63.1 ± 2.8 | 70.6 ± 2.7 | 56.0 ± 2.4 |
| LlamaIndex + FAISS [30] | 73.5 ± 2.4 | 74.1 ± 2.5 | 70.2 ± 2.6 | 73.0 ± 2.5 | 73.0 ± 2.5 | 65.3 ± 2.2 |
| LawRAG (ours) | 76.4 ± 2.2 | 76.0 ± 2.3 | 76.5 ± 2.4 | 77.2 ± 2.3 | 75.2 ± 2.3 | 74.1 ± 2.0 |
| First-Stage Recall k | Faithfulness | Completeness | Ans-Relevancy | Overall ↑ | Latency (ms) |
|---|---|---|---|---|---|
| 50 | | | | 71.08 | 52 |
| 100 | | | | 75.95 | 68 |
| 200 | | | | 77.50 | 92 |
| 400 | | | | 77.71 | 168 |
| Reranked Candidates | Faith. | Compl. | Ans-Rel. | Clarity and Logic | Con-Rel. | Overall ↑ | Latency (ms) |
|---|---|---|---|---|---|---|---|
| 0 | | | | | | 72.89 | 92 |
| 5 | | | | | | 77.50 | 92 |
| 10 | | | | | | 77.87 | 92 |
| 20 | | | | | | 77.95 | 92 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Hu, C.; Huang, Y.; Kuang, J.; Dai, B.; Peng, Y.; Xiao, Y.; Su, Y. Mitigating Hallucinations in Discipline Inspection QA: A Two-Stage RAG Framework with Late Interaction and Reranking. Electronics 2026, 15, 541. https://doi.org/10.3390/electronics15030541

