An Auditable LLM-RAG Architecture for Financial Document Intelligence and Decision Support
Abstract
1. Introduction
- We propose an auditability-focused LLM-RAG methodology for financial documents, based on state-machine orchestration, atomic state transitions, persistent provenance artifacts, and recoverable processing states.
- We define an evidence-centric retrieval and enrichment pipeline that combines semantic chunking, finance-oriented metadata extraction, dense–sparse retrieval, rank fusion, and late-interaction reranking.
- We introduce a validation layer that combines citation anchoring, schema-constrained JSON output, groundedness assessment, and strict numerical consistency checks for finance-critical values.
- We evaluate the approach on AI-FinanceQA and FinQA through retrieval metrics, groundedness, numerical consistency, structured-output compliance, latency analysis, enrichment ablation, and an auditability-oriented demonstration.
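
The orchestration idea in the first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the intermediate states `CHUNKED` and `INDEXED`, the `ALLOWED` transition map, the `DocumentRecord` class, and the JSON file layout are all hypothetical. Only the upload, ready, and error states are named in the article. The sketch assumes that "atomic" means a reader never observes a half-written state record, which is approximated here by writing to a temporary file and renaming it over the target.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field
from enum import Enum


class State(str, Enum):
    UPLOADED = "uploaded"
    CHUNKED = "chunked"   # hypothetical intermediate state
    INDEXED = "indexed"   # hypothetical intermediate state
    READY = "ready"
    ERROR = "error"


# Hypothetical transition map: each state lists the states it may move to.
ALLOWED = {
    State.UPLOADED: {State.CHUNKED, State.ERROR},
    State.CHUNKED: {State.INDEXED, State.ERROR},
    State.INDEXED: {State.READY, State.ERROR},
    State.ERROR: {State.UPLOADED, State.CHUNKED, State.INDEXED},  # recovery paths
    State.READY: set(),
}


@dataclass
class DocumentRecord:
    doc_id: str
    state: State = State.UPLOADED
    log: list = field(default_factory=list)

    def transition(self, new_state: State, path: str) -> None:
        """Check the move, then persist state and transition log together."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        log = self.log + [{"from": self.state.value, "to": new_state.value}]
        payload = {"doc_id": self.doc_id, "state": new_state.value, "log": log}
        # Write to a temp file in the target directory, then rename it over
        # the target so the stored record is replaced in a single step.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
        os.replace(tmp, path)
        # Update the in-memory view only after the record is persisted.
        self.state, self.log = new_state, log
```

Because the in-memory state changes only after the persisted record has been replaced, a crash between the two steps leaves the stored file as the source of truth, which is the property a resumable pipeline would rely on.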
2. Related Work
3. Materials and Methods
3.1. Auditable LLM-RAG Pipeline
3.2. AI-FinanceQA Benchmark: Construction and Availability
4. Experimental Results
4.1. LLM Backend Selection
4.2. Retrieval Ablation Study
4.3. Financial Enrichment Ablation
4.4. Chunking Sensitivity and Latency Breakdown
4.5. Additional Experiments on FinQA: Answer Distance and Context-Utility
4.6. Auditability-Oriented Demonstration
4.7. Analyst-Facing Decision-Support Example
5. Discussion
6. Future Work
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References





| Category | Specification |
|---|---|
| Document corpus | |
| Total documents | 312 |
| Document types and source categories | 126 regulatory filings from SEC EDGAR; 58 annual reports from corporate Investor Relations material; 44 earnings-call transcripts from transcript aggregators or equivalent providers; 39 investor presentations from corporate Investor Relations websites; 45 market-news items from publicly available financial-news sources and newswire-style feeds |
| Document provenance | Versioned document identifiers, primary document type, source references when available, ingestion timestamps, text-extraction settings, content hashes, chunk identifiers, chunk offsets, and evidence-span mappings |
| Temporal coverage | Documents published between 2020 and 2026, inclusive |
| Query and annotation layer | |
| Total queries | 524 |
| Query types | 142 descriptive extraction; 118 comparative analysis; 154 numerical questions; 110 compliance-oriented questions |
| Evidence annotations | 1486 supporting spans; median of two spans per query; each span mapped to normalized chunk identifiers |
| Answer format | Structured JSON with evidence chunk IDs, citation fields, numerical fields, and validation status |
| Annotation protocol | Two independent annotators and one adjudicator; Cohen’s κ reported for evidence presence and for answer type before adjudication |
| Evaluation and release | |
| Split | Document-level split: 218 documents for development/index tuning, 47 for validation, and 47 for held-out testing |
| Redistribution policy | Corpus not fully redistributable because of licensing constraints |
| Reproducible material | Queries, answer schemas, prompts, JSON schemas, evaluation scripts, retrieval configuration, anonymized metadata, content hashes, evidence-span offsets where permitted, and a limited redistributable subset |

| Block | Parameter | Value Used in the Main Experiments |
|---|---|---|
| Indexing | Vector store | Qdrant |
| Indexing | Dense representation | OpenAI text-embedding-3-small |
| Indexing | Sparse representation | SPLADE-style sparse vectors |
| Chunking | Chunk size/overlap | 768 tokens/128 tokens |
| Retrieval | Candidate pool | Top-100 dense candidates and top-100 sparse candidates |
| Retrieval | Fusion method | Reciprocal-rank fusion with constant 60 |
| Retrieval | Reranking and context | ColBERT-style late interaction; final top-5 chunks used as generation context |
| Generation | Prompt structure | System instructions, JSON schema, user query, retrieved evidence blocks, and citation instruction |
| Generation | Decoding settings | Temperature , top-p |
| Validation | Output control | JSON-schema validation with up to two automatic repair attempts for malformed outputs |
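
The output-control row above (schema validation with up to two automatic repair attempts) can be sketched as a bounded retry loop. This is an illustrative sketch under stated assumptions, not the pipeline's code: the field names in `REQUIRED_FIELDS` are hypothetical, the hand-written `check` function stands in for a full JSON-schema validator, and `repair` is a caller-supplied callback (for example, a re-prompt of the model) whose behavior is not specified in the article.

```python
import json
from typing import Callable, Optional

# Hypothetical required fields, loosely modeled on the answer format
# described in the article (evidence chunk IDs, validation status).
REQUIRED_FIELDS = {
    "answer": str,
    "evidence_chunk_ids": list,
    "validation_status": str,
}
MAX_REPAIRS = 2  # the configuration table allows up to two repair attempts


def check(raw: str) -> Optional[str]:
    """Return None if raw is acceptable, otherwise a short error message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value must be an object"
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            return f"field '{name}' missing or not {expected.__name__}"
    return None


def validate_with_repair(raw: str, repair: Callable[[str, str], str]):
    """Validate raw; on failure call repair(raw, error) at most MAX_REPAIRS times.

    Returns (parsed_object_or_None, repair_attempts_used, last_error_or_None).
    """
    attempts = 0
    error = check(raw)
    while error is not None and attempts < MAX_REPAIRS:
        raw = repair(raw, error)
        attempts += 1
        error = check(raw)
    parsed = json.loads(raw) if error is None else None
    return parsed, attempts, error
```

Returning the attempt count and the last error alongside the parsed object keeps the validation outcome available as a stored artifact, in line with the auditability goal.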

| LLM Backend | G (Groundedness) | J (JSON-Valid) | N (Numerical) | Latency (ms) | |
|---|---|---|---|---|---|
| gpt-3.5-turbo (baseline) | | | | | |
| Qwen2.5-7B-Instruct | | | | | |
| Llama-3.1-8B-Instruct | | | | | |
| gpt-4o-mini (selected) | | | | | |

| Retriever | Main Signal | P@5 | Recall@5 | NDCG@5 | MAP | MRR |
|---|---|---|---|---|---|---|
| Sparse-only | Lexical | 0.54 | 0.41 | 0.60 | 0.47 | 0.62 |
| Dense-only | Semantic | 0.58 | 0.45 | 0.65 | 0.50 | 0.66 |
| Hybrid | Dense+sparse | 0.63 | 0.51 | 0.70 | 0.55 | 0.72 |
| Hybrid+Rerank | Dense+sparse+ColBERT | 0.69 | 0.55 | 0.74 | 0.61 | 0.78 |
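
The hybrid rows combine dense and sparse rankings through reciprocal-rank fusion with constant 60. A minimal sketch of that fusion step follows. The function name, the chunk identifiers, and the tie-breaking rule (alphabetical by identifier) are assumptions made for illustration. The scoring rule itself is the standard one: each list contributes 1 / (k + rank) for every item it contains.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of chunk IDs into one list.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 in the configuration table).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism
    # (the tie-break is an assumption of this sketch).
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_n] if top_n is not None else fused


# Hypothetical candidate lists from the dense and sparse retrievers.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
print(reciprocal_rank_fusion([dense, sparse]))
# prints ['c1', 'c3', 'c9', 'c7']
```

Because the fusion uses ranks rather than raw scores, the dense and sparse scores do not need to be calibrated against each other before they are combined.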

| Condition | Indexed Content | P@5 | Recall@5 | NDCG@5 | |
|---|---|---|---|---|---|
| Raw chunks | Normalized chunk text only | 0.64 | 0.51 | 0.70 | 0.945 |
| Enriched chunks | Chunk text plus entities, tickers, sentiment, and topic labels | 0.69 | 0.55 | 0.74 | 0.953 |
| Absolute gain | Enriched − raw | +0.05 | +0.04 | +0.04 | +0.008 |
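
For readers who want to reproduce ranking metrics of the kind reported in the retrieval and enrichment tables, the sketch below computes P@k, Recall@k, reciprocal rank, and NDCG@k for a single query. It assumes binary relevance and a retrieved list without duplicates. The article does not state its exact metric definitions, so these standard formulations are an assumption, and the chunk identifiers are hypothetical. MRR is the mean of the reciprocal rank over all queries.

```python
import math


def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k=5):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical single-query example: two annotated evidence chunks.
retrieved = ["c1", "c4", "c2", "c8", "c9"]
relevant = {"c1", "c2"}
print(precision_at_k(retrieved, relevant))  # prints 0.4
print(recall_at_k(retrieved, relevant))     # prints 1.0
print(reciprocal_rank(retrieved, relevant)) # prints 1.0
```

Corpus-level figures such as those in the tables would be obtained by averaging these per-query values over the evaluation queries.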

| Dimension | Check | Observed Value |
|---|---|---|
| Citation coverage | Generated insights with at least one citation | 98.7% |
| Citation resolution | Citations resolvable to persisted chunk IDs and offsets | 97.8% |
| Artifact completeness | Outputs with complete retrieval scores, validation signals, and stored JSON artifacts | 99.1% |
| Pipeline traceability | Documents with complete state-transition logs from upload to ready/error states | 99.4% |
| Failure recovery | Injected recoverable failures correctly resumed without duplicated artifacts | 29/30 |
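
A citation-resolution check of the kind summarized in the table above could look like the following sketch. The citation record layout (`chunk_id`, `start`, `end`, `quote`) and the in-memory `chunk_store` dictionary are hypothetical, and the article does not specify how resolvability is tested. Here a citation counts as resolved only if its chunk exists, its offsets fall inside the chunk, and the quoted text matches the stored span exactly.

```python
def resolve_citation(citation, chunk_store):
    """Return True if the citation points at a real span of a persisted chunk."""
    text = chunk_store.get(citation["chunk_id"])
    if text is None:
        return False  # unknown chunk ID
    start, end = citation["start"], citation["end"]
    if not (0 <= start < end <= len(text)):
        return False  # offsets fall outside the chunk
    return text[start:end] == citation["quote"]


def resolution_rate(citations, chunk_store):
    """Share of citations that resolve; 0.0 for an empty list."""
    if not citations:
        return 0.0
    resolved = sum(1 for c in citations if resolve_citation(c, chunk_store))
    return resolved / len(citations)


# Hypothetical persisted chunk and two citations, one of them dangling.
chunk_store = {"c1": "Revenue rose 12% in 2023."}
citations = [
    {"chunk_id": "c1", "start": 0, "end": 12, "quote": "Revenue rose"},
    {"chunk_id": "c2", "start": 0, "end": 5, "quote": "Costs"},
]
print(resolution_rate(citations, chunk_store))  # prints 0.5
```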
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Cosentino, C.; Squillace, S.; Marozzo, F. An Auditable LLM-RAG Architecture for Financial Document Intelligence and Decision Support. Future Internet 2026, 18, 284. https://doi.org/10.3390/fi18060284

