Author Contributions
Conceptualization, Y.J. and Q.D.; methodology, Q.D.; validation, Y.J. and Q.D.; formal analysis, X.W.; investigation, Y.J.; resources, Q.D. and Y.J.; data curation, Y.J.; writing—original draft preparation, Y.J. and Q.D.; writing—review and editing, Y.J. and Q.D.; visualization, Y.J.; supervision, X.W.; project administration, X.W.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the College Students’ Innovation and Entrepreneurship Training Program, Grant Number: 202610004005.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are derived from publicly available resources. FinQA is available through the T2-RAGBench benchmark at https://huggingface.co/datasets/G4KMU/t2-ragbench (accessed on 14 April 2026), and the FinanceBench open-source subset is available from https://github.com/patronus-ai/financebench/tree/main# (accessed on 14 April 2026). The retrieval-oriented processed data generated in this study, including derived query–chunk relevance labels and evidence-to-chunk mappings, are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| BM25 | Best Matching 25 |
| BO | Bayesian Optimization |
| MRR | Mean Reciprocal Rank |
| nDCG | Normalized Discounted Cumulative Gain |
| QA | Question Answering |
| FinQA | Financial Question Answering |
| TPE | Tree-structured Parzen Estimator |
| RAG | Retrieval-Augmented Generation |
Figure 1.
FinQA data processing pipeline for retrieval experiments.
Figure 2.
Retrieval architecture comparison on the FinQA dev set: (a) Recall@k trends of Dense, BM25, and Hybrid Retrieval; (b) grouped comparison of key retrieval metrics.
Figure 3.
Development-set effectiveness of different optimization methods under the same search space. Note: gray, blue, and red denote Grid Search, Random Search, and Bayesian Optimization, respectively.
Figure 4.
Search efficiency and convergence behavior under limited optimization budgets.
Figure 5.
Transfer from development-selected configurations to held-out test performance.
Figure 6.
Performance trends across individual hyperparameters: (a) chunk size vs. objective; (b) overlap vs. objective; (c) fusion alpha vs. objective. The green arrow/annotation marks the highlighted parameter setting in each subplot.
Figure 7.
Mean objective across chunk size and overlap.
Figure 8.
Stable high-performance regions on the development and test sets.
Figure 9.
Statistical reliability analysis of representative configurations on the test set: (a) paired per-query differences; (b) bootstrap distribution.
Figure 10.
Representative configuration transfer on FinanceBench.
Table 1.
Manual relevance rates by label source on FinQA.
| Label Source | Manual Relevance Rate |
|---|---|
| Answer match | 100.0% |
| Boundary negative | 57.6% |
| Random negative | 27.6% |
| Retrieved top-k | 26.7% |
| Weak positive | 72.2% |
Table 2.
Manual validation results for evidence-to-chunk mappings on the FinanceBench open-source subset.
| Validation Item | Value |
|---|---|
| Sampled pairs | 80 |
| Positive mapping accuracy | 90.0% |
| Negative mapping accuracy | 85.0% |
| Overall agreement | 87.5% |
| Cohen’s κ | 0.75 |
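As a consistency check on the agreement figures in Table 2, the following minimal sketch computes Cohen's κ from a 2×2 agreement table. The table does not state how the 80 sampled pairs split between positive and negative mappings; the 40/40 split used here is an assumption, and the function name `cohens_kappa` is illustrative. Under that assumption the sketch reproduces the reported overall agreement (87.5%) and κ (0.75).

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 agreement table between two binary labelings."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement computed from the marginals of each labeling.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fp + tn) / n) * ((fn + tn) / n)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)


# ASSUMPTION: 40 positive and 40 negative sampled pairs (not stated in Table 2).
tp, fn = 36, 4  # 90.0% of 40 positive mappings confirmed
tn, fp = 34, 6  # 85.0% of 40 negative mappings confirmed

agreement = (tp + tn) / (tp + fn + fp + tn)
kappa = cohens_kappa(tp, fn, fp, tn)
print(f"overall agreement = {agreement:.3f}, kappa = {kappa:.2f}")
```

A different positive/negative split would change both the overall agreement and κ, so this check only shows that the reported values are mutually consistent under a balanced sample.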
Table 3.
Retrieval results for the different architectures.
| Architecture | Recall@5 | Precision@5 | nDCG@5 | Recall@10 | Precision@10 | nDCG@10 | MRR |
|---|---|---|---|---|---|---|---|
| Dense | 0.5119 | 0.1574 | 0.3738 | 0.6474 | 0.1061 | 0.4271 | 0.4104 |
| BM25 | 0.5812 | 0.1796 | 0.4206 | 0.7069 | 0.1185 | 0.4721 | 0.4466 |
| Hybrid | 0.6012 | 0.1844 | 0.4386 | 0.7342 | 0.1229 | 0.4937 | 0.4769 |
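For readers unfamiliar with the metrics in Table 3, the sketch below shows one common way to compute Recall@k, Precision@k, reciprocal rank, and nDCG@k for a single query. It assumes binary chunk-level relevance and a log2 discount; the function names, chunk identifiers, and the example ranking are hypothetical and are not taken from the study's code or data. Table-level values would be averages of such per-query scores.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant chunks that appear in the top k."""
    hits = sum(1 for c in ranked[:k] if c in relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k chunks that are relevant."""
    return sum(1 for c in ranked[:k] if c in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant chunk, or 0 if none is retrieved."""
    for rank, c in enumerate(ranked, start=1):
        if c in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary gains (an assumption of this sketch)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, c in enumerate(ranked[:k], start=1)
        if c in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query; "c1" and "c2" are the relevant chunks.
ranked = ["c3", "c1", "c7", "c2", "c9"]
relevant = {"c1", "c2"}

print(recall_at_k(ranked, relevant, 5))     # both relevant chunks retrieved
print(precision_at_k(ranked, relevant, 5))  # 2 of 5 retrieved chunks relevant
print(reciprocal_rank(ranked, relevant))    # first relevant chunk at rank 2
print(round(ndcg_at_k(ranked, relevant, 5), 4))
```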
Table 4.
Per-seed comparison of optimization methods on the FinQA dev set.
| Method | Seed | Budget | Best Trial | Best Dev Obj. | Best Config (cs, ov, α) | Runtime to Best (s) | Test Obj. | Dev → Test Drop |
|---|---|---|---|---|---|---|---|---|
| Grid | — | 60 | 52 | 0.5167 | (1000, 50, 0.5) | 8409.7 | 0.4641 | −0.0526 |
| Random | 42 | 20 | 8 | 0.5014 | (800, 0, 0.7) | 316.6 | 0.453 | −0.0484 |
| Random | 123 | 20 | 0 | 0.5027 | (1000, 50, 0.7) | 27 | 0.4583 | −0.0444 |
| Random | 7 | 20 | 5 | 0.5027 | (1000, 50, 0.7) | 238.2 | 0.4583 | −0.0444 |
| Random | 2024 | 20 | 10 | 0.5027 | (1000, 50, 0.7) | 422.1 | 0.4583 | −0.0444 |
| Random | 3407 | 20 | 7 | 0.5167 | (1000, 50, 0.5) | 744.9 | 0.4641 | −0.0526 |
| BO | 42 | 20 | 7 | 0.5014 | (800, 0, 0.7) | 288.9 | 0.453 | −0.0484 |
| BO | 123 | 20 | 1 | 0.5036 | (1000, 100, 0.5) | 92.5 | 0.4477 | −0.0559 |
| BO | 7 | 20 | 10 | 0.5167 | (1000, 50, 0.5) | 992.8 | 0.4641 | −0.0526 |
| BO | 2024 | 20 | 2 | 0.5014 | (800, 0, 0.7) | 116.4 | 0.453 | −0.0484 |
| BO | 3407 | 20 | 0 | 0.5167 | (1000, 50, 0.5) | 24.2 | 0.4641 | −0.0526 |
Table 5.
Method-level summary statistics for dev set effectiveness under the low-budget setting.
| Method | Best Dev Obj. (Mean ± Std) [Min–Max] | Best Config (Mode) | Runtime to Best (s) (Mean ± Std) [Min–Max] | Test Obj. (Mean ± Std) [Min–Max] | Dev → Test Drop (Mean ± Std) | Gap to Grid Best (Mean ± Std) | Hit Rate |
|---|---|---|---|---|---|---|---|
| Grid | 0.5167 [single run] | (1000, 50, 0.5) | 8409.7 [single run] | 0.4641 [single run] | −0.0526 [single run] | 0.0000 | Exhaustive reference |
| Random | 0.5052 ± 0.0064 [0.5014–0.5167] | (1000, 50, 0.7), 3/5 seeds | 349.8 ± 264.1 [27.0–744.9] | 0.4584 ± 0.0039 [0.4530–0.4641] | −0.0468 ± 0.0037 | 0.0115 ± 0.0064 | 1/5 |
| BO | 0.5080 ± 0.0080 [0.5014–0.5167] | tied: (800, 0, 0.7) and (1000, 50, 0.5), each 2/5 seeds | 303.0 ± 397.8 [24.2–992.8] | 0.4564 ± 0.0074 [0.4477–0.4641] | −0.0516 ± 0.0032 | 0.0087 ± 0.0080 | 2/5 |
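The method-level statistics in Table 5 can be recomputed from the per-seed values in Table 4. The sketch below does this for the best development objective. It assumes the reported spread is the sample standard deviation (n − 1 denominator) and that "Hit Rate" counts seeds whose best development objective equals the grid-search best; both readings match the reported numbers, but neither is stated explicitly in the tables. The helper name `summarize` is illustrative.

```python
from statistics import mean, stdev

# Best Dev Obj. per seed, copied from Table 4 (seeds 42, 123, 7, 2024, 3407).
best_dev = {
    "Random": [0.5014, 0.5027, 0.5027, 0.5027, 0.5167],
    "BO": [0.5014, 0.5036, 0.5167, 0.5014, 0.5167],
}
GRID_BEST = 0.5167  # best development objective found by grid search


def summarize(values, grid_best=GRID_BEST):
    """Mean, sample std, mean gap to the grid best, and hit rate."""
    gaps = [grid_best - v for v in values]
    hits = sum(1 for v in values if abs(v - grid_best) < 1e-9)
    return {
        "mean": round(mean(values), 4),
        "std": round(stdev(values), 4),  # ASSUMPTION: sample std (n - 1)
        "gap": round(mean(gaps), 4),
        "hit_rate": f"{hits}/{len(values)}",
    }


summary = {method: summarize(values) for method, values in best_dev.items()}
for method, stats in summary.items():
    print(method, stats)
```

The same helper can be applied to the test objective or runtime columns by swapping in the corresponding per-seed values.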
Table 6.
Development-to-test comparison of representative high-performing configurations.
| Configuration | Dev Objective | Test Objective | Drop |
|---|---|---|---|
| cs = 1000, ov = 50, α = 0.5 | 0.5167 | 0.4641 | 0.0526 |
| cs = 1000, ov = 100, α = 0.5 | 0.5036 | 0.4477 | 0.0559 |
| cs = 1000, ov = 50, α = 0.7 | 0.5027 | 0.4583 | 0.0444 |
| cs = 800, ov = 0, α = 0.7 | 0.5014 | 0.4530 | 0.0484 |
Table 7.
Architecture-level statistical comparison between BM25 and baseline Hybrid on the FinQA test set.
| Comparison | Mean BM25 nDCG@5 | Mean Hybrid nDCG@5 | Delta (Hybrid − BM25) | 95% Bootstrap CI | Permutation p-Value | Mean BM25 Recall@5 | Mean Hybrid Recall@5 |
|---|---|---|---|---|---|---|---|
| Baseline Hybrid vs. BM25 | 0.4115 | 0.4093 | −0.0023 | [−0.0181, 0.0128] | 0.7763 | 0.5634 | 0.5641 |
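Table 7 reports a 95% bootstrap confidence interval and a permutation p-value for the paired per-query difference between the two architectures. The sketch below illustrates one standard way to obtain such quantities: a percentile bootstrap over per-query differences and a two-sided sign-flip permutation test. The resampling counts, seeds, function names, and example differences are all assumptions made for illustration; the exact procedure used in the study is not specified in this section.

```python
import random


def paired_bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-query differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    low = means[int((alpha / 2) * n_boot)]
    high = means[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


def sign_flip_p_value(diffs, n_perm=2000, seed=0):
    """Two-sided sign-flip permutation test for a zero mean difference."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_perm + 1)


# HYPOTHETICAL per-query nDCG@5 differences (Hybrid minus BM25); not study data.
example_diffs = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.2, -0.2]
ci_low, ci_high = paired_bootstrap_ci(example_diffs)
p_value = sign_flip_p_value(example_diffs)
print(f"95% CI = [{ci_low:.4f}, {ci_high:.4f}], p = {p_value:.4f}")
```

An interval that contains zero together with a large p-value, as in the Table 7 row, indicates that the data do not distinguish the two systems on that metric.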
Table 8.
Bootstrap pairwise comparison results for representative configurations.
| Comparison | Mean1 | Mean2 | Delta | 95% CI | Raw p | Holm p |
|---|---|---|---|---|---|---|
| A vs. B | 0.4376 | 0.4326 | 0.005 | [−0.0072, 0.0174] | 0.4296 | 0.6087 |
| A vs. C | 0.4376 | 0.4273 | 0.0103 | [−0.0089, 0.0297] | 0.3044 | 0.6087 |
| A vs. D | 0.4376 | 0.4219 | 0.0157 | [0.0001, 0.0309] | 0.0459 | 0.1377 |
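The "Holm p" column in Table 8 can be checked with a short sketch of the Holm step-down adjustment applied to the three raw p-values. The function name `holm_adjust` is illustrative. The adjusted values come out as 0.6088, 0.6088, and 0.1377, which agree with the table to within 0.0001; the small difference is presumably due to the raw p-values being rounded to four decimals before this recomputation.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Raw p-values from Table 8, in the order A vs. B, A vs. C, A vs. D.
raw_p = [0.4296, 0.3044, 0.0459]
holm_p = holm_adjust(raw_p)
print([round(p, 4) for p in holm_p])
```

After adjustment, the A vs. D comparison no longer falls below 0.05, which is why the raw and Holm columns lead to different conclusions for that row.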
Table 9.
Architecture-level validation results on FinanceBench under the baseline setting.
| Architecture | Recall@5 | MRR | nDCG@5 | Precision@5 |
|---|---|---|---|---|
| Dense | 0.0865 | 0.1774 | 0.0951 | 0.0720 |
| Hybrid | 0.0684 | 0.1308 | 0.0740 | 0.0547 |
| BM25 | 0.0384 | 0.0564 | 0.0423 | 0.0280 |