A Dynamic-Selection-Based, Retrieval-Augmented Generation Framework: Enhancing Multi-Document Question-Answering for Commercial Applications
Abstract
1. Introduction
- Consideration 1: Context Retrieval Performance. All essential context for generating accurate answers must be retrieved, as the LLM’s internal knowledge alone may be insufficient. Consequently, the retrieval process should capture as much relevant input context as possible to minimize the likelihood of generating incorrect answers.
- Consideration 2: Cost Efficiency with Commercial LLM APIs. Building and operating an on-premises LLM can be prohibitively expensive, prompting many organizations to rely on commercial LLM APIs. However, as the prompt size increases, so does the associated token cost. Therefore, strategies to minimize token usage—such as filtering out redundant or unnecessary document chunks—are essential for cost-effective operation.
- Consideration 3: Service Operation Overhead. Unlike general-purpose QA services, commercial QA applications often serve a smaller user base and thus operate on a reduced scale. Securing a large-scale infrastructure may be impractical in such cases, making it necessary to reduce system complexity and minimize the computing overhead.
2. Architecture and Design of the DS-RAG Framework
2.1. DS-RAG Framework Architecture
2.2. Methodology of the EPQD Module
2.2.1. Case Study of the Existing Method
- Issue 1: Excessive Sub-Question Generation. This refers to generating superfluous sub-questions that are unnecessary for producing an answer. In Table 1, sub-questions 1 and 2 (“Who is Charlie Chaplin?” and “Who is Bruce Bilson?”) are not needed in light of sub-questions 3 and 4. These extra sub-questions needlessly broaden the retrieval scope, increase the number of retrieved chunks, and reduce system efficiency.
- Issue 2: Missing Key Information. In some instances, a sub-question omits essential details required to retrieve the correct chunks for generating an answer. For example, the original question in Table 2 requests the number of derivative breeds for two dog breeds, indicating that the “number of derivative breeds from the original” is a crucial piece of information. However, the generated sub-question neglects this component, potentially retrieving only a partial list of derivative breeds for “German Spitz” and “Norfolk”.
- Issue 3: Question Variations. Here, the sub-question includes content unrelated to the original inquiry. Sub-question 3 in Table 3 introduces an entirely new topic that does not appear in the original prompt, and the associated retrieved chunk may confuse the LLM’s answer generation.
2.2.2. Design of EPQD
- Segmentation Rule:
- Prompt Design:
- Consideration 1: Preserve the Original Question. Segmenting a question using conjunctions and commas effectively simplifies complex queries and facilitates the efficient generation of sub-questions by the LLM. However, excessive segmentation may result in overly brief phrases that risk losing the original question’s context and intent. To mitigate this, we generate sub-questions by incorporating each isolated phrase alongside the entire original question (see the sketch at the end of this subsection). This approach effectively addresses Section 2.2.1: Issues 2 and 3.
- Consideration 2: Ensure Completeness of Each Sub-Question. If sub-questions are overly fragmented, the retrieval may become excessively broad and risk omitting an essential document chunk. We thus ensure that each sub-question is grammatically intact and does not exceed the scope of the original question. This measure addresses Section 2.2.1: Issue 2.
- Consideration 3: Limit the Number of Sub-Questions. To reduce the possibility of redundant or extraneous sub-questions’ generation, we restrict the prompt to generate only one sub-question per segmented phrase. This addresses the inefficiency described in Section 2.2.1: Issue 1.
- Preserve Context. Questions in the HotpotQA dataset [24], for instance, may lose or alter their context and omit essential information when split by conjunctions or commas, producing overly simplistic sub-clauses even when the original question is relatively straightforward. This phenomenon triggers Issues 2 and 3, so we added an extra example to the prompt to alleviate such cases.
- Remove Duplicate Entities. When generating sub-questions by referencing the preceding parts of the original question, it is possible for unnecessary retrieved chunks to be redundantly included across multiple sub-questions. This redundancy can reduce the efficiency of retrieval. To mitigate this issue, we incorporated additional examples into the prompt to minimize the inclusion of extraneous content in the generated sub-questions.
- Case Study of EPQD:
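The following minimal sketch illustrates the segmentation rule and prompt-assembly idea described above. It is not the authors' implementation: the conjunction list, the function names, and the prompt wording are illustrative assumptions, and the actual sub-question generation is performed by an LLM using the prompts given in the appendices.

```python
import re

# Hypothetical conjunction/comma pattern; the paper does not specify the exact list.
CONJUNCTIONS = r",|\band\b|\bor\b|\bbut\b"

def segment_question(question: str) -> list[str]:
    """Split the question into phrases on commas and common conjunctions."""
    phrases = [p.strip() for p in re.split(CONJUNCTIONS, question)]
    return [p for p in phrases if p]

def build_epqd_prompts(question: str) -> list[str]:
    """Pair every segmented phrase with the entire original question (Consideration 1)
    and request exactly one complete sub-question per phrase (Considerations 2 and 3)."""
    prompts = []
    for phrase in segment_question(question):
        prompts.append(
            "Original question: " + question + "\n"
            "Focus phrase: " + phrase + "\n"
            "Generate exactly one complete sub-question for the focus phrase, "
            "staying strictly within the scope of the original question."
        )
    return prompts

if __name__ == "__main__":
    for p in build_epqd_prompts("Who was considered more iconic, Charlie Chaplin or Bruce Bilson?"):
        print(p, end="\n\n")
```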
2.3. Methodology of the DICS Module
2.3.1. Deriving Selection Criteria
- Step 1. Entity Identification:
- Step 2. Sub-Question Graph Generation:
- Extract the components, excluding the entities, from the sub-question to identify the relation type.
- Use the LLM to extract subject and target entity pairs for each relation type, storing the resulting pairs as a triple set. The entity appearing first in the sub-question is designated as the source node, and the next as the target node. We also record each node’s sub-question number and a unique node ID.
- If any entity remains unpaired after Step 2, we create a new triple by connecting the remaining entity to its immediately preceding entity via a null edge.
- Step 3. Question Graph Generation:
- Step 4. Selection Criteria Extraction:
- Identifying the Core Node. Nodes with an in-degree of two or more are treated as candidate core nodes. Among these candidates, the node highest in the question hierarchy is chosen as the core node, allowing selection criteria to be identified at the most general level of the original question. Algorithm 1 provides the details of this process. The time and space complexities of identifying the core node are determined by the number of nodes N and edges E in the question graph. Since the question graph is constructed from syntactic and semantic relationships, inter-node connectivity is limited and the graph is sparse. Initialization and graph traversal are executed at most N + E times, and if there are multiple root nodes, the priority-based queue sorting requires up to N log N operations. Therefore, the time complexity of Algorithm 1 is O(N log N), and its space complexity is O(N + E).
Algorithm 1 Identifying the Core Node
[Input] Gq: the question graph, comprising the triple set (sn, e, tn). Each node has the following attributes: sub-question number, node ID
[Output] core_node: the key node selected from Gq according to the given exploration rules
1: Initialize question_word_node[ ] ← None, core_node ← None, outgoing_edges[ ] ← None
2: root_node_list[ ] ← Nodes in triples with in-degree = 0
3: if Number of root_node_list = 1 then
4:   current_node ← root_node_list.get_first( )
5:   while current_node is not leaf node do
6:     outgoing_edges.insert(get_outgoing_edges(current_node))
7:     if Number of outgoing_edges ≥ 2 then
8:       core_node ← current_node
9:       if current_node is question word then
10:        root_node_list.insert(Nodes connected by outgoing_edges)
11:      end if
12:      break
13:    end if
14:    else
15:      current_node ← Node connected by outgoing_edges
16:    end if
17:  end while
18: end if
19: if Number of root_node_list > 1 then
20:   node_queue[ ] ← Sorted nodes of root_node_list[ ] by sub-question number, node ID in ascending order
21:   for each node ∈ node_queue[ ] do
22:     outgoing_edges.insert(get_outgoing_edges(node))
23:     if Number of outgoing_edges ≥ 2 then
24:       core_node ← node
25:       if core_node is question word then
26:         question_word_node.insert(core_node)
27:         core_node ← None
28:       else break
29:       end if
30:     else
31:       next_nodes ← Nodes connected by outgoing_edges
32:       node_queue.append(next_nodes)
33:     end if
34:   end for
35: end if
36: if core_node is None then
37:   core_node ← question_word_node.get_first( )
38: end if
39: return core_node
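As a rough, hedged illustration only, the Python sketch below mirrors the traversal of Algorithm 1 under the assumption that the question graph is supplied as a list of (source_node, edge, target_node) triples and that each node carries a sub-question number, a node ID, and a question-word flag; names such as Node and identify_core_node are ours, not the authors'.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: int               # unique node ID assigned during graph construction
    sub_q: int                 # sub-question number the node originates from
    text: str
    is_question_word: bool = False

def identify_core_node(triples):
    """Approximate Algorithm 1: walk down from the root node(s), ordered by
    sub-question number and node ID, and return the first node whose out-degree
    is at least two; question-word nodes are kept only as a fallback."""
    out_edges, in_degree, nodes = defaultdict(list), defaultdict(int), set()
    for src, _edge, tgt in triples:
        out_edges[src].append(tgt)
        in_degree[tgt] += 1
        nodes.update((src, tgt))

    roots = [n for n in nodes if in_degree[n] == 0]
    question_word_nodes, visited = [], set()
    queue = sorted(roots, key=lambda n: (n.sub_q, n.node_id))

    while queue:
        node = queue.pop(0)
        if node in visited:
            continue
        visited.add(node)
        successors = out_edges.get(node, [])
        if len(successors) >= 2:
            if node.is_question_word:
                question_word_nodes.append(node)   # keep as fallback candidate
            else:
                return node
        queue.extend(sorted(successors, key=lambda n: (n.sub_q, n.node_id)))

    return question_word_nodes[0] if question_word_nodes else None
```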
- Extracting the Selection Criteria. After identifying the core node, we extract the selection criteria from the question graph. In the subsequent chunk selection for the input context, the scope of the graph considered around the core node can be flexibly adjusted according to the range of the question intended for selection. Algorithm 2 outlines the selection criteria extraction process. Initializing and storing the triple set incoming to the core node is executed at most N + E times, while the repeated BFS traversals that store the triple sets outgoing from the core node are performed fewer than N(N + E) times. Therefore, the time complexity of Algorithm 2 is O(N² + NE) and the space complexity is O(N + E). Moreover, since the question graphs of most datasets contain at most several dozen nodes, Algorithm 2 can be executed efficiently even in desktop-level computing environments.
Algorithm 2 Selection Criteria Extraction
[Input] Gq, core_node
[Output] criteria_list[ ]: the selection criteria list derived from the question graph (Gq), represented as a list of triples (sn, e, tn)
1: Initialize criteria_list[ ] ← None, predecessors_list[ ] ← None
2: core_node_successors[ ] ← get_outgoing_nodes(core_node)
3: core_node_predecessors[ ] ← get_incoming_nodes(core_node)
4: for each incoming_node ∈ core_node_predecessors[ ] do
5:   edge ← find_edge_between_nodes(Gq, core_node, incoming_node)
6:   predecessors_list[ ] ← Add the triple (core_node, edge, incoming_node)
7:   Explore all predecessors of incoming_node using Breadth-First Search (BFS) and add all triples to predecessors_list[ ]
8: end for
9: for each outgoing_node ∈ core_node_successors[ ] do
10:   Create a new criteria
11:   edge ← find_edge_between_nodes(Gq, core_node, outgoing_node)
12:   criteria ← Add the triple (core_node, edge, outgoing_node)
13:   Explore all successors of outgoing_node using BFS and add all triples to criteria
14:   criteria ← Add all triples in predecessors_list[ ]
15:   criteria_list.insert(criteria)
16: end for
17: return criteria_list[ ]
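For illustration, a minimal Python sketch of the criteria extraction in Algorithm 2 follows, assuming the same triple-list representation as the previous sketch; the BFS helpers and the orientation of the stored triples are simplifying assumptions rather than the authors' implementation.

```python
from collections import defaultdict, deque

def extract_selection_criteria(triples, core_node):
    """Approximate Algorithm 2: build one criterion (a list of triples) per outgoing
    edge of the core node, each sharing the core node's predecessor triples."""
    out_triples = defaultdict(list)   # node -> triples leaving that node
    in_triples = defaultdict(list)    # node -> triples entering that node
    for triple in triples:
        src, _edge, tgt = triple
        out_triples[src].append(triple)
        in_triples[tgt].append(triple)

    def collect(start, index, upstream):
        """BFS over the chosen triple index, returning every triple reached."""
        seen, found, queue = {start}, [], deque([start])
        while queue:
            node = queue.popleft()
            for t in index.get(node, []):
                found.append(t)
                nxt = t[0] if upstream else t[2]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return found

    predecessors = collect(core_node, in_triples, upstream=True)   # shared upstream context
    criteria_list = []
    for src, edge, tgt in out_triples.get(core_node, []):
        criterion = [(src, edge, tgt)] + collect(tgt, out_triples, upstream=False)
        criterion += predecessors
        criteria_list.append(criterion)
    return criteria_list
```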
2.3.2. Chunk Selection for Input Context
- Selection Criteria Embedding:
- Retrieved Chunk Embedding:
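As a rough sketch only, the snippet below shows cosine-similarity-based selection between selection-criteria embeddings and retrieved-chunk embeddings, which Section 3.3.1 indicates the DICS module relies on. The embedding inputs, the per-criterion top-1 choice, and the threshold parameter are assumptions; the actual module additionally uses a trained graph-based selection (GS) model to produce the criteria representations.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def select_chunks(criteria_embeddings, chunk_embeddings, threshold=0.0):
    """For each selection criterion, keep the retrieved chunk whose embedding is most
    similar (cosine similarity) to the criterion embedding; the union of the kept
    chunks forms the input context."""
    selected = set()
    for c_vec in criteria_embeddings:
        scores = [cosine(c_vec, ch_vec) for ch_vec in chunk_embeddings]
        best = int(np.argmax(scores))
        if scores[best] >= threshold:
            selected.add(best)
    return sorted(selected)   # indices of the chunks selected for the input context
```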
3. Experimental Results and Performance Evaluation
3.1. Dataset
3.1.1. Analysis of Existing Datasets
- Short and Simple Questions: The questions are brief and structurally simple, making them insufficient for effectively comparing and analyzing the performance of our proposed query decomposition (QD) scheme.
- Limited Number of Ground Truth Chunks: Each question requires no more than two ground truth chunks, which restricts the ability to compare the performance of existing methods with the DICS module proposed in this paper.
- Incomplete Inclusion of Ground Truth Chunks: There are instances where not all ground truth chunks are included in the input context, and yet the LLM is still capable of generating correct answers.
- Multiple Descriptions for a Single Answer: The questions are formulated with multiple descriptions pertaining to a single correct answer. Consequently, even if not all ground truth chunks are retrieved, modern LLMs can still generate accurate answers.
- Presence of Redundant Chunks: Many chunks within the documents allow the LLM to infer the correct answers without necessarily retrieving all ground truth chunks, leading to ambiguity and confusion in performance measurement.
3.1.2. Construction of a Custom Multi-Document QA Dataset
- Question Generation: We first developed question frameworks and then selected 41 individuals from Wikipedia. Using data related to these individuals, we employed GPT4o [23] to generate complete questions. Additionally, the corresponding documents were created using Wikipedia web pages.
- Question Types: The question types included ranking types, which inquire about the rankings of individuals, and comparison types, which ask about the similarities or differences between individuals through comparisons. For ranking-type questions, the final answer is the name of a specific individual, whereas for comparison-type questions, the answer is “yes” or “no”. Both types are designed such that the correct answer does not explicitly exist within documents, requiring the LLM to infer the answer based on input context.
- Number of Ground Truth Chunks per Question: Each question includes between two and four individuals, with each individual having one associated ground truth chunk. Additionally, the ground truth chunks are inserted multiple times within the documents to create duplicate retrieved chunks.
- Question Complexity: To enhance the complexity of the questions and the difficulty of retrieval, questions were generated via the LLM to include two or more ancillary details about specific individuals, based on the content of the documents.
3.2. Experiment on the EPQD Module
3.2.1. Experimental Setup
- Experimental Method:
- Performance Metrics:
- Entity Addition Rate (EAR): This metric calculates the proportion of entities included in the sub-questions that do not exist in the original question. A lower EAR indicates that fewer new entities are generated during the QD process, addressing Section 2.2.1: Issue 3.
- Entity Omission Rate (EOR): This metric calculates the proportion of entities present in the original question that are absent in the set of entities in the sub-questions. A lower EOR signifies that fewer entities from the original question are omitted during the QD process, addressing Section 2.2.1: Issue 2.
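Formally, and assuming the denominators implied by the wording above (they are not stated explicitly), the two metrics can be written as

\[ \mathrm{EAR} = \frac{\lvert E_{\mathrm{sub}} \setminus E_{\mathrm{orig}} \rvert}{\lvert E_{\mathrm{sub}} \rvert}, \qquad \mathrm{EOR} = \frac{\lvert E_{\mathrm{orig}} \setminus E_{\mathrm{sub}} \rvert}{\lvert E_{\mathrm{orig}} \rvert}, \]

where \(E_{\mathrm{orig}}\) is the entity set of the original question and \(E_{\mathrm{sub}}\) is the union of the entity sets of the generated sub-questions.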
3.2.2. Results and Performance Evaluation
3.3. Experiment on the DICS Module
3.3.1. Experimental Setup
- Dataset:
- Training the Graph-based Selection (GS) Model:
- Benchmark Setup:
- NonQD-Reranking: This is a commonly used RAG system that does not perform QD. Instead, it retrieves documents using the original question and then conducts a reranking step between the original question and the retrieved chunks. This benchmark is designed to compare the potential improvement in retrieval performance gained through QD.
- EPQD-Reranking (SubQ): In this system, the EPQD module generates sub-questions. For each sub-question, retrieval is performed, and the single highest-ranked chunk is selected through a reranking process. These selected chunks are then aggregated to form the input context. This approach is intended to verify that applying EPQD can address Section 1: Consideration 1.
- EPQD-Reranking (OriginalQ): Similar to the above, the EPQD module creates sub-questions, and retrieval is conducted for each sub-question. However, unlike the previous system, all retrieved chunks are subsequently reranked based on their similarity to the original question, and the top-K retrieved chunks are combined to form the input context. This system utilizes a reranker for selection and serves as a benchmark for comparing performance with the DICS module.
- Performance Metrics:
- F1-Score: Conventional metrics for assessing retrieval performance, such as HIT@k, Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR), are primarily suited to measuring the effectiveness of a retriever or reranker. However, in the context of Section 1: Considerations 1 and 2, input context generation performance hinges on how many ground truth chunks are included in the input context and how many unnecessary chunks are avoided. Therefore, we employ Precision, Recall, and their combined form, the F1-Score, as the first performance metric. The calculation formulas for Precision and Recall are as follows:
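Assuming the standard chunk-level definitions consistent with the description above:

\[ \mathrm{Precision} = \frac{\lvert G \cap C \rvert}{\lvert C \rvert}, \qquad \mathrm{Recall} = \frac{\lvert G \cap C \rvert}{\lvert G \rvert}, \qquad \text{F1-Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \]

where \(C\) is the set of chunks placed in the input context and \(G\) is the set of ground truth chunks for the question.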
- Average Number of Chunks (ANC): This metric represents the average number of chunks contained in the input context. From the perspective of Section 1: Consideration 2, it serves as a metric for relatively comparing the LLM API usage costs.
- Parameter and Model Settings:
- Document Embedding Model: We utilized the text-embedding-ada-002 [38] model due to its favorable balance between cost and performance. Since the ground truth chunks in our dataset are relatively short, the chunk size was set to 250 and the overlap size to 50. To facilitate efficient retrieval, document embeddings were stored in a vector store provided by FAISS [5].
- Retrieval Configuration: When querying with the original question, we retrieved the top seven chunks; for sub-question queries, we retrieved the top three chunks. The vector store retriever of the LangChain framework [3] was used for both retrieval processes (a configuration sketch is given after this list).
- NonQD-Reranking and EPQD-Reranking (OriginalQ): After retrieval, the top four chunks were selected via reranking. Given that each question in our dataset has two to four ground truth chunks, we set k to 4.
- EPQD-Reranking (SubQ): For each sub-question, only the top chunk was selected after reranking.
- DS-RAG: In the DICS module, representations of the outgoing nodes from the core node were utilized as the selection criteria embeddings in the cosine similarity analysis. This approach was adopted to focus the selection process on the core elements and their surrounding information, rather than on the additional information included in the question.
- Answer Generation: GPT4o [23] was used as the LLM to generate the final answers. To ensure the LLM did not rely on its pre-trained knowledge, we designed prompts that strictly restricted such usage. Detailed prompt instructions can be found in Appendix A.3 and Appendix A.4.
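As referenced above, the following sketch shows how the document embedding and retrieval settings could be wired together with LangChain and FAISS. It is a hedged approximation under stated assumptions: import paths vary across LangChain versions, an OPENAI_API_KEY is assumed in the environment, and the splitter choice (RecursiveCharacterTextSplitter) is ours; only the numeric settings (chunk size 250, overlap 50, top-7/top-3 retrieval, text-embedding-ada-002) come from the text.

```python
# Assumed imports for recent LangChain releases; paths differ in older versions.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

def build_vector_store(documents):
    # Chunk size 250 with overlap 50, as in the Document Embedding Model item above.
    splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=50)
    chunks = splitter.split_documents(documents)
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    return FAISS.from_documents(chunks, embeddings)

def retrieve(vector_store, original_question, sub_questions):
    # Top-7 chunks for the original question, top-3 for each sub-question.
    original_retriever = vector_store.as_retriever(search_kwargs={"k": 7})
    sub_retriever = vector_store.as_retriever(search_kwargs={"k": 3})
    retrieved = {original_question: original_retriever.invoke(original_question)}
    for sub_question in sub_questions:
        retrieved[sub_question] = sub_retriever.invoke(sub_question)
    return retrieved
```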
3.3.2. Results and Performance Evaluation
4. Discussion and Conclusions
- Reducing LLM API Usage and Latency. Although the DS-RAG framework makes substantial efforts to minimize the cost of using an LLM API, it remains a significant burden from both cost and service latency perspectives. Meanwhile, various sizes of smaller large language models (SLLMs) are now available for on-premises deployment. If these SLLMs could be fine-tuned on the target domain data for a given application, it may become feasible to replace LLM usage entirely, enabling more practical service provision. Therefore, we plan to integrate currently available SLLMs with the DS-RAG framework to identify components that require further refinement and gradually enhance the system.
- Dataset Expansion and Diversity. We built a new dataset to analyze the performance of our framework in environments that necessitate a large number of multi-document chunks. However, real-world user questions are highly unpredictable and diverse, and our current dataset does not fully encompass this range. Moreover, although different domains have unique document characteristics, many existing datasets (including ours) rely on Wikipedia or other web data, limiting their ability to capture such domain-specific nuances. The evolving generation capabilities of LLMs offer potential solutions to this issue. Indeed, our study employed an LLM to construct a new dataset, demonstrating the feasibility of generating varied datasets through prompt engineering. In particular, if SLLMs can be fine-tuned with data from a specific domain, it may be possible to automatically generate domain-specialized QA datasets.
- Targeting the Domain. The defense domain stands out as one of the areas with the greatest need for multi-document QA applications, and yet related research is scarce. Despite the exponential growth in information collected by various intelligence, surveillance, and reconnaissance (ISR) assets and unmanned systems in field environments, many of these data go unused due to the practical impossibility of processing them all. Simple keyword-based retrieval can handle straightforward queries; however, because documents in the defense domain often contain fragmented or partial information, decision-makers require multi-document QA applications to synthesize critical insights. Future work will involve tailoring the DS-RAG framework to this domain, thereby enabling more effective information management and decision support in defense settings.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
RAG | Retrieval-Augmented Generation |
LLM | Large Language Model |
QA | Question-Answering |
QD | Question Decomposition |
EPQD | Entity-Preserving Question Decomposition |
DICS | Dynamic Input Context Selector |
DS-RAG | Dynamic-Selection-Based, Retrieval-Augmented Generation |
API | Application Programming Interface |
DPR | Dense Passage Retriever |
GS | Graph-Based Selection |
GNN | Graph Neural Network |
Appendix A
Appendix A.1
Instruction
1. Entities in all noun forms must be extracted.
2. Extract all entities with explicitly stated meanings in sentences.
3. Extract entities as specifically as possible, without duplication.
4. All entities should be individually meaningful; do not extract meaningless entities such as “be” verbs.
5. If a relationship is not explicitly stated, connect and extract related entities. If there is no relationship between entities, list them separately.
6. An interrogative word must be treated as an entity.
Appendix A.2
Instruction
1. Relationships should be selected as an entity number corresponding to the target and subject.
2. Only the numbers are used in the triple-set.
3. All entered relationship numbers must exist at least once.
Appendix A.3
Instruction
1. You are an assistant for question-answering tasks.
2. Use the following pieces of retrieved context to answer the question.
3. Answer using only the provided context. Do not use any background knowledge at all.
4. Provide the most accurate answer possible and respond using the full name of the subject mentioned in the question.
5. Provide only the full name, not a sentence.
6. If you don’t know the answer, say “I don’t know”.
Appendix A.4
Instruction
1. You are an assistant for question-answering tasks.
2. Use the following pieces of retrieved context to answer the question.
3. Answer using only the provided context. Do not use any background knowledge at all.
4. Answer with only “yes” or “no” without adding a comma or period at the end.
5. If you don’t know the answer, say “I don’t know”.
References
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2024, arXiv:2312.10997. [Google Scholar]
- LangChain. Available online: https://github.com/hwchase17/langchain (accessed on 9 January 2025).
- LlamaIndex. Available online: https://www.llamaindex.ai/ (accessed on 9 January 2025).
- Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2021, 7, 535–547. [Google Scholar] [CrossRef]
- Karpukhin, V.; Oğuz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.-T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020. [Google Scholar]
- Talmor, A.; Berant, J. The Web as a Knowledge-Base for Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018. [Google Scholar]
- Perez, E.; Lewis, P.; Yih, W.-T.; Cho, K.; Kiela, D. Unsupervised Question Decomposition for Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020. [Google Scholar]
- Min, S.; Zhong, V.; Zettlemoyer, L.; Hajishirzi, H. Multi-hop Reading Comprehension through Question Decomposition and Rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 29–31 July 2019. [Google Scholar]
- Hasson, M.; Berant, J. Question Decomposition with Dependency Graphs. arXiv 2021, arXiv:2104.08647. [Google Scholar]
- Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q.; et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv 2022, arXiv:2205.10625. [Google Scholar]
- Radhakrishnan, A.; Nguyen, K.; Chen, A.; Chen, C.; Denison, C.; Hernandez, D.; Durmus, E.; Hubinger, E.; Kernion, J.; Lukošiūtė, K.; et al. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning. arXiv 2023, arXiv:2307.11768. [Google Scholar]
- Glass, M.; Rossiello, G.; Chowdhury, M.F.M.; Naik, A.; Cai, P.; Gliozzo, A. Re2G: Retrieve, Rerank, Generate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022. [Google Scholar]
- Gao, T.; Yao, S.; Chen, X.; Nair, V.; Deng, Z.; Reddy, C.; Sun, F. Cohere-RARR: Relevance-Aware Retrieval and Reranking on the Open Web. arXiv 2023, arXiv:2311.01555. [Google Scholar]
- FlagEmbedding. Available online: https://github.com/FlagOpen/FlagEmbedding (accessed on 9 January 2025).
- FlashRank. Available online: https://github.com/PrithivirajDamodaran/FlashRank (accessed on 9 January 2025).
- Pereira, J.; Fidalgo, R.; Lotufo, R.; Nogueira, R. Visconde: Multi-Document QA with GPT-3 and Neural Reranking. In Proceedings of the 45th European Conference on Information Retrieval (ECIR 2023), Dublin, Ireland, 2–6 April 2023. [Google Scholar]
- Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open-Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online, 19–23 April 2021. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Gao, S.; Liu, Y.; Dou, Z.; Wen, J.-R. Knowledge Graph Prompting for Multi-Document Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
- Dong, J.; Fatemi, B.; Perozzi, B.; Yang, L.F.; Tsitsulin, A. Don’t Forget to Connect! Improving RAG with Graph-based Reranking. arXiv 2024, arXiv:2405.18414. [Google Scholar]
- He, X.; Tian, Y.; Sun, Y.; Chawla, N.; Laurent, T.; LeCun, Y.; Bresson, X.; Hooi, B. G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering. In Proceedings of the 2024 Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- OpenAI. ChatGPT. Available online: https://chat.openai.com (accessed on 9 January 2025).
- Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Online, 6–12 December 2020. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- spaCy. Available online: https://spacy.io/ (accessed on 1 September 2024).
- Chen, Y.; Zheng, Y.; Yang, Z. Prompt-Based Metric Learning for Few-Shot NER. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
- He, K.; Mao, R.; Huang, Y.; Gong, T.; Li, C.; Cambria, E. Template-Free Prompting for Few-Shot Named Entity Recognition via Semantic-Enhanced Contrastive Learning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 18357–18369. [Google Scholar] [CrossRef] [PubMed]
- Ashok, D.; Lipton, Z.C. PromptNER: Prompting for Named Entity Recognition. arXiv 2023, arXiv:2305.15444. [Google Scholar]
- Tang, Y.; Hasan, R.; Runkler, T. FsPONER: Few-Shot Prompt Optimization for Named Entity Recognition in Domain-Specific Scenarios. In Proceedings of the ECAI 2024, Santiago de Compostela, Spain, 19–24 October 2024. [Google Scholar]
- Liu, J.; Fei, H.; Li, F.; Li, J.; Li, B.; Zhao, L.; Teng, C.; Ji, D. TKDP: Threefold Knowledge-enriched Deep Prompt Tuning for Few-shot Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 2024, 36, 6397–6409. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
- Xue, L.; Zhang, D.; Dong, Y.; Tang, J. AutoRE: Document-Level Relation Extraction with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 11–16 August 2024. [Google Scholar]
- Li, X.; Chen, K.; Long, Y.; Zhang, M. LLM with Relation Classifier for Document-Level Relation Extraction. arXiv 2024, arXiv:2408.13889. [Google Scholar]
- Wadhwa, S.; Amir, S.; Wallace, B. Revisiting Relation Extraction in the era of Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
- Tang, Y.; Yang, Y. Multihop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. In Proceedings of COLM, Philadelphia, PA, USA, 7 October 2024. [Google Scholar]
- OpenAI. Embeddings. Available online: https://platform.openai.com/docs/guides/embeddings (accessed on 9 January 2025).
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Ferguson, J.; Gardner, M.; Hajishirzi, H.; Khot, T.; Dasigi, P. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020. [Google Scholar]
- Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Trans. Assoc. Comput. Linguist. 2020, 9, 346–361. [Google Scholar] [CrossRef]
- DS-RAG Framework. Available online: https://github.com/Mulsanne2/DS-RAG (accessed on 16 January 2025).
- van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021. [Google Scholar]
- Biten, A.F.; Tito, R.; Mafla, A.; Gomez, L.; Rusiñol, M.; Jawahar, C.V. Scene Text Visual Question Answering. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Sov. Phys. Dokl. 1966, 10, 707–710. [Google Scholar]
Original Question | Decomposed Sub-Questions |
---|---|
Who was considered more iconic, Charlie Chaplin or Bruce Bilson? |
|
Original Question | Decomposed Sub-Questions |
---|---|
Which Spaniel dog breed has more derivative breeds from the original, German Spitz or Norfolk? |
|
Original Question | Decomposed Sub-Questions |
---|---|
In what country are Ugni and Stenomesson native plants? |
|
Instruction |
---|
|
LangChain | EPQD |
---|---|
Original Question: Who was considered more iconic, Charlie Chaplin or Bruce Bilson? | |
|
|
Original Question: Which Spaniel dog breed has more derivative breeds from the original, German Spitz or Norfolk? | |
|
|
Original Question: In what country are Ugni and Stenomesson native plants? | |
|
|
Question Type | Split (Number of Individuals in Question) | Number of Questions | Percentage (%) |
---|---|---|---|
Ranking | Train-3 | 1500 | 37.5 |
Test-2 | 125 | 3.125 | |
Test-3 | 250 | 6.25 | |
Test-4 | 125 | 3.125 | |
Comparison | Train-3 | 1500 | 37.5 |
Test-2 | 125 | 3.125 | |
Test-3 | 250 | 6.25 | |
Test-4 | 125 | 3.125 |
Module | Experiment | EAR (%) | EOR (%) |
---|---|---|---|
LangChain | Exp. 1 | 48.67 | 23.65 |
Exp. 2 | 47.32 | 70.21 | |
EPQD | Exp. 1 | 20.29 | 20.35 |
Exp. 2 | 14.38 | 7.38 |
Question Type | Method | Precision | Recall | F1-Score | ANLS | ANC |
---|---|---|---|---|---|---|
Ranking Type | NonQD-Reranking | 0.0864 | 0.1093 | 0.0965 | 0.052 | 4 |
EPQD-Reranking (SubQ) | 0.3978 | 1 | 0.5692 | 0.8812 | 8.584 | |
EPQD-Reranking (OriginalQ) | 0.2873 | 0.3859 | 0.3294 | 0.2580 | 4 |
DS-RAG | 0.7260 | 0.9729 | 0.8315 | 0.8111 | 3.954 | |
Comparison Type | NonQD-Reranking | 0.1810 | 0.2030 | 0.1914 | 0.008 | 4 |
EPQD-Reranking (SubQ) | 0.4575 | 0.981 | 0.6234 | 0.934 | 7.456 | |
EPQD-Reranking (OriginalQ) | 0.3478 | 0.4732 | 0.4009 | 0.0960 | 4 |
DS-RAG | 0.9207 | 0.9247 | 0.9227 | 0.866 | 3.008 |
Model | Number of Parameters |
---|---|
BGE-Reranker-v2-m3 | 567,754,752 |
Graph-based Selection (GS) | 174,729,216 |