Systematic Review

A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges

1 Advanced Research and Engineering Centre (ARC), Queen’s University Belfast, Belfast BT7 1NN, UK
2 Institute of Computing (IoC), Kohat University of Science & Technology (KUST), Kohat 26000, Pakistan
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(12), 320; https://doi.org/10.3390/bdcc9120320
Submission received: 10 September 2025 / Revised: 18 November 2025 / Accepted: 25 November 2025 / Published: 12 December 2025
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))

Abstract

Background: Retrieval-augmented generation (RAG) aims to reduce hallucinations and outdated knowledge by grounding LLM outputs in retrieved evidence, but empirical results are scattered across tasks, systems, and metrics, limiting cumulative insight. Objective: We aimed to synthesise empirical evidence on RAG effectiveness versus parametric-only baselines, map datasets/architectures/evaluation practices, and surface limitations and research gaps. Methods: This systematic review was conducted and reported in accordance with PRISMA 2020. We searched the ACM Digital Library, IEEE Xplore, Scopus, ScienceDirect, and DBLP; all sources were last searched on 13 May 2025. We included studies from January 2020–May 2025 that addressed RAG or similar retrieval-supported systems producing text output, met citation thresholds (≥15 for 2025; ≥30 for 2024 or earlier), and offered original contributions; we excluded non-English items, irrelevant works, duplicates, and records without accessible full text. Bias was appraised with a brief checklist; screening used one reviewer with an independent check and discussion. LLM suggestions were advisory only; citation thresholds for 2025 were lowered to limit citation-lag bias. We used a descriptive approach to synthesise the results, organising studies by themes aligned to RQ1–RQ4 and reporting summary counts/frequencies; no meta-analysis was undertaken due to heterogeneity of designs and metrics. Results: We included 128 studies spanning knowledge-intensive tasks (35/128; 27.3%), open-domain QA (20/128; 15.6%), software engineering (13/128; 10.2%), and medical domains (11/128; 8.6%). Methods have shifted from DPR + seq2seq baselines to modular, policy-driven RAG with hybrid/structure-aware retrieval, uncertainty-triggered loops, memory, and emerging multimodality. Evaluation remains overlap-heavy (EM/F1), with increasing use of retrieval diagnostics (e.g., Recall@k, MRR@k), human judgements, and LLM-as-judge protocols.
Efficiency and security (poisoning, leakage, jailbreaks) are growing concerns. Discussion: Evidence supports a shift to modular, policy-driven RAG, combining hybrid/structure-aware retrieval, uncertainty-aware control, memory, and multimodality, to improve grounding and efficiency. To advance from prototypes to dependable systems, we recommend: (i) holistic benchmarks pairing quality with cost/latency and safety, (ii) budget-aware retrieval/tool-use policies, and (iii) provenance-aware pipelines that expose uncertainty and deliver traceable evidence. We note the evidence base may be affected by citation-lag bias arising from the citation thresholds and by English-only, five-library coverage. Funding: Advanced Research and Engineering Centre. Registration: Not registered.

1. Introduction

Over the past five years, Large Language Models (LLMs) have transformed how researchers and practitioners process text. Retrieval-Augmented Generation (RAG) addresses key shortcomings of these models—such as hallucinated facts, outdated world knowledge, and the challenges of knowledge-intensive or domain-specific queries—by enabling a generative model to query an external corpus at inference time. This approach combines parametric memory learned during pre-training with non-parametric evidence retrieved on demand [1].
Traditional retrieval systems can locate relevant passages but cannot compose new text, while purely generative models produce fluent language yet risk factual errors when external knowledge is required. RAG integrates both paradigms, offering factual grounding without sacrificing fluency.
Despite rapid innovation, empirical results on RAG effectiveness remain fragmented across domains, datasets, and evaluation practices. Without systematic synthesis, progress risks duplication and inconsistent benchmarking. To address this gap, we conduct a PRISMA 2020 guided, citation-weighted systematic review of 128 studies published between 2020 and May 2025. The records were retrieved from the ACM Digital Library, IEEE Xplore, Scopus, ScienceDirect, and DBLP. To mitigate citation-lag bias, we applied a lower citation-count threshold to 2025 publications; full details of the search strategy, screening, and inclusion/exclusion criteria are provided in Section 3.
This review catalogues datasets, architectures, and evaluation practices, and synthesises empirical evidence on the effectiveness and limitations of RAG. It is intended for both NLP researchers, who may use it to identify knowledge gaps and future directions, and NLP engineers seeking practical guidance for deploying RAG in applied settings. By consolidating methods, metrics, and deployment challenges, we aim to advance the development of robust and scalable retrieval-augmented systems.
Our objectives are to (i) synthesise empirical evidence comparing RAG to parametric-only baselines, (ii) map datasets, architectures, and evaluation practices, and (iii) surface limitations and research gaps relevant to dependable RAG systems. To structure this review, we formulate four guiding research questions (Table 1) that collectively define the scope and analytical focus of the study.
The remainder of this paper is organised as follows. Section 2 surveys previous work across five themes, mapping them to RQ1–RQ4 and positioning our PRISMA-guided, citation-weighted synthesis within existing taxonomies and trends. Section 3 then describes the materials and methods, including search strategies, screening procedures, and inclusion/exclusion criteria, following the PRISMA 2020 framework. Section 4 presents the results of the review, categorising the included studies according to the four research questions (RQ1–RQ4) and summarising key trends and findings. Section 5 discusses the implications of these findings, highlighting methodological patterns, limitations, and directions for future research. Finally, Section 7 concludes by summarising the main findings and offering recommendations for the design, evaluation, and deployment of retrieval-augmented generation systems. These sections address the guiding research questions by providing a transparent, PRISMA-aligned synthesis of the current state of knowledge on RAG.

2. Related Work

We adopt the RAG formulation of Lewis et al. [1] as our reference point and characterise subsequent developments relative to it. Guided by our PRISMA protocol and employing a reduced citation threshold for 2025 to mitigate citation lag, we organise related work into five themes: foundations and taxonomies; retrieval and indexing; integration and pipelines; structured and multimodal knowledge; and evaluation and trustworthiness. Throughout, we map contributions to RQ1–RQ4 and highlight points that inform the analyses that follow.
Various surveys in this area have mapped what to retrieve, how to retrieve it, and how to fuse non-parametric evidence with generators across tasks such as QA, dialogue, Machine Translation, summarisation, code, and beyond [2,3,4,5,6]. A recurring distinction is made between baseline RAG (often single-pass) and more advanced or modular pipelines, with multiple surveys and empirical results indicating that retrieval choice and context curation substantially affect downstream performance [3,5]. Retrieval-augmented large language model-focused and knowledge-centric accounts further refine these taxonomies by integration layer (input/intermediate/output), training regime (training-free vs. joint), and knowledge-selection and alignment strategies [7,8]. Domain syntheses in legal technology and business information extraction illustrate application-specific pipelines and, importantly, surface fragmented evaluation practices that complicate cross-paper comparison [9,10]. These surveys establish scope and terminology and motivate closer attention to evaluation comparability.
From an information-retrieval perspective, RAG pipelines comprise pre-retrieval (indexing and query formulation), retrieval (search and ranking), and post-retrieval (re-ranking and filtering) stages [11]. Methods span sparse, dense and hybrid retrievers; chunking strategies and approximate nearest neighbour (ANN) indexing trade off efficiency against coverage, and overall effectiveness is strongly associated with retrieval quality [2,5,7,12]. Recent work also emphasises contextual compression—semantic or prompt compression, efficient attention, and retriever-based compressors—to reduce irrelevant tokens while preserving essential evidence, balancing compression ratio against fidelity [13]. In practice, these choices shape not only effectiveness but also index size, latency budgets, and serving costs.
Beyond input-level concatenation [1], integration includes query-level (input), latent (intermediate), logit-level, and speculative schemes; attention-based fusion, data augmentation, skeleton-editing, and post-editing have been widely explored [2,5,14]. Building on these, iterative and adaptive retrieval loops, routing, and modular augmentation report improvements over single-pass pipelines on QA and other knowledge-intensive tasks [3,11,12,14]. Work on “agentic” RAG introduces reflection, planning, tool use, and multi-agent coordination; studies note gains in context integration and reasoning, alongside orchestration complexity and evaluation challenges [15]. Overall, emphasis is shifting from a single retrieval step to policy-driven control over whether, when, and what to retrieve.
GraphRAG formalises a three-stage workflow that includes graph indexing; graph-guided retrieval over nodes, triples, paths, and subgraphs; and graph-enhanced generation. This workflow aims to capture relational structure that flat-text retrieval can miss [16]. Complementary accounts identify subgraph selection as a core bottleneck, and most evaluations remain QA-style [17]. A recent GraphRAG survey details knowledge organisation taxonomies, retriever families (similarity, logical, GNN-based, LLM-based, RL-based) and integration granularities, reporting token savings and emerging tooling ecosystems [18]. In parallel, multimodal RAG extends retrieval and fusion to images, video, audio and documents, cataloguing cross-modal methods and metrics [19]; a vision-centric review covers recognition, visual QA, video understanding, generation, and embodied planning, noting latency and alignment constraints in practical systems [20]. These strands broaden RAG beyond flat text and introduce alignment and latency constraints that interact with indexing choices.
Recent reviews and surveys consolidate evaluation along three axes: relevance of retrieved context, correctness and faithfulness of answers, and citation quality. Common measures include retrieval metrics (e.g., Recall@k, MRR@k), reference-based scores, and LLM-as-judge protocols. Datasets range from open-domain QA to synthetic, reference-free protocols (e.g., RAGAS, ARES), with growing attention to latency and robustness [21,22]. Trust-orientated accounts organise risks around factuality, robustness, fairness, transparency, accountability, and privacy, highlighting retrieval-specific failure modes that stem from conflicting evidence, noisy or poisoned corpora, and prompt injection, and which call for unified standards for robustness, privacy, and fairness benchmarks tailored to RAG [23,24]. Notably, retrieval-stage gains do not always translate to better generated answers, underscoring the need for joint, pipeline-level evaluation.
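The retrieval diagnostics mentioned above can be made concrete with a minimal sketch (the document IDs below are hypothetical; real evaluations average these values over many queries):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mrr_at_k(retrieved, relevant, k):
    """Reciprocal rank of the first relevant document within the top k (0 if none)."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d7", "d1", "d9"]  # ranked retriever output
relevant = {"d1", "d2"}               # gold relevant set for the query
print(recall_at_k(retrieved, relevant, 3))  # 0.5 (one of two relevant docs in top-3)
print(mrr_at_k(retrieved, relevant, 3))     # 0.333... (first hit at rank 3)
```

The per-query values are then macro-averaged over the evaluation set to obtain the Recall@k and MRR@k figures typically reported.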
Across these strands, improvements over the baseline [1] focus on stronger retrieval (hybrid, structured, multimodal), richer integration (latent, logit-level, speculative), and adaptive pipelines (iterative or agentic designs). In parallel, evaluation and trustworthiness have emerged as co-equal design axes alongside quality and cost, shaping how RAG systems are compared, stress-tested, and deployed in practice [2,11,14,22,24].

3. Methodology

This review was conducted in accordance with the PRISMA 2020 guidelines [25]; Figure 1 illustrates the study selection process.

3.1. Systematic Review Framework Selection

The PRISMA 2020 guidelines provide an extensive framework for systematic reviews, especially suitable for multidisciplinary fields such as RAG. These guidelines emphasise updated methodological standards, including the synthesis of findings, the assessment of study biases, and the inclusion of various study designs. In contrast, Kitchenham’s guidelines [26], designed specifically for software engineering literature reviews, do not offer the necessary interdisciplinary breadth required for RAG research. Similarly, Evidence-Based Software Engineering (EBSE) [26] focuses primarily on the application of evidence-based principles to software engineering and does not adequately address the broader theoretical and application-based questions relevant to RAG. Therefore, PRISMA 2020 is valuable for facilitating the synthesis of various study methodologies and goals, which aligns well with the evolving and interdisciplinary nature of RAG research.

3.2. Database Selection

We used four major databases and the DBLP bibliographic index to ensure broad coverage and deduplication. These five sources were selected for their comprehensive repositories and relevance to RAG research.

3.3. Inclusion and Exclusion Criteria

We applied eligibility criteria to ensure the review captured recent and influential RAG research (2020–2025). This period begins with the introduction of the Meta AI framework [1], a key milestone in natural language processing (NLP) research. Our selection includes studies that explicitly address the RAG framework or explore systems with similar functionalities, ensuring that our review comprehensively captures the latest innovations in this area.
  • Inclusion Criteria:
  • Focus: Studies must address RAG or similar systems that rely on retrieval to support text output.
  • Publication Date and Citations: Only works from January 2020 to May 2025 are accepted. For 2025 publications, a minimum of 15 citations is required; for those from 2024 or earlier, at least 30 citations are needed.
  • Original Contributions: Only works that present new experimental data or fresh ideas are considered.
  • Input and Output: Studies may use various input types (e.g., text, images, audio) if retrieval is central, but the final output must be text.
  • Exclusion Criteria:
  • Relevance: Works that do not pertain to the topic are removed.
  • Language: Studies not published in English are excluded.
  • Duplicates and Access: Duplicate works or those with unavailable full text are omitted.

3.4. Search Strategy and Terms

We based our search terms on the core concept of the RAG framework by breaking down “retrieval augmented generation” into three parts: “retrieval”, “augmented”, and “generation”. These parts became the basis for our search terms used in titles, abstracts, and keywords. Table 2 lists the detailed queries for each database. Our systematic approach, combining the main keywords with related phrases such as “retrieval augmented text generation”, gathered a wide range of relevant literature on RAG.

3.5. Search Process

We queried five well-established resources—the ACM Digital Library, IEEE Xplore, Scopus, ScienceDirect, and the DBLP bibliographic index—to collect relevant articles. All sources were last searched on 13 May 2025. Citation counts used to evaluate eligibility thresholds were collected on 13–14 May 2025. Results were exported in BibTeX, CSV, or Excel formats, as provided by the sources. A Python 3.12 script converted BibTeX to Excel and consolidated titles, abstracts, publication years, authors, author counts, and venues into a single table. Duplicates were automatically removed by the script and then verified manually. All scripts used for deduplication and export will be made available during peer review and prior to publication.
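The deduplication step can be sketched as follows. This is a minimal, hypothetical illustration of title-based matching, not the authors' actual script (which is to be released separately); it normalises titles before comparison so that trivial variants collapse to one record:

```python
import re

def normalise(title):
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def deduplicate(records):
    """Keep the first record seen for each normalised title."""
    seen, unique = set(), []
    for rec in records:
        key = normalise(rec["title"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical example: the same paper indexed by two sources
records = [
    {"title": "Retrieval-Augmented Generation for NLP", "source": "Scopus"},
    {"title": "retrieval augmented generation for NLP",  "source": "DBLP"},
]
print(len(deduplicate(records)))  # 1
```

Automated matching of this kind is deliberately conservative, which is why the paper reports a subsequent manual verification pass.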

3.6. Screening and Study Selection

Articles were screened against a set of inclusion and exclusion criteria linked to our research questions. Missing abstracts were retrieved from the original databases and manually added. Following PRISMA guidelines, one reviewer handled initial screening, full-text review, and data extraction, while a second reviewer independently checked the results to reduce bias. This dual-review method strengthens the review’s reliability. The process, illustrated in Figure 1, consisted of initial screening followed by full-text review.

3.6.1. Initial Screening

After removing duplicates and applying date and citation filters, two of the present authors (R1, R2) independently screened all titles and abstracts (n = 202). Each record was labelled 1 (include) or 0 (exclude) against the predefined eligibility criteria (Section 3.3). To aid, but not replace, human judgement, we provided both reviewers with LLM-generated suggestions from deepseek-ai/DeepSeek-R1-Distill-Llama-70B; final decisions remained entirely with the reviewers. The agreement between the authors on independent pre-adjudication labels was calculated at the record level using Cohen’s κ: observed agreement Po = 0.946, expected agreement Pe = 0.573, yielding κ = 0.873. The 2 × 2 contingency table is provided in Appendix A, Table A1. Disagreements were resolved through discussion.
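The reported agreement statistic follows the standard Cohen's κ formula; a minimal sketch using the values above:

```python
def cohens_kappa(po, pe):
    """Chance-corrected inter-rater agreement: kappa = (Po - Pe) / (1 - Pe)."""
    return (po - pe) / (1.0 - pe)

# Observed and expected agreement as reported for title/abstract screening
kappa = cohens_kappa(0.946, 0.573)
print(round(kappa, 3))  # ~0.874; the reported 0.873 reflects unrounded Po and Pe
```

Small differences at the third decimal place are expected because Po and Pe are themselves rounded in the text; values above roughly 0.8 are conventionally read as almost-perfect agreement.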

3.6.2. Full-Text Screening

Full texts were retrieved from the original sources indexed by our selected databases and the DBLP bibliographic index. During full-text screening, we applied a quality assurance protocol assessing soundness, validity, reliability, and statistical rigour to ensure the inclusion of only high-quality studies. Each article was evaluated against the predefined inclusion and exclusion criteria, covering the scope and methodological robustness of the study, and was coded ‘0’ for exclusion or ‘1’ for inclusion. Moreover, we encountered challenges concerning the interchangeable use of terms such as RAG, retriever + reader models, and retrieval-augmented LLMs. To address these challenges, we concentrated on clearly differentiating the retriever and generator components, thereby streamlining the analysis while ensuring a comprehensive understanding of the fundamental elements.

3.7. Data Extraction

Data extraction and management were handled using Google Sheets for organising data and EndNote for managing references. The data extracted from the articles were compiled into a structured database designed for easy access during subsequent analysis, synthesis, and reporting. Each entry was verified against the original articles to identify and correct any discrepancies, such as mismatched values or missing information. All coded variables, their operational definitions, and the raw study-level entries are publicly available via Zenodo (DOI: 10.5281/zenodo.17339384; uploaded 13 October 2025).
After verification, the data were synthesised to address the research questions of the systematic review. The synthesis used methods suited to the nature of the data and the review objectives, primarily through a descriptive approach that summarised and explained the data patterns by identifying trends, differences and similarities between studies. This method enabled us to draw meaningful conclusions from the diverse data collected during the review.

3.7.1. Data Extraction Methodology: Domains, Specific Tasks, Technique and Results

The data extraction process followed our research question and eligibility criteria, focusing on topics, methods, and evaluation metrics. It recorded details such as Domain Area, which defines the field addressed by each study. For datasets, both public and private sets were included. The framework and components of the RAG system were documented, listing the “Retrieval Mechanism”, “Chunking Mechanism”, “Vector Space Encoder”, and “Generation Model” while excluding any components not mentioned in the paper. All data were organised in a workbook under clear headings for easy access and analysis, providing complete coverage for detailed review.
A single reviewer, assisted by a RAG framework, extracted the data, which was then checked for accuracy and reliability. The framework treated each article as a separate knowledge source, queried for the specific data required. This approach simplified the review process, offered a means of verifying details, and helped keep data collection complete and consistent with the research criteria and objectives.
However, the RAG framework poses two major challenges. The first is the risk of hallucination, where the system may generate information that does not exist. The second is that key data might be absent from the retrieved passages. Despite the framework’s benefits in improving speed and precision, these issues call for careful cross-checking of the extracted data to maintain its authenticity and reliability. Addressing these challenges is essential to preserve the integrity of the data extraction process.

3.7.2. Dataset Identification Methodology

We systematically examined all studies in this review and used citation tracking to identify and extract relevant datasets. The extracted information was organised in Google Sheets to form a structured, navigable database, supporting effective analysis and comparison and ensuring that the most relevant and widely used resources were included.
Each entry lists its source reference, full official name and common abbreviation, content overview, intended use, and citation frequency, allowing researchers to assess scope and suitability. The number of papers using each dataset indicates its impact and adoption in the field.
Table A3 (Appendix C) lists each dataset with the extracted fields: dataset name; content description, including details such as the number of questions; intended use, described either in the dataset authors’ own terms or as a high-level overview; and citation frequency, i.e., the number of reviewed papers that mention the dataset.

3.8. Use of Generative AI

Generative AI was used in two constrained ways: (1) to provide five independent inclusion/exclusion suggestions per record during title/abstract screening using deepseek-ai/DeepSeek-R1-Distill-Llama-70B, aggregated by majority vote and presented to human reviewers; and (2) to assist data extraction by querying article text within a retrieval-augmented framework. In all cases, human reviewers made final screening decisions and manually verified extracted fields against the source articles.
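The majority-vote aggregation of the five LLM suggestions can be sketched in a few lines (an illustrative reconstruction; the paper does not publish its aggregation code):

```python
def majority_vote(suggestions):
    """Aggregate independent 0/1 include/exclude suggestions into one advisory label.

    Returns 1 (suggest include) when more than half of the suggestions are 1.
    """
    return 1 if sum(suggestions) > len(suggestions) / 2 else 0

# Five independent LLM suggestions for one record: three of five say "include"
print(majority_vote([1, 0, 1, 1, 0]))  # 1 (advisory only; reviewers decide)
```

With an odd number of voters (five here) a tie is impossible, so the advisory label is always well defined.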

3.9. Potential Biases and Mitigations

This review aimed to minimise potential biases in the search, selection, and data extraction processes. All databases returned full English texts, so no language-based exclusions were required. Using five major and complementary databases reduced coverage bias and ensured that key studies were captured. Restricting the review to 2020–2025 could introduce a time-window bias by favouring earlier, highly cited papers; however, this period was chosen to align with the emergence of modern RAG frameworks and sensitivity checks confirmed that including newer, lower-cited papers did not change the overall conclusions.
To assist reviewers, AI tools were used only in a supportive role. During screening, inclusion and exclusion suggestions generated by the DeepSeek-R1 model were presented as guidance, but all final decisions were made by human reviewers. The same applies to data extraction, where a retrieval-augmented framework helped identify relevant text, but all extracted data were verified manually against the source articles. These procedures, combined with dual-review checks and documented verification, were implemented to reduce selection and extraction bias and maintain transparency throughout the review.

4. Results

We searched the ACM Digital Library, IEEE Xplore, Scopus, ScienceDirect, and DBLP on 13 May 2025, covering studies published from 2020 to 2025. We identified 4721 records. After removing duplicates (1494), out-of-range records (158), and records below the citation threshold (2867), 202 records remained for title/abstract screening. Of these, 58 were excluded, and 144 reports were sought and retrieved for full-text assessment (0 not retrieved). We assessed 144 reports for eligibility and excluded 7 for irrelevance of primary focus, 7 for ancillary or insufficient emphasis on RAG, and 2 for methodological mismatch. In total, 128 studies were included. The PRISMA flow diagram (Figure 1) presents the selection process.
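The arithmetic of the selection flow above can be reconciled directly from the reported counts:

```python
identified = 4721
duplicates, out_of_range, below_threshold = 1494, 158, 2867

screened = identified - duplicates - out_of_range - below_threshold
assert screened == 202            # records entering title/abstract screening

sought = screened - 58            # minus title/abstract exclusions
assert sought == 144              # reports assessed for eligibility (all retrieved)

included = sought - (7 + 7 + 2)   # full-text exclusions, by reason
assert included == 128            # studies in the final synthesis

print(screened, sought, included)  # 202 144 128
```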

4.1. Excluded Studies

Following title and abstract screening, 144 candidate records were retrieved in full and assessed against the predefined inclusion criteria. Sixteen were excluded during full-text screening, for the reasons categorised below:
  • Irrelevance of Primary Focus (n = 7): Papers whose primary contributions lay outside retrieval-augmented generation, e.g., robustness of dense retrieval, long-context benchmarks, general GenIR evaluation, or system-level optimisations, where RAG appeared only as a peripheral baseline or illustrative example [27,28,29,30,31,32,33].
  • Insufficient Emphasis or Ancillary Treatment (n= 7): Studies that incorporated RAG merely as an auxiliary component within broader investigations—such as LLM-human hybrids for marketing research, domain-specific LLM development, knowledge graph construction workflows, multimodal agent toolkits, healthcare task automation, cost-effective classification or materials modelling pipelines—without substantive and dedicated analysis of RAG itself [34,35,36,37,38,39,40].
  • Methodological Distinction (n= 2): Works focused on conceptually distinct paradigms from RAG, specifically generative retrieval or generation-augmented retrieval, which invert the standard RAG pipeline by predicting document identifiers rather than conditioning the generation on the retrieved content [41,42].
All exclusion decisions were systematically documented to ensure methodological rigour, transparency, and reproducibility.

4.2. Yearly Distribution of Identified Articles

Across 2020–2025, the number of identified articles increased year on year from 2020 to 2023, with a pronounced increase in 2024. As of 13 May 2025, the count for 2025 is lower because the year is incomplete. Figure 2 visualises the annual distribution; year-specific totals are listed in Table A2.
These counts reflect the records that remained after deduplication and application of the eligibility criteria (Section 3.3), including the citation thresholds (≥30 for publications up to 2024; ≥15 for 2025). Consequently, year-to-year comparisons should be interpreted in light of (i) the staged indexing of databases and (ii) the partial coverage of 2025 at the time of the last search.

4.3. Domain Characteristics of Included Studies

Studies were coded to a single primary domain for proportional reporting; secondary tags (e.g., multimodal, conversational) were retained for analysis but not double-counted. Coding rules and examples are shown in Table A2. The proportions below refer to the included studies (Figure 3).
These proportions show that the current evidence base is anchored in tasks where factual grounding can be measured cleanly, with nearly half of all studies concentrating on knowledge-intensive and open-domain QA settings. This concentration is useful: it provides the most stable ground for comparing methods and reporting effect sizes. It also sets expectations—findings from this literature are most readily transferable to knowledge-centric workflows, while claims about broader applicability should be made cautiously.
The presence of software engineering and medical applications suggests an early movement toward specialised, real-world use. However, coverage is still uneven, which limits how confidently methods can be lifted from benchmark QA and dropped into domain-specific pipelines without additional tuning. In areas with only scattered studies (e.g., finance, education, security, biomedical), evaluation practices are heterogeneous and often under-specified, making cross-paper comparisons fragile and highlighting the need for clearer task–metric pairing and consistent reporting of system-level outcomes.
Methodologically, the domain skew also explains why many reported gains emphasise retrieval–generation mechanics (e.g., retriever choice, reranking, context shaping) over deployment concerns. Where studies do report system metrics, improvements tend to be demonstrated under data-rich, well-indexed conditions; evidence is thinner for low-resource or safety-critical contexts where privacy, governance, or drift matter as much as accuracy. Overall, the distribution clarifies both strength and scope: robust comparative signals in knowledge-centric QA, promising but still fragmented evidence in applied domains, and clear gaps where consistent metrics and broader operating conditions remain to be established.
The mix of domains and the datasets most frequently used in the 128 studies (notably corpora derived from NQ, HotpotQA, TriviaQA, MS MARCO, and Wikipedia) suggests a literature optimised for benchmark achievement. Performance metrics such as EM, F1, and related QA scores are used regularly across studies, whereas quantified system characteristics, namely cost per query and end-to-end latency (p50/p95), are seldom reported in a standardised way. This benchmark-centric focus is useful for isolating method effects under controlled conditions, yet it limits comparability on the practical trade-offs that govern deployment. Future results should pair accuracy-style top-lines (value, Δ abs., Δ rel.) with deployment-relevant reporting of cost and latency, with clear definitions of token-based cost calculation, the hardware used, and whether retrieval is included in latency.
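For reference, the overlap metrics dominant in this literature can be sketched as follows. This is a simplified version of SQuAD-style scoring; official evaluation scripts additionally strip articles and punctuation during normalisation:

```python
def exact_match(pred, gold):
    """1 if the normalised prediction equals the normalised reference, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-overlap F1 between a predicted and a reference answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                    # 1
print(round(token_f1("in Paris France", "Paris"), 2))   # 0.5
```

The gap between EM and F1 in the second example (EM = 0, F1 = 0.5) illustrates why both are routinely reported together: F1 credits partially correct answers that EM scores as outright failures.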

5. Discussion

This review synthesises a rapidly diversifying RAG landscape. In addressing RQ1–RQ4, we find that the research focuses on knowledge-intensive and open-domain QA (ODQA) tasks, with notable activity in the software and medical domains. Methodologically (RQ2), the field is shifting from the canonical DPR + seq2seq baseline to hybrid and structure-aware retrieval, uncertainty-triggered and iterative loops, and agentic orchestration that treats retrieval as a policy decision. Evaluation practice (RQ3) still leans toward overlap metrics, but retrieval metrics, human judgements, and LLM-as-judge protocols are increasingly necessary to assess faithfulness and grounding. Persistent challenges (RQ4)—cost/latency, domain shift and dataset freshness, error cascades across modular pipelines, and security threats on the retrieval side—define the priorities for the next phase of RAG research and deployment.

5.1. What Are the Key Topics That Are Already Addressed in RAG?

Across highly cited RAG studies, the topics most consistently addressed fall into five tightly linked themes: how evidence is surfaced (retrieval design), how it is stored and served (vector indexes and databases), how inputs are shaped into retrievable units (chunking), how relevance is represented (encoders), and how generators actually consume retrieved context (generation patterns). Figure 4 summarises these themes as a layered RAG stack and provides a visual reference for the sections that follow. The centre of gravity remains knowledge-intensive and ODQA tasks, with notable depth in code and clinical domains. These strands depict a field moving from single-pass, single-index pipelines to modular stacks where retrieval, representation, and generation are coordinated rather than bolted together (cf. Section 5.1.1, Section 5.1.2, Section 5.1.3, Section 5.1.4, Training, and Generation Model).
These takeaways carry conditions. The evidence to date is concentrated in ODQA and a handful of high-signal domains (software, medicine); portability depends on corpus quality, refresh cadence, and the stability of chunking/metadata conventions. Many gains reflect corpus-specific tuning (e.g., chunk size, hybrid weights) and may require re-calibration elsewhere; infrastructure decisions (index type, encoder refresh) set real cost/latency ceilings that bound applicability at scale. The detailed analyses that follow (retrieval mechanism, vector databases, chunking, encoder families, training, and generation) provide the empirical footing for this synthesis.

5.1.1. Retrieval Mechanism

Retrieval-augmented generation systems all depend on an external retriever to select relevant context for the language model. The mechanisms surveyed fall into five interrelated categories.
Sparse term-based methods (e.g., BM25) remain vital for their efficiency and interpretability, yet they struggle with semantic gaps that limit recall [43]. Dense retrievers, built on dual-encoder networks such as DPR, map queries and documents into continuous vector spaces and leverage maximum inner-product search for semantic matching [1]. Hybrid approaches combine sparse pruning of candidates with dense re-ranking to balance recall and precision across domains [44].
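For illustration, the sparse-prune/dense-rerank pattern can be sketched in a few lines. The term-overlap scorer below is a crude stand-in for BM25, and the toy corpus, embeddings, and function names are illustrative rather than drawn from any surveyed system:

```python
import numpy as np

def sparse_candidates(query_terms, docs, k=3):
    """Term-overlap pruning (a crude stand-in for BM25): keep the top-k docs
    by the number of terms shared with the query."""
    scored = [(i, len(set(query_terms) & set(d.split()))) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: -s[1])
    return [i for i, _ in scored[:k]]

def dense_rerank(query_vec, doc_vecs, candidates):
    """Re-rank only the pruned candidates by cosine similarity of dense embeddings."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates, key=lambda i: -cos(query_vec, doc_vecs[i]))

docs = [
    "rag retrieval augmented generation",
    "dense passage retrieval dpr",
    "cooking pasta recipes",
    "bm25 sparse retrieval baseline",
]
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(4, 8))
query_vec = doc_vecs[1] + 0.1 * rng.normal(size=8)  # query embedding near doc 1

cands = sparse_candidates(["dense", "retrieval"], docs, k=3)   # sparse pruning
ranked = dense_rerank(query_vec, doc_vecs, cands)              # dense re-ranking
```

The sparse stage keeps recall cheap over the full corpus; the dense stage spends its cost only on the surviving candidates, which is the efficiency argument behind hybrid pipelines.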
Encoder–decoder query generators reformulate inputs, especially conversational or multi-hop questions, into standalone search queries, improving recall at the cost of added latency [45]. Re-ranking and corrective modules (e.g., CRAG) apply lightweight evaluators or preference-aligned models to reorder the initial top-k results, mitigating noisy retrievals and aligning outputs with downstream generation needs [46].
By organising passages or entities into knowledge graphs, graph retrieval methods extract the sub-graphs or paths most relevant to a query. Prize-collecting Steiner tree formulations yield coherent multi-hop contexts with explicit reasoning chains, albeit at significant computational cost for large graphs [47,48].
Iterative frameworks interleave retrieval and generation: LLM outputs refine subsequent queries, progressively bridging semantic gaps in complex tasks [49]. Although this feedback loop improves multi-step reasoning, it incurs increased latency and requires careful stopping criteria to prevent error propagation [50].
Specialised retrievers adapt core architectures to different data modalities or domains, such as code snippet retrieval by edit distance scoring [51], multimodal CLIP-based retrieval for image captioning [52], or clinical report retrieval using vision language embeddings [53]. These systems achieve high task relevance but demand bespoke engineering and corpus maintenance.
These mechanisms illustrate a landscape where advances in semantic embeddings, input optimisation, structural reasoning, adaptive feedback, and domain adaptation coalesce to enrich the context supplied to LLMs. Each category presents a distinct trade-off between efficiency, scalability, interpretability, and domain generality, highlighting open avenues for unified, explainable, and resource-efficient retrieval in future RAG research.
Compared with earlier IR+reader baselines and the canonical DPR + seq2seq setup, recent work converges on hybridisation (sparse+dense) and greater use of structure (graphs, schema-aware chunking), while diverging along two practical axes: domain-specialised components versus general-purpose stacks, and encoder–decoder fusion versus lightweight adapters on decoder-only models. This reframes classical IR knobs, such as index choice, chunking and query formulation, from “preprocessing” to primary design variables that directly affect recall, precision, and faithfulness.

5.1.2. Vector Database

The vector database is fundamental to RAG, enabling fast similarity searches over dense embeddings through approximate nearest neighbor (ANN) techniques such as hierarchical navigable small world (HNSW) graphs and FAISS-based flat or inverted indices, which achieve sub-millisecond Maximum Inner Product Search (MIPS) performance in production settings but must negotiate accuracy–latency trade-offs and memory footprint constraints [1,45,54]. Research has extended these core indexing methods to distributed and dynamic environments, employing GPU-sharded indices and cloud-native services like Pinecone to ingest and serve millions of vectors across training and inference pipelines; however, synchronization latency, update throughput, and cost-efficiency remain pressing concerns [55,56]. Concurrently, specialist vector stores have emerged, tailored for domain-specific applications such as code retrieval (e.g., RepoCoder), biomedical concept embeddings (Chroma), financial knowledge bases, and multimodal memory systems (MuRAG, Re-ViLM), in order to address the unique representational, alignment, and privacy demands of specialized data [50,57,58]. Finally, managed vector database offerings integrated via frameworks such as LangChain, LlamaIndex, Weaviate, and Qdrant have streamlined deployment in commercial RAG pipelines, albeit at the expense of potential vendor lock-in, hybrid architecture complexity, and unpredictable operational costs [59,60].
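As a point of reference for what HNSW and IVF indices approximate, exact MIPS over a flat index is a single matrix product followed by a top-k selection. The corpus size, dimensionality, and query construction below are arbitrary illustrative values, not parameters from any surveyed system:

```python
import numpy as np

def mips(query, index, k=5):
    """Exact maximum inner-product search over a flat index: the ground truth
    that ANN structures (HNSW, IVF) approximate for speed.  Returns the ids of
    the k highest-scoring vectors and their scores, best first."""
    scores = index @ query                  # one inner product per stored vector
    top = np.argpartition(-scores, k)[:k]   # unordered top-k in O(n)
    top = top[np.argsort(-scores[top])]     # order the k winners
    return top, scores[top]

rng = np.random.default_rng(1)
index = rng.normal(size=(10_000, 64)).astype(np.float32)  # toy corpus embeddings
query = 0.9 * index[42]                                   # query aligned with doc 42
ids, scores = mips(query, index, k=3)
```

The accuracy–latency trade-off discussed above is exactly the gap between this exhaustive scan and an ANN index that visits only a fraction of the vectors while accepting a small chance of missing the true top-k.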
Despite the maturity of these infrastructures, several cross-cutting research gaps persist. Notably, adaptive indexing algorithms capable of real-time inserts and deletes without degrading search performance are under-explored, while cost-aware scaling strategies that balance query latency against infrastructure expenditure remain scarce. Moreover, ensuring seamless interoperability across heterogeneous vector database services and embedding formats presents an ongoing challenge and a fertile avenue for future RAG innovation.

5.1.3. Document Chunking

Document chunking, the decomposition of large inputs into smaller retrievable units, is a critical preprocessing step in RAG. Highly cited studies have converged on four principal approaches:
Static fixed-length chunking. Early RAG architectures adopt uniform, size-bounded splits to simplify indexing and conform to transformer context limits. Common configurations include 100-word segments [1,61], fixed-size 64-token chunks (with optional 32-token flexible intervals) [62], and approximately 600-character spans [63]. These static splits require minimal linguistic preprocessing and integrate readily with vector stores (e.g., FAISS), but frequently bisect semantic units, resulting in context loss and a trade-off between index growth (for smaller chunks) and retrieval precision (for larger ones).
Semantic boundary-aware splitting. To preserve discourse coherence, subsequent work aligns chunk boundaries with the inherent structure of the text. Techniques include sentence-level chunking, where each sentence becomes a chunk [64], and paragraph-level segmentation, merging short paragraphs and truncating overly long ones [54]. More advanced methods leverage hierarchical section markers (e.g., PDF sub-sections) to define semantically coherent units [65,66]. These approaches mitigate fragmentation and often improve retrieval relevance, at the cost of additional preprocessing complexity and the absence of standardised coherence metrics.
Domain and modality specific chunking. Recognising that different types of data exhibit unique structures, specialised chunking strategies have been developed:
  • Source code: partitioning by function or Code Property Graph nodes to capture logical code blocks [51,67].
  • Knowledge graphs: aggregating graph triples into textual statements for embedding [68].
  • Legal documents: breaking cases into (question, snippet, entity, answer) tuples [69].
  • Biomedical texts: micro-chunking into fixed five-token units to capture fine-grained concepts [70].
  • Multimodal inputs: splitting image–text pairs into aligned patches or entries for vision–language RAG [58].
These tailored pipelines yield superior performance within their target domains, but require manual configuration and do not generalise easily across new data types.
Adaptive dynamic chunking. The most recent research line seeks to automate chunk-size and overlap selection based on query characteristics or retrieval performance. Representative techniques include sliding windows (for example, 1000-token windows with 200-token overlaps in LangChain [71], fixed-size 1200-token chunks with dynamic overlap [72]), automated parameter search for domain-specific corpora (e.g., clinical notes [56]), and half-stride overlapping to balance novelty and context continuity [73]. Adaptive methods aim to integrate the benefits of static, semantic, and domain-specific approaches, yet remain largely experimental, facing challenges in hyperparameter optimisation, runtime overhead, and cross-domain robustness.
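A minimal sketch of the sliding-window scheme follows, using the 1000-token window and 200-token overlap reported for LangChain-style splitters [71]; the tokeniser is abstracted away and the function name is illustrative:

```python
def sliding_window_chunks(tokens, window=1000, overlap=200):
    """Split a token sequence into overlapping windows with stride
    window - overlap, so consecutive chunks share `overlap` tokens."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    stride = window - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):  # final window reaches the end
            break
    return chunks

tokens = list(range(2500))                 # stand-in for tokenised text
chunks = sliding_window_chunks(tokens, window=1000, overlap=200)
```

The overlap ensures that a fact straddling a window boundary appears whole in at least one chunk, at the price of a proportionally larger index, which is the trade-off adaptive methods attempt to tune automatically.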
Over time, RAG document chunking has evolved from simple one-size-fits-all splits to sophisticated, context and domain-aware pipelines. Static segmentation offers scalability but suffers semantic fragmentation; semantic boundary methods enhance coherence but add preprocessing costs; domain-specific chunkers exploit structural priors at the expense of generality; and adaptive strategies promise end-to-end automation but require further validation. Future work should establish standardised coherence benchmarks, develop unified frameworks that dynamically leverage linguistic and domain signals, and evaluate scalability in large-scale RAG deployments.

5.1.4. Vector Encoders

In RAG systems, the vector space encoder projects both user queries and document chunks into a shared high-dimensional embedding space, enabling efficient similarity-based retrieval. Influential RAG studies fall into three principal paradigms:
Sparse encoders. Traditional IR techniques convert text into high-dimensional sparse vectors of term weights (e.g., TF-IDF, BM25), scoring relevance via inverted indices. BM25 remains a robust baseline, often combined with dense methods to increase recall in open-domain QA and hybrid pipelines [47,61]. In specialised settings such as code and graph retrieval, additional sparse schemes are common. For example, one can measure the overlap between the query and a language-specific token inventory (e.g., Java identifiers, keywords, and API names), or employ weighted n-gram counts to capture local lexical structure. These approaches deliver low latency and scale efficiently but provide limited deep semantic modelling; in RAG, they are therefore used primarily as recall-orientated components that are complemented by dense retrieval and/or learnt re-ranking [50,73].
Dense encoders. Deep learning-based dense encoders map inputs to continuous embeddings that capture contextual and semantic nuances:
  • Transformer-based bi-encoders. Frameworks such as DPR, ANCE, REALM, ORQA, and dual encoder BERT variants embed queries and passages separately, optimising retrieval metrics (Recall@k, MRR) through end-to-end fine-tuning [45,74].
  • Sentence and paragraph embeddings. Models such as Sentence-BERT, MPNet, paraphrase-mpnet-base-v2 and Contriever produce fixed-length vectors for larger text spans, improving semantic similarity on standard benchmarks [75,76,77].
  • Foundation & specialised models. API-driven encoders (e.g., text-embedding-ada-002, text-embedding-3-small/large) and proprietary systems (Dragon, E5, BGE) deliver broad coverage with minimal tuning [74,78,79]. Domain-adapted variants, MedLLaMA-13B for biomedicine [70], PubMedBERT for clinical language [57], CodeBERT/CodeT5 for source code, demonstrate versatility in specialised vocabularies [67,80].
Hybrid & multi-modal encoders. To retrieve across heterogeneous sources, modern RAG systems fuse sparse and dense signals or jointly encode multiple modalities.
  • Sparse–dense hybrids. Elastic Learnt Sparse Encoder (ELSER) integrates learnt sparse representations with dense sentence embeddings, balancing latency and recall [81].
  • Vision–language models. CLIP (text + image), LXMERT, ALBEF, and temporal deformable convolutional encoders support multimodal retrieval for visual QA and image-based generation [82,83,84].
  • Graph and sequence models. Graph Transformers and Graph Attention Networks embed structured data (knowledge graphs, ASTs) into vector spaces for retrieval-augmented reasoning [48,85].
The selection of encoders in RAG reflects a trade-off among retrieval accuracy, computational efficiency, and domain adaptability. Future work should target out-of-domain robustness, real-time index updates, and unified frameworks that seamlessly integrate sparse, dense, and multimodal representations.
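The retrieval metrics that the bi-encoder frameworks above optimise are simple to compute. The sketch below shows one common formulation of Recall@k and MRR@k with toy rankings; exact definitions vary slightly across the surveyed studies:

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if rel & set(ranked[:k]))
    return hits / len(rankings)

def mrr_at_k(rankings, relevant, k):
    """Mean reciprocal rank of the first relevant doc in the top k (0 if absent)."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Three toy queries: relevant doc found at rank 2, at rank 3, and not at all.
rankings = [[3, 1, 7], [5, 2, 9], [4, 8, 6]]
relevant = [{1}, {9}, {0}]
```

Recall@k answers "did anything useful surface at all?", while MRR additionally rewards placing the first relevant document early, which matters when the generator only attends closely to the top of the context.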
For practitioners, the recurring lessons are straightforward. Hybrid retrieval stabilises recall across domains; index selection (HNSW/FAISS variants, dimensionality) should be matched to corpus size and latency budgets; boundary-aware or domain-specific chunking improves coherence when grounding spans multiple passages, while static chunking remains a robust default for speed; encoder choice should prioritise out-of-domain robustness and update cadence, not only peak scores; and generation architectures should be chosen for the amount of multipassage fusion a task truly needs, not merely model scale.

5.1.5. Training

Training of RAG models has coalesced into five interrelated paradigms, each addressing distinct trade-offs in performance, efficiency, and domain applicability.
Joint end-to-end training optimises retriever and generator components simultaneously by minimising a combined negative marginal log-likelihood loss, often through expectation-maximisation loops that alternate reader and retriever updates [1,55]. Although this yields cohesive retriever–generator alignment and can leverage implicit retrieval supervision, it incurs a high computational cost due to frequent document-encoder refreshes and requires careful weighting of retrieval versus generation gradients to avoid collapse of one component [49,86].
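Concretely, the combined objective marginalises the generator likelihood over the top-k retrieved documents; in the notation of the original RAG formulation [1] (retriever parameters η, generator parameters θ, RAG-Sequence form):

```latex
p(y \mid x) \approx \sum_{z \,\in\, \text{top-}k\left(p_{\eta}(\cdot \mid x)\right)} p_{\eta}(z \mid x)\; p_{\theta}(y \mid x, z),
\qquad
\mathcal{L}(x, y) = -\log p(y \mid x).
```

Backpropagation flows through both factors, which is why the document encoder must be periodically refreshed as η drifts during joint training.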
Modular two-stage approaches separate training into two steps: dense retrievers (e.g., DPR) are fine-tuned first, followed by generator tuning. This approach trades end-to-end optimality for pipeline stability and a simplified hyperparameter search [87,88]. Although this separation can ease convergence and allow retrieval-specific objective design, it may lead to suboptimal global coordination and requires additional engineering to integrate retrieval scores during generation.
Parameter-efficient fine-tuning (PEFT) and instruction tuning techniques update only a small subset of model parameters, via low-rank adapters (LoRA), prefix-tuning, or lightweight mapper modules, dramatically reducing GPU memory requirements while preserving downstream performance [85,89]. These methods have been successfully applied in financial forecasting (e.g., StockGPT) and clinical QA, yet remain sensitive to adapter rank, learning-rate schedules, and the diversity of instruction data used [90,91].
Specialised training objectives augment standard cross-entropy with contrastive losses (to distinguish relevant from irrelevant documents), self-critical sequence training (SCST) for sequence-level rewards, and analogy- or style-aware losses to capture higher-order relations or lexical emphasis [58,92]. Such multi-objective schemes can yield significant gains in task-specific metrics (BLEU, CIDEr, accuracy) but introduce additional hyperparameter tuning complexity and obscure training dynamics.
Domain and modality specific adaptation tailors RAG pipelines to code (e.g., RepoCoder in large codebases [50]), vision–language (e.g., Re-ViLM’s gated cross-attention for radiology reports [83]), and specialised legal or biomedical corpora [69,70]. Although these systems achieve state-of-the-art benchmark results, they face challenges in data scarcity, overfitting, modality alignment, and cross-domain generalisation.
Collectively, these training paradigms illustrate the field’s evolution from monolithic joint optimisation to modular, resource-aware and domain-focused strategies, each of which presents open problems in objective balance, compute efficiency, and transferability that continue to drive RAG research forward.

5.1.6. Generation Model

Since its introduction, RAG has evolved from a proof-of-concept dual-encoder retriever paired with an encoder–decoder backbone to a rich landscape of end-to-end retrieval–generation pipelines. The original RAG framework demonstrated that pairing an encoder–decoder generator with an open-domain retriever markedly improved question answering over purely generative baselines [93]. Fusion-in-Decoder subsequently refined this approach by fusing multiple retrieved passages via late-stage cross-attention, yielding more coherent multidocument summaries [94]. In parallel, decoder-only models such as RETRO showed that fragment-level retrieval could be interleaved within autoregressive decoding, laying the groundwork for lightweight, scalable RAG in conversational settings [95]. More recent work such as Self-RAG has pushed toward self-supervised alignment of latent retrieval signals, bypassing external supervision and underscoring a trajectory from loosely coupled retriever–generator pairs to fully integrated systems [96].
Beyond open-domain QA and summarization, highly cited studies have extended RAG to specialised and multimodal tasks. In biomedicine, BioGPT applied retrieval-augmented generation to clinical question answering, demonstrating improved factuality on medical benchmarks [97]. Legal research platforms such as Lexis+ AI and Ask Practical Law AI have tailored retrievers to statutory and case law corpora, helping practitioners with contextually grounded legal drafting [98]. Code-centric work, using models like CodeT5 and Codex, retrieves API documentation at generation time, enhancing code synthesis and reducing syntactic errors [99,100]. More recent multimodal RAG approaches (e.g., MiniGPT-4, LLaVA, Qwen2-VL) incorporate image retrieval to support visually grounded question answering, pointing to an expanding modality scope within the RAG paradigm [101,102,103].

5.1.7. Generative Model Families

Since the original RAG paper, model families have proliferated, each contributing distinct architectures, scale points, and fine-tuning strategies that shape retrieval integration:
Anthropic (decoder-only). Claude-3-Opus and Claude-3.5-Sonnet (2024) interleave retrieved context with safety-orientated controls to mitigate hallucinations in conversational QA [104].
BigScience (encoder–decoder). Bloom (2022) provided a multilingual, multisize foundation; RAG adapters later enabled domain-agnostic retrieval experiments atop this family [105].
DeepSeek (decoder-only). DeepSeek-V2-Chat (2024) embeds a lightweight retriever within a proprietary autoregressive backbone, optimising low-latency RAG for chatbots [106].
EleutherAI (decoder-only). GPT-J (2021) and GPT-Neo variants served as open alternatives to evaluate the impact of retrieval on QA without instruction tuning [107].
Google (encoder–decoder & decoder-only). Flan-T5 (base to XXL, 2022) set the standard for cross-attention fusion in summarization and QA [94,108], while PaLM-2 (XXS to 540B, Text-Bison) and the Gemini/Gemma chat series (2023–24) explore retrieval adapters in massive decoder-only contexts [109].
Meta AI (encoder–decoder & decoder-only). BART (2020) pioneered the integration of retrieval through cross-attention [110]. The Llama family (2023–2025) (Llama-1/2/3 in sizes 7B to 70B, with LoRA and quantised variants) illustrates how scale and parameter-efficient fine-tuning affect RAG on conversational and QA tasks [111,112,113,114,115].
Mistral AI (decoder-only). Mistral-7B (2023) and its Instruct, quantised, and Mixtral-8×7B ensemble variants (2024) probe the trade-offs between open-source accessibility, instruction alignment, and retrieval fluency [116,117].
Nomic AI & NVIDIA. GPT4All (2025) offers on-device prototyping for lightweight RAG [118]. NVIDIA’s NeMo GPT-43B (2023) and Llama3-ChatQA (8B/70B, 2024) combine large-scale proprietary pre-training with retrieval-aware objectives for enterprise applications [44,119].
OpenAI (decoder-only). From GPT-2 (2019) through GPT-3/3.5 (2020), ChatGPT/3.5-turbo (2022), to GPT-4/GPT-4o (2023–24), OpenAI models have incrementally embedded retrieval: early work prepended retrieved snippets to the GPT-2 input [120], while GPT-4-turbo (2024) dynamically issues retrieval calls via system prompts [121,122,123].
Qwen-1.5 (decoder-only). The Qwen-1.5 lineup (0.5B to 72B, chat variant) explores multilingual retrieval for both text and code generation [124].
Despite this rich diversity, two broad patterns emerge. Encoder–decoder models (Flan-T5, BART, Bloom) excel at multipassage fusion via cross-attention, making them well suited for tasks demanding precise grounding (e.g., summarization, QA). Decoder-only families (GPT-J to GPT-4, Mistral, Claude) leverage token-insertion or adapter-based retrieval, trading architectural simplicity and inference speed for conversational flexibility. Open challenges persist: the absence of a unified, modality-spanning RAG benchmark suite; systematic evaluation of retrieval noise versus generation fluency; and thorough study of parameter-efficient fine-tuning (e.g., LoRA, quantisation) on RAG outcomes. Addressing these gaps will be critical to guide the next wave of retrieval-augmented generation research.

5.2. What Are the Innovative Methods and Approaches Compared to the Standard Retrieval Augmented Generation?

Recent work on RAG goes beyond the demonstration that “retrieval helps” to make retrieval adaptive, controllable, trustworthy, and efficient. Innovations recur along a coherent arc: plumbing that improves what enters and exits the index (structure-aware chunking, metadata enrichment, long retrieval units, token budgeting and re-ranking); front-end control that treats prompting and query formation as levers (reformulation, multi-query expansion, schema/format prompting, uncertainty-triggered lookups); retrieval that is heterogeneous and structure-aware (hybrid sparse+dense, graph- and relation-first selection); closed-loop behaviours that revise drafts and evidence on demand (uncertainty triggers, verifier-guided regeneration, batch/iterative grounding); persistent context via memory (buffers and user-specific stores); agentic orchestration that sequences tools under a policy; and efficiency/compression techniques that cut tokens, latency, and cost while preserving faithfulness. These ideas recast RAG as a policy-driven system rather than a one-shot component. We next detail how these themes manifest across the pipeline.
Relative to the canonical DPR + seq2seq baseline (single literal query, top-k concatenation), the field converges on three shifts. First, from one-pass to selective and iterative retrieval: models decide when to retrieve (entropy/uncertainty triggers), what to fetch (reformulated or decomposed queries), and how much to pass forward (re-ranking, compression). Second, from a single metric space to hybrid and structured evidence: lexical+dense fusion and graph traversal recover long-tail and multi-hop support that flat passages miss. Third, from “prompt as container” to prompt as control surface: schemas, exemplars, and explicit grounding clauses steer faithfulness as effectively as swapping retrievers, often at lower cost. In short, modern RAG optimises policy and composition as much as architecture.
These gains have conditions. Many methods are calibration-sensitive (entropy thresholds, fusion weights, chunk sizes) and may drift across domains without re-tuning. Hybridisation and verification add latency unless carefully budgeted; graph pipelines depend on entity linking quality; and memory raises lifecycle and privacy governance questions. Security is a first-class risk: retrieval-side poisoning and prompt-in-context attacks can bypass guardrails unless corpora are curated, signed, and filtered at ingest and at serve time. Finally, evaluation remains uneven—few studies jointly report accuracy, cost, robustness, and security. The detailed sections that follow (pre/post-retrieval plumbing, prompting/query control, hybrid and graph retrieval, iterative loops, memory, agentic orchestration, efficiency, and modality expansion) supply the empirical footing for this synthesis.

5.2.1. Pre-Retrieval & Post-Retrieval Stages: The Plumbing That Keeps RAG Watertight

When a clinical chatbot invents a drug dosage, the root cause is often not the language model but a silent pre-processing step that mangled the source PDF. The unglamorous work that happens before the first similarity search and after the hit list comes back, therefore, deserves as much care as fancy retrievers or generators.
  • Pre-Retrieval: How We Feed the Index
Structure-aware chunking. Pipelines now segment along headings, tables and coherent narrative blocks detected by multimodal (vision-text) encoders; on FinanceBench, element-aware chunking achieved 84.4% page-level retrieval accuracy and 53.19% manual Q&A accuracy, outperforming token-only baselines [125].
Metadata enrichment at chunk time. Generate keywords and micro-summaries for each chunk automatically (e.g., with GPT-4) to aid retrieval and avoid manual labelling; element-aware pipelines use these metadata during indexing [125], and retrieval augmentation has substantially increased accuracy in clinical deployments (e.g., GPT-4 from 80.1% to 91.4%) [56].
Curated corpus construction. Restrict retrieval to sentence-level snippets from authoritative clinical guidelines and other public sources; by indexing only such content, domain assistants avoid introducing protected health information and curb hallucinations by grounding answers in vetted guidance [126,127].
Longer retrieval units/chunks. Treat each PDF or cluster of interlinked pages as a long “retrieval unit” (≈4k tokens). This order-of-magnitude reduction in the number of retrieval units (for example, from 22 million to 600 thousand) dramatically lowers the retriever’s workload while preserving or even improving recall; for example, answer-recall@1 increases from 52% to 71% on Natural Questions and answer-recall@2 from 47% to 72% on HotpotQA [128]. LongRAG achieves comparable exact-match performance, EM of 62.7% on NQ and 64.3% on HotpotQA, without additional training [128].
Security at the retrieval interface. Obfuscating code identifiers, applying L2-normalisation to embeddings, and filtering poisoned content position the retriever, not the LLM, as the outer security wall of retrieval-augmented systems [67,129,130].
  • Post-Retrieval: What We Pass to the Model
Re-ranking of retrieved evidence. Employ reciprocal rank fusion or listwise autoregressive rankers to reshuffle retrieved evidence so that the most relevant passage appears first; this yields steady, low-cost improvements in accuracy and comprehensiveness [87,131,132].
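Reciprocal rank fusion itself is only a few lines. The sketch below uses the k = 60 constant from the original RRF formulation, with toy document lists standing in for real BM25 and dense rankings:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse rankings from several retrievers: each document accumulates
    1 / (k + rank) per list it appears in; k = 60 follows the original
    RRF formulation.  Returns doc ids sorted by fused score, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_list = ["d3", "d1", "d5"]   # toy lexical ranking
dense_list = ["d1", "d4", "d3"]  # toy dense ranking
fused = reciprocal_rank_fusion([bm25_list, dense_list])
```

Because only ranks (not raw scores) are fused, RRF needs no score calibration across retrievers, which is why it is a popular low-cost re-ranking default.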
Context reduction and token budgeting. Apply sentence-level context filtering (e.g., FILCO), one-line hints, or fast extractive summaries to reduce token usage while preserving factual accuracy and coherence [133,134,135].
Utility-based passage selection. Employ lightweight utility scorers that decide whether to drop, keep, or even repeat passages; a learnt “bridge” model edits passage IDs dynamically, keeping the prompt short while adapting to LLM preferences [46,136].
Noise-aware inclusion of unrelated passages. When the context allows, inserting a small number of unrelated passages can improve the accuracy of the answer in RAG; one study reports gains of up to 35% when random documents are added to the prompt, with the effect depending on position and count [137].
Early verification with local regeneration. A lightweight verifier LM diagnoses whether errors stem from retrieval (irrelevant knowledge) or grounding (unfaithful use of retrieved knowledge), and triggers only the needed correction (i.e., re-retrieve or regenerate) [138].
Adaptive context-window management. Use a budget-aware consolidator to set k to the space remaining in the prompt, trimming, merging, or compressing passages as needed, so that the pipeline works across small and large context windows (for example, 4k–8k and beyond) [86,139].
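A budget-aware consolidator of this kind can be approximated by greedy packing. The whitespace token counter and the truncate-last-passage rule below are deliberate simplifications of what production systems do, for illustration only:

```python
def pack_context(passages, token_budget, count_tokens=lambda p: len(p.split())):
    """Greedy budget-aware packing: keep passages in (re-ranked) order while
    they fit; truncate the first overflowing passage rather than drop it."""
    packed, used = [], 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost <= token_budget:
            packed.append(passage)
            used += cost
        else:
            remaining = token_budget - used
            if remaining > 0:
                packed.append(" ".join(passage.split()[:remaining]))
            break
    return packed

passages = ["a b c d e", "f g h", "i j k l"]   # toy re-ranked passages
ctx = pack_context(passages, token_budget=7)
```

Because passages arrive already re-ranked, greedy packing preserves the most relevant evidence first; richer consolidators replace the truncation step with summarisation or merging.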
These plumbing stages create a token-efficient, high-recall foundation that underpins adaptive, controllable, and cost-effective RAG architectures. This groundwork bears directly on making retrieval more trustworthy and efficient in practice. In the next section, we examine how intelligent prompt and query strategies transform the front end of RAG into an active, programmable interface.

5.2.2. Prompting & Query Strategies: Making the Front-End Intelligent

Standard RAG typically issues a single literal query, retrieves top-k passages, and concatenates them into a fixed prompt for generation. This baseline often treats the prompt as a static container rather than an instrument for steering retrieval and inference. In contrast, recent prompting and query strategies reconceptualise the prompt as an active control interface that selectively modulates grounding, reformulates queries, and sequences reasoning with tool use. In RAG, performance is driven less by model size than by two factors: how we frame the query (which controls what is retrieved) and how strictly we require the model to use the retrieved evidence.
Flexible grounding and structural prompting. RAG reduces hallucinations on knowledge-intensive tasks by conditioning answers on retrieved passages and enabling provenance attribution, yielding more factual outputs than parametric models alone [1]. Beyond prescriptive prompting, retrieval composition itself can regularise behaviour: deliberately adding irrelevant documents (“noise”) to the context can improve answer accuracy and robustness by counteracting misleading high-scoring passages [137]. Domain scaffolds further formalise evidence: workflow synthesis expressed in JSON [140], organ label tags for radiology reports [141], or hybrid text–graph templates for multi-hop knowledge-graph reasoning [142]. Compared to the free-form concatenation of standard RAG, these wrappers restrict the output format, reduce cognitive load, and improve faithfulness by aligning the generator’s attention with well-structured evidence.
Relative to baseline RAG, structural prompting improves relevance and robustness by imposing schemas that suppress spurious correlations, though it may add authoring overhead and requires schema governance to avoid brittleness in open-domain settings [1,137,140,141,142]. Future work should quantify how schema granularity trades off against generalisability across domains.
Query reformulation, expansion, and selective query triggering. Allowing the model to expand or rewrite a user query typically improves recall by surfacing semantically diverse contexts; multi-query expansion issues bundles of related queries and merges their evidence downstream [131]. However, issuing additional queries indiscriminately increases latency and noise. To address this, uncertainty-aware controllers such as FLARE and RIND+QFS trigger retrieval only when token-level entropy spikes, thus avoiding unnecessary index lookups and focusing retrieval on genuinely uncertain spans [143,144]. In specialised settings, lightweight agents first extract salient entities (for example, disease names) and then query structured stores to reduce vocabulary mismatch and improve precision [57]. For streaming code completion, continuous query updates track the evolving context so that cross-file references remain current, a capability that standard single-shot RAG lacks [145].
Compared to baseline, reformulation and entropy-triggered querying improve recall-precision balance and control latency, but they rely on robust fusion or re-ranking to prevent evidence dilution when multiple queries are issued [57,131,143,144,145]. Open questions include how to calibrate entropy thresholds across domains and how to amortise multi-query costs under tight latency budgets.
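The entropy-triggered policy described above can be sketched in a few lines. The sketch below is illustrative only: the threshold value and the toy probability distributions are assumptions, not figures taken from FLARE or RIND+QFS.

```python
import math

def token_entropy(dist):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def should_retrieve(token_dists, threshold=1.0):
    """Fire the retriever only when some generated token's entropy exceeds
    the threshold, i.e. the model is genuinely uncertain about that span."""
    return any(token_entropy(d) > threshold for d in token_dists)

# Peaked distribution: the model is confident, so no retrieval call.
confident = [[0.97, 0.01, 0.01, 0.01]]
# Near-uniform distribution: high entropy triggers a retrieval round.
uncertain = [[0.25, 0.25, 0.25, 0.25]]
```

In a full controller, the spans whose tokens exceeded the threshold would then be masked and turned into the focused search string.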
Example-augmented prompting (retrieval-augmented in-context learning). Retrieval-augmented in-context learning dynamically inserts near-neighbour exemplars while assembling the prompt. Systems such as R-GQA and MolReGPT incorporate similar question–answer pairs, improving accuracy at a modest token cost [60,146]. Time-aware variants add hard negative examples so that the model learns when not to retrieve, mitigating over-reliance on stale or irrelevant context [88]. Confidence-conditioned prefixes further allow the generator to modulate trust in retrieved snippets by signalling low certainty, which reduces the risk of over-fitting to misleading passages [83]. Relative to standard RAG, which typically lacks task-specific exemplars, these strategies better align the prompt distribution with the current query manifold.
Example-augmented prompting enhances relevance and robustness, particularly for specialised or temporally sensitive queries, but raises curation questions (which exemplars, how many and how to manage drift) and requires careful token budget management to avoid context saturation [60,83,88,146]. Promising directions include adaptive exemplar selection driven by utility estimates rather than fixed k.
Deliberate reasoning before retrieval. ReAct-style prompts interleave Thought, Action, and Observation, allowing the model to plan tool calls, execute retrieval, and revise its plan iteratively [147]. Graph-of-Thought extends this idea by decomposing the question into sub-problems, each with a targeted retrieval hop, before composing a final answer [148]. These patterns depart from the standard RAG one-pass pipeline by explicitly sequencing reasoning and evidence gathering. However, such scaffolds can accidentally expose sensitive content if intermediate thoughts are logged or reflected back to the user, underscoring the need for strict access control and privacy-aware prompt design [149].
Reason-first strategies improve multi-step fidelity and reduce retrieval of irrelevant context by aligning evidence to sub-goals, at the cost of additional control complexity and potential privacy risks if traces are not properly contained [147,148,149]. Future research should formalise safety-preserving variants that preserve trace benefits without leaking private artefacts.
Operational policy, fusion, and safety. Empirically, explicit prompt policies, such as clear grounding clauses, zero-temperature reasoning steps and domain-specific wrappers, often match or exceed the benefits of introducing a new retriever [65]. However, query expansion must be paired with fast fusion or re-ranking to curb latency and maintain precision as the number of evidence candidates grows [131]. Field-specific schemas (for example, ECG JSON blocks in cardiology) improve reliability in safety-critical applications relative to open completions [150]. Finally, the prompt itself is an attack surface; sanitising complex instructions and constraining tool outputs are, therefore, mandatory operational controls.
Overall implications for the research question. Across these categories, innovative prompting and query strategies advance, challenge, and in some cases redefine standard RAG by (i) making grounding adaptive and schema-aware, (ii) coupling query reformulation with uncertainty-aware triggering, (iii) leveraging exemplar retrieval to shape the prompt distribution, and (iv) sequencing reasoning to target retrieval more precisely. In general, these methods often yield larger improvements per unit cost than architectural changes, particularly when prompts are treated as versioned, testable artefacts, much as code is, so that RAG systems become more controllable, economical, and safe to deploy [65,131,150].

5.2.3. Hybrid and Specialised Retrievers: No Single Needle-Finder

Early RAG systems typically rely on a single, dense passage retriever whose top-k chunks are appended, wholesale, to the generator input. A striking commonality in the more recent literature is the rejection of this monolithic design in favour of hybrid retrieval: lexical and dense signals are combined, cascaded or adaptively weighted, often alongside domain-specific similarity functions or graph indices. The consensus that emerges is clear: no single similarity metric can surface every useful evidence fragment.
Work in the clinical domain illustrates the value of score-level fusion: MEDRAG aggregates BM25 with up to three dense retrievers by Reciprocal Rank Fusion and records gains of 3–6 percentage points in top-5 recall for medical QA [65]. A more generalisable variant is the Blended Retriever, which stitches together BM25, KNN-dense, and sparse-encoder indices behind a unified API; exhaustive sweeps over six query formulations reveal that the fused output consistently outperforms the best individual index, without task-specific fine-tuning [81]. Similar ideas appear in open-source toolkits such as Auto-RAG, which expose multiple indices at runtime and leave the choice to a lightweight policy learner or the user [151]. These studies collectively suggest that recall drops caused by the “long tail” of lexical variability can be mitigated without costly supervision, provided that one is willing to maintain multiple indices.
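Reciprocal Rank Fusion, the score-level combiner used in MEDRAG-style hybrids, is simple enough to sketch directly. The document IDs and ranked lists below are toy values; the constant k = 60 is the conventional default from the RRF literature, not a value reported by the cited systems.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank
    Fusion: score(d) = sum over lists of 1 / (k + rank_in_list(d)).
    Documents ranked highly by several retrievers float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: a lexical (BM25-style) list and a dense list disagree;
# fusion rewards "d1", which both retrievers rank near the top.
bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF uses only ranks, not raw scores, it needs no calibration between lexical and dense scoring scales, which is precisely why it suits heterogeneous index mixtures.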
Several papers move beyond static mixtures and train the system to adaptively decide where to sample evidence. In event argument extraction, Adaptive Hybrid Retrieval samples pseudo-demonstrations from continuous semantic regions defined jointly over document and schema embeddings, delivering a five-point F1 improvement over nearest-neighbour baselines [77]. A complementary strategy, introduced for legal case reasoning, learns dual embeddings: one space captures the similarity between questions, the other the affinity between questions and supporting evidence; a learned weighting scheme then privileges either signal depending on the input [69]. Such results hint at a future in which hybrid retrieval is learned rather than manually engineered.
Hybridisation is especially powerful when it exploits structure that generic dense vectors cannot easily encode. For knowledge-graph question answering, a dual-level pipeline first retrieves entity neighbours or thematic nodes using keyword matching, then refines the candidate set with vector similarity; this combination captures both symbolic locality and semantic relatedness and proves markedly more accurate than flat chunk retrieval [72]. In code intelligence, lexical overlap remains a robust signal of syntactic similarity, whereas a fine-tuned dense retriever better captures semantics; a two-stage hybrid first filters with BM25 and then re-ranks with a CodeT5-based encoder, cutting irrelevant patches by more than one third [80]. Multimodal cascades follow the same philosophy: an image-to-text system uses CLIP similarity to shortlist images whose titles match a visual prompt, then applies a text encoder to retrieve the precise passages required for answer generation [63].
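The two-stage cascade pattern (cheap lexical filter, then dense re-rank) can be sketched as follows. This is a minimal illustration, not the cited BM25 + CodeT5 system: word overlap stands in for BM25, and the embedding table holds hand-picked toy vectors.

```python
def hybrid_retrieve(query, docs, embed, query_vec, shortlist_size=2):
    """Stage 1: lexical-overlap filter (a stand-in for BM25) builds a
    shortlist. Stage 2: a dense re-ranker orders the shortlist by cosine
    similarity between the query vector and each doc's embedding."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
        return dot / norm if norm else 0.0

    shortlist = sorted(docs, key=overlap, reverse=True)[:shortlist_size]
    return sorted(shortlist, key=lambda d: cosine(query_vec, embed[d]),
                  reverse=True)

# Toy corpus: two docs tie on lexical overlap; the dense stage breaks the tie.
docs = ["fix null pointer in parser",
        "update readme",
        "parser crash on null input"]
embed = {"fix null pointer in parser": [1.0, 0.0],
         "update readme": [0.0, 0.0],
         "parser crash on null input": [0.6, 0.8]}
ranked = hybrid_retrieve("null pointer parser crash", docs, embed,
                         query_vec=[0.0, 1.0])
```

The lexical stage keeps the candidate set small and syntactically plausible; the dense stage then supplies the semantic ordering, mirroring the division of labour described above.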
Hybridisation also affects how evidence is consumed. RETRO++ routes the single most relevant chunk directly to the decoder, where it can influence every token, while sending additional passages to the encoder as background context, yielding significant gains on open-domain QA without increasing sequence length [152]. Such architectural nuances reinforce the broader lesson that retrieval and generation cannot be optimised in isolation.
Although the quality gains are unambiguous, hybrid designs are not free. Maintaining several indices requires more memory and imposes separate refresh cycles; empirical studies report end-to-end latency increases of 5–50 ms per query on commodity GPUs. Where low latency is mandatory, selective trigger policies, e.g., avoiding dense retrieval for purely factual lexical queries, recover much of the benefit at a fraction of the cost [135]. However, very few papers measure index update overhead or the engineering effort needed to keep blended systems in sync with evolving corpora.
Two methodological gaps remain. First, cross-domain robustness is largely untested: most hybrids are tuned and evaluated on the same corpus, leaving questions about their behaviour when the domain shifts. Second, security aspects (how fusion strategies cope with poisoned sub-indices or adversarial trigger documents) are almost entirely unexplored. Bridging these gaps will require shared benchmarks that couple quality metrics with latency, energy, and robustness reporting.
The evidence base demonstrates that retrieval heterogeneity is a virtue: lexical scoring anchors precision, dense vectors widen semantic recall, structure-aware indices inject domain priors, and increasingly, learnt policies decide which mixture to trust. Treating retrieval composition as a first-class, configurable module, rather than a line in the appendix, appears to be essential for the next generation of reliable and efficient RAG systems.

5.2.4. Structure-Aware & Graph-Based RAG: “Talk to Me in Triples, Not Tokens”

A growing strand of work argues that retrieval-augmented generation should reason over relations rather than over flat passages. By turning documents, captions or code into nodes and edges, these systems place LLMs in environments where neighbourhood, path and provenance are explicit. The result is a family of structure-aware or graph-based RAG pipelines that differ from the canonical baseline DPR + seq2seq at every stage, from indexing to decoding.
The first departure is at retrieval time. Instead of ranking passages in isolation, systems such as G-Retriever construct a minimal connected sub-graph that already encodes multi-hop context before it is shown to the LLM [85]. Knowledge-Graph Prompting extends the idea to ad hoc graphs built on whole document collections, thereby recovering passages that are jointly rather than individually relevant [47]. Biomedical variants prune domain KGs aggressively: KG-RAG selects only the “prompt-aware” neighbourhood of SPOKE, halving token expenditure without loss of precision [57]. Across these studies, the lesson is consistent: a few well-chosen triples beat many loosely related sentences.
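The “prompt-aware” pruning idea can be sketched with a toy triple store. The code below is an assumption-laden illustration, not KG-RAG itself: simple word overlap stands in for a proper entity linker, and the knowledge-graph triples are invented examples.

```python
def prompt_aware_triples(question, kg, hops=1):
    """Keep only triples reachable within `hops` of entities literally
    mentioned in the question. A real system would use an entity linker
    rather than bare word overlap to seed the frontier."""
    frontier = set(question.lower().split())
    selected = []
    for _ in range(hops):
        new = [(h, r, t) for (h, r, t) in kg
               if (h in frontier or t in frontier) and (h, r, t) not in selected]
        selected.extend(new)
        for h, _, t in new:
            frontier.update((h, t))   # grow the neighbourhood for the next hop
    return selected

# Toy biomedical KG: only the aspirin neighbourhood should survive pruning.
kg = [("aspirin", "treats", "headache"),
      ("aspirin", "interacts_with", "warfarin"),
      ("warfarin", "treats", "clots"),
      ("ibuprofen", "treats", "fever")]
triples = prompt_aware_triples("does aspirin interact with other drugs", kg)
```

Passing only the selected triples to the generator, rather than whole passages, is what yields the token savings reported above.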
Once a graph has been selected, it must align with the token world of the generator. Two strategies dominate. Soft-prompt projection feeds the LLM a dense prefix derived from a Graph Neural Network encoder; Graph Neural Prompting shows that a learned projector lets the language model attend to sub-graph semantics without retokenising long edge lists [153]. In contrast, mixed-modal encoders treat each document embedding as a latent token. The xRAG architecture concatenates a textual view with such projected embeddings, while RAG-Token marginalises over latent documents so that each generated word may be conditioned on a different evidence source [51]. These designs blur the boundary between retrieval and generation, but they also introduce computational overhead: fast approximate decoding is now an open system challenge.
Manual curation of graphs is untenable, so recent work automates their creation. A graph-based text indexer segments documents, extracts entities and relations with an LLM, then maintains the structure as a hybrid keyword and vector store that supports both lexically exact and semantic queries [72]. Customer-service pipelines construct dual-level graphs in which intra-ticket trees are inter-linked via clone or reference relations; a two-step process retrieves a sub-graph and then issues Cypher queries for precise answer extraction [154]. In code intelligence, static analysis graphs are fused with retrieved exemplars so that program repair models reason simultaneously over abstract syntax and concrete fixes [80,155]. Across these domains, the “graph first, dense fallback” pattern has become the pragmatic recipe: traversal is attempted, but vector similarity remains a safety net.
Structure-aware RAG is also proving its worth beyond text. Vision language pipelines ground image regions in Wikidata entities using a CLIP-based retriever, allowing captioning models to cite explicit facts rather than hallucinating [53]. Multimodal captioning systems encode images, retrieved captions, and their cross-caption relations in a single transformer, improving rare-concept coverage and faithfulness [156,157]. These studies confirm that the graph perspective can bridge modality gaps as well as logical ones.
The empirical gains are grouped around three themes. First, answer faithfulness rises when the model can quote paths or node identifiers, giving analysts concrete error traces [64,85]. Second, token efficiency improves because graph neighbourhoods are far denser information carriers than flat chunks; prompt length drops by 40–60% in biomedical QA [57]. Third, graphs offer natural hooks for explainability: users can inspect which edge or entity grounded a statement, an impossibility when evidence is a text passage spanning many pages.
However, significant obstacles remain. Current pipelines depend on brittle entity linkage and bear the cost of stale or mislinked nodes. Incremental update algorithms exist [72], but their impact on answer drift over months is unknown. Finally, evaluation practices lag: while factual QA has BLEU and EM, graph RAG lacks agreed metrics for edge coverage or topological correctness, hindering cross-paper comparison.
We expect structure-aware RAG to converge on three design principles: lightweight on-the-fly KG construction; learned policies that choose pragmatically between graph traversal and vector search; and plug-in projection layers that make any LLM “graph-ready” without bespoke retraining. As modalities proliferate (e.g., tables, time series, 3-D scenes), the foundational insight stays the same: represent knowledge in the form that preserves its relations, then let the language model converse in that richer vocabulary.

5.2.5. Iterative & Active Retrieval Loops: From Static Context to Conversational Search

Work published during the past two years reveals a decisive migration from the traditional “retrieve-then-generate” pipeline towards closed-loop systems in which LLMs continually query external knowledge, inspect their own drafts, and revise both the retrieval context and the answer. These approaches treat the retriever not as a one-off helper but as a conversational partner that can be invoked, ignored, or re-invoked in response to model uncertainty, verification feedback, or evolving sub-goals.
A first line of work equips the generator with an uncertainty trigger. In FLARE, the model examines each newly generated sentence for high-entropy spans; when token-level uncertainty exceeds a threshold, it halts generation, masks those spans, emits a focused search string to the retriever, and then regenerates the sentence [143]. DRAGIN generalises this idea: Real-time Information Needs Detection (RIND) blends token-level entropy with self-attention salience to decide when to retrieve, and Query Formulation by Self-attention (QFS) selects which tokens should form the query, improving recall without compromising precision [144]. Consider the question “How many valves does the human heart have?”. In these designs, the model literally emits a search string token (e.g., <SEARCH> how many valves in the human heart?), which the orchestration layer interprets as a call to the retriever, giving the loop an explicit and inspectable hand-off point.
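The hand-off point described above can be made concrete with a small orchestration loop. The `<SEARCH>` control string follows the example in the text; the toy generator and retriever are invented stand-ins, not the cited models.

```python
def orchestrate(generate, retrieve, question, max_rounds=3):
    """Minimal active-retrieval loop: the generator may emit an explicit
    '<SEARCH> ...' control string; the orchestration layer intercepts it,
    calls the retriever, and re-invokes generation with the new evidence."""
    evidence = []
    answer = ""
    for _ in range(max_rounds):
        answer = generate(question, evidence)
        if answer.startswith("<SEARCH>"):
            evidence.extend(retrieve(answer[len("<SEARCH>"):].strip()))
        else:
            break
    return answer

# Toy stand-ins for the LLM and retriever (illustrative only).
def toy_generate(question, evidence):
    if not evidence:
        return "<SEARCH> how many valves in the human heart?"
    return "The human heart has four valves."

def toy_retrieve(query):
    return ["Anatomy text: the heart has four valves."]

answer = orchestrate(toy_generate, toy_retrieve,
                     "How many valves does the human heart have?")
```

Because the hand-off is an explicit string, every retrieval call is inspectable in logs, which is the property the text highlights.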
SELF-RAG uses reflection tokens (Retrieve, ISREL, ISSUP, ISUSE) to trigger retrieval, assess evidence and critique outputs, giving segment-level control [96]. The biomedical Self-RAG further extends the mechanism by training a domain-specific critic language model whose reflection tokens signal both the need for retrieval and the subsequent relevance of the evidence [158]. Collectively, these studies demonstrate that triggering on model uncertainty recovers the majority of the accuracy gains of full iterative pipelines while invoking the retriever only when it is genuinely useful. Parallel work on agentic systems confirms this principle: SELF-RAG [96], DRAGIN [144] and TA-ARE [88] introduce explicit decision tokens, entropy thresholds and veto classifiers that suppress unnecessary searches, trimming 15–45% of context tokens with negligible loss in fidelity.
A second group of research emphasises iterative refinement. The CHAIN-OF-NOTE (CON) framework obliges the LLM to write concise “reading notes” for each retrieved document, thereby exposing document reliability and reducing hallucination before synthesis of the final answer [159]. Batch grounding strategies process evidence in successive mini-batches, stopping as soon as adequate justification is found and injecting the progressively revised answer back into the context, a tactic that curbs noise and token bloat [48]. RAT performs a stepwise revision of an explicit chain of thought, generating a new query for each reasoning step and localising corrections instead of rewriting entire explanations [160]. Verification-driven loops such as KALMV enact automatic error rectification: if a verifier flags a retrieval or grounding fault, the pipeline re-retrieves new passages or re-generates the answer until the verifier is satisfied, closing the loop on both failure points [138]. Agentic pipelines strengthen this pattern by exposing each stage (retrieval, reranking, refinement and generation) as discrete, inspectable actions inside modular toolchains such as RALLE [61] and MEDRAG [151], making revision steps debuggable and reusable.
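A verification-driven loop of the KALMV kind can be sketched as a small controller. The verdict labels and the toy retriever, generator and verifier below are assumptions chosen for illustration; the real system trains a verifier model rather than using string checks.

```python
def verified_answer(retrieve, generate, verify, question, max_tries=3):
    """KALMV-style control loop sketch: a verifier labels each draft 'ok',
    'retrieval_error' or 'grounding_error'; the loop re-retrieves or simply
    regenerates accordingly until the verifier is satisfied (or gives up)."""
    passages = retrieve(question, attempt=0)
    draft = ""
    for attempt in range(1, max_tries + 1):
        draft = generate(question, passages)
        verdict = verify(question, passages, draft)
        if verdict == "ok":
            return draft
        if verdict == "retrieval_error":
            passages = retrieve(question, attempt=attempt)
        # on 'grounding_error' we keep the passages and regenerate next turn
    return draft

# Toy stand-ins: the first retrieval is bad, the second succeeds.
def toy_retrieve(question, attempt):
    return ["off-topic passage"] if attempt == 0 else ["relevant passage"]

def toy_generate(question, passages):
    return f"answer grounded in: {passages[0]}"

def toy_verify(question, passages, draft):
    return "ok" if passages[0] == "relevant passage" else "retrieval_error"

result = verified_answer(toy_retrieve, toy_generate, toy_verify, "q")
```

Separating the two failure verdicts is what lets the loop localise the fix: bad evidence triggers re-retrieval, while a bad draft over good evidence triggers regeneration only.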
When the original query is too sparse or ambiguous for high-recall retrieval, generation-augmented loops become effective. ITER-RETGEN feeds the model’s intermediate draft back to the retriever, providing increasingly informative queries at each turn [49]. ITRG offers two complementary modes: Refine, which updates an existing draft with only newly retrieved documents, and Refresh, which starts afresh from the latest evidence; alternating between these modes improves long-form document generation [92]. RepoCoder adopts the same principle for code completion, appending the most recent code continuation to the retrieval query so that cross-file context converges towards the intended target snippet [145].
A fourth strand decomposes the original task into smaller sub-problems and retrieves evidence in a multi-hop fashion. RA-ISF first checks whether the LLM already knows the answer, then filters irrelevant passages, and finally decomposes unanswered questions into simpler subquestions, recursing until each leaf is resolved [161]. SearChain externalises the reasoning trajectory as a Chain-of-Query tree, allowing the IR engine to verify or veto each hop and permitting back-tracking when evidence contradicts prior steps [162]. Graph-oriented systems traverse knowledge graphs node by node, either through an LLM-guided agent [47] or via a divide-and-conquer ego-graph search with learnable pruning [142], thereby combining symbolic relational structure with neural retrieval. Reason–act loops in the agentic literature echo this multi-hop spirit, alternating between planning, external tools and answer revision to accumulate evidence from diverse sources, for instance a ReAct-style clinical assistant [82] or the Retrieval-augmented Recommender System [52].
Finally, several papers exploit self-consistency or memory. SelfMem alternates between producing multiple candidate memories and selecting the best one to seed the next round of generation, enabling the model to bootstrap its own knowledge without external corpora [163]. A related idea is used in activity-pattern generation, where multiple hypothetical trajectories are rated for alignment with historical data before the most self-consistent plan is chosen [164]. The Knowledge-to-Response architecture separates knowledge prediction from response generation, giving an explicit checkpoint that can be inspected or re-executed if downstream verification fails [83].
Across these diverse implementations, a set of common lessons emerges. First, retrieval should be policy-driven: systems that fire the retriever only under measured uncertainty or verified need gain most of the quality benefits at a fraction of the computational cost. Second, local revision (editing one thought, sentence, or document at a time) prevents prompt lengths from exploding and keeps provenance transparent. Third, closed loops demand fail-safes: lightweight critic LMs or verifiers effectively halt divergence when early retrieval or generation steps go wrong. Lastly, latency and energy budgets vary dramatically between designs; rigorous reporting of retrievals-per-answer, wall-clock delay, and GPU minutes is essential if future work is to compare accuracy improvements on an equal footing.
These iterative and active retrieval loops recast RAG as an interactive search companion. By recognising their own knowledge gaps, gathering fresh evidence on demand, and continuously revising their reasoning, modern RAG systems approach the discipline of a human researcher. The next frontier is to make these loops budget-aware and embed them in evaluation frameworks that reward knowledge fidelity and resource efficiency.

5.2.6. Memory-Augmented RAG: Personalisation and Long-Horizon Context

Early retrieval-augmented systems were stateless: each turn re-embedded the user’s query, retrieved passages, concatenated them, and produced an answer. However, domains like education, clinical care and personal assistance benefit from knowledge that accumulates and varies by user. Thus, a family of memory-augmented RAG architectures has emerged, persisting dialogue turns, sensor readings, search history or model-generated thoughts beyond a single query.
One line of work introduces short-horizon conversational buffers. In education, MoodleBot allocates a vector store per course and rewrites follow-ups into standalone queries that include recent turns; students rate its coherence far above a buffer-free baseline [59]. Likewise, LangChain’s ConversationBufferMemory retains the chat transcript for retriever and generator use, boosting F1 by over eight percentage points on follow-up QA benchmarks across domains [66].
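The buffer-plus-rewrite pattern can be sketched minimally. The class name, turn limit, and the naive concatenation rewrite are illustrative assumptions; production systems (including the cited ones) typically use an LLM to produce the standalone query rather than string joining.

```python
class ConversationBuffer:
    """Short-horizon memory sketch: keep the last `max_turns` exchanges
    and prepend the user side of them when turning a follow-up question
    into a standalone retrieval query."""

    def __init__(self, max_turns=3):
        self.turns = []
        self.max_turns = max_turns

    def add_turn(self, user_msg, assistant_msg):
        self.turns.append((user_msg, assistant_msg))
        self.turns = self.turns[-self.max_turns:]   # evict oldest turns

    def standalone_query(self, follow_up):
        history = " ".join(user for user, _ in self.turns)
        return f"{history} {follow_up}".strip()

buf = ConversationBuffer(max_turns=2)
buf.add_turn("What is dense passage retrieval?",
             "DPR embeds queries and passages in a shared space ...")
query = buf.standalone_query("Who proposed it?")
```

Even this crude rewrite lets the retriever resolve the pronoun “it”, which a stateless pipeline cannot do.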
Beyond fleeting context, some systems maintain persistent, user-specific memories. LiVersa’s hepatology assistant separates long-term documents (e.g., discharge summaries), short-term signals and a dynamic slot of the fifteen latest queries. Selective retrieval from these stores cuts hallucinations by ∼25 per cent and halves prompt length [165]. The entity-centric store K_E timestamps canonicalised entities from browsing history, storing compact IDs rather than raw text; this achieves personalisation with strong privacy and mere megabytes per user [166]. Similarly, the agentic Brain logs every perception–thought–action tuple and recalls them to aid planning in complex optimisation tasks [71].
Another approach embeds memory within the model. Retrieval Augmentation Mechanism (RAM) for video captioning initialises a key–value store with hidden states from teacher-forcing; at inference the decoder attends this store, injecting linguistic and visual cues and raising CIDEr by nearly 10 per cent on MSR-VTT [167]. SelfMem appends its own generations to a growing memory pool, lowering retrieval latency over time while BLEU keeps improving [163]. A clustered memory module groups millions of examples into centroids, allowing soft interpolation or hard selection so the generator exploits abstracted task knowledge rather than a few nearest neighbours [168].
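The soft-interpolation read over clustered memory can be sketched as a softmax-weighted mixture of centroid values. Shapes, vectors and the temperature value are toy assumptions; the cited module operates over millions of learned examples, not two hand-written centroids.

```python
import math

def soft_memory_read(query, centroids, values, temperature=1.0):
    """Clustered-memory sketch: score each centroid against the query
    (dot product), softmax the scores, and return the interpolated value
    vector. A low temperature approaches hard nearest-centroid selection."""
    sims = [sum(q * c for q, c in zip(query, cen)) for cen in centroids]
    exps = [math.exp(s / temperature) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Two orthogonal centroids; the query matches the first almost exactly.
centroids = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
read = soft_memory_read([1.0, 0.0], centroids, values, temperature=0.1)
```

Raising the temperature smooths the weights towards a uniform blend, which is how the module trades hard selection against abstracted task knowledge.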
Despite gains in coherence, relevance and efficiency, open challenges remain. Few works address memory governance: LiVersa encrypts clinical memories at rest and K_E avoids raw text, yet no standards exist for retention, revocation or audit. The second issue is forgetting: none of these works implement principled eviction or decay, despite stale or erroneous memories causing model drift. Finally, evaluation stays narrow: basic accuracy metrics dominate, while longitudinal measures (trust calibration, drift detection, catastrophic memory errors) are seldom reported.
Memory-augmented RAG shifts the objective from “answering the current question” to “accompanying the user over time”. Whether through lightweight buffers, structured personal knowledge graphs, or train-time key–value abstractions, integrating memory with retrieval and generation paves the way for truly adaptive, user-centred assistants. To move beyond prototypes, these systems must tackle privacy, life-cycle management, and long-term robustness.

5.2.7. Agentic & Multi-Tool Pipelines: Orchestrating Reasoning, Tools and Memory

Where the previous sections zoom in on what to retrieve (hybrid indices, structure-aware graphs), when to retrieve (uncertainty-driven loops) and where to store past context (memory buffers and personal knowledge bases), the emerging notion of an agent asks a broader systems question: How can a language-model controller weave all of these capabilities, including retrievers, memories, external APIs, calculators, and even other LLMs, into a single adaptive execution plan?
Under the hood, each agent exposes a toolbox of heterogeneous capabilities. Hybrid retrievers supply both lexical and dense evidence; structure-aware traversals explore knowledge graphs; memory stores cache past interactions; and domain plugins execute arbitrary APIs—from code compilers to database queries. For example, MEDRAG’s laboratory orchestrates five discrete steps (Judger, Retriever, Reranker, Refiner, Generator) in a fixed graph, while RALLE provides practitioners with a drag-and-drop canvas to create custom pipelines in real time [61,151].
How does the controller decide its next move? Research is grouped around three design patterns. In static graphs, the flow is scripted (for example retrieve–rerank–generate), but nodes can be toggled at runtime (for instance, switching from a general-purpose index to a proprietary one when domain drift is detected). Dynamic planning agents interleave Thought, Action and Observation tokens, letting the model plan each step; for example, should it consult the calculator or dive into long-term memory next? And learned controllers treat tool selection as a reinforcement-learning problem, optimising for latency, cost and accuracy under real-world constraints [52,82,154].
Memory is not an afterthought but a peer of retrieval. Short buffers prevent conversational dead ends, but true agency emerges when the system logs every perception-thought-action tuple for hours or even days. LiVersa’s hepatology assistant splits data into long-term documents, streaming vitals and a sliding window of recent queries; the result is a 50% reduction in hallucinations and half the prompt length [165]. The Brain architecture goes further, treating each memory as an explicit action token that the agent can revisit when planning complex optimisation tasks [71].
Orchestration unlocks tangible benefits and magnifies new risks. On the upside, agents can superintend long-horizon workflows (from syllabus design to lab automation), hot-swap tools when one fails, and gracefully fall back on alternative evidence sources. However, this flexibility invites debugging nightmares: tracing a misstep through a branched execution graph is much harder than inspecting a single “retrieve-then-generate” call. Credit assignment across cascaded tools remains unresolved, and persistent memories demand rigorous governance for retention, revocation and audit [166].
Looking ahead, agentic RAG must mature from ad hoc scripts to dependable infrastructure. We need vendor-neutral DSLs to describe tool graphs, unified dashboards that report accuracy alongside latency, energy consumption and privacy metrics, and formal memory policies that prevent drift and data leakage. Once these scaffolds are in place, controllers will be free to juggle dozens of modules, turning retrieval-augmented models into retrieval-augmented systems.

5.2.8. Efficiency & Compression: Token Budgets Still Matter

The first time a production team wired a 32 K-token model into its help-desk bot, the GPU bill doubled overnight. The lesson landed quickly: long contexts feel free, but every extra symbol still burns memory, latency, and cash. Recent papers therefore chase leaner recipes that keep answers faithful whilst maintaining efficiency [62,169,170].
Why carry an entire document when a single learned vector will suffice? xRAG maps each retrieved passage to a single document token, reducing the retrieved context from roughly 175 tokens to one and delivering task performance comparable to uncompressed RAG, while also lowering compute (a 3.53 × reduction in GFLOPs) and improving speed (a 1.64 × speed-up in CUDA time) [169]. Biomedical variants prune entire graph branches; KG-RAG restricts its prompt to “prompt-aware” triples and nevertheless improves robustness [57]. Even simple prompt engineering helps: RAPT stores most tunable weights in a global prefix and keeps per-example infixes small [76].
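The core of the xRAG idea, a learned linear projector mapping one passage embedding into the language model’s input-embedding space, reduces to a single matrix multiply. Dimensions and weights below are toy assumptions; in the real system the projector is trained and the spaces are hundreds of dimensions wide.

```python
def project_to_token(doc_embedding, projector):
    """xRAG-style sketch: apply a (learned) linear projector, given here as
    a list of rows, to map a passage embedding from retriever space into a
    single pseudo-token embedding in the LM's input space. One token then
    stands in for the ~175 tokens of the raw passage."""
    return [sum(w * x for w, x in zip(row, doc_embedding)) for row in projector]

# Toy shapes: retriever dimension 3 -> LM embedding dimension 2.
projector = [[0.5, 0.5, 0.0],
             [0.0, 0.5, 0.5]]
doc_embedding = [1.0, 2.0, 3.0]
token = project_to_token(doc_embedding, projector)
```

At inference the projected vector is simply concatenated with the ordinary token embeddings of the prompt, which is why no retraining of the language model backbone is needed.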
A bloated index slows everything downstream. One group runs an asynchronous re-encoder that refreshes FAISS shards while the system is online, so nightly jobs never block training [93]. Another treats megabyte-scale PDFs as single “long retrieval units”, resulting in thirty-fold smaller indices but the same recall [128]. Toolkits such as Parrot and Auto-RAG now expose multiple vector stores and show that picking the right dimensionality can yield larger speed gains than another hardware upgrade [56,65,89].
PipeRAG drags passages from the CPU while the GPU is already decoding, roughly cutting a third off end-to-end latency [62]. RAGCache predicts which passages are likely to be reused, warms the key-value cache, and initiates speculative decoding before the retriever responds. In a production trace, this approach reportedly halved the US dollar cost and reduced the 95th-percentile latency by 200 ms [79].
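The overlap of CPU-side retrieval with GPU-side decoding can be illustrated with a producer–consumer sketch. This is a simplification under stated assumptions: `retrieve` and `decode` are placeholder callables, and a real system would pipeline within a single request rather than across a request list.

```python
import threading
import queue

def prefetch_pipeline(requests, retrieve, decode):
    """PipeRAG-style overlap sketch: a background thread performs
    (CPU-side) retrieval for upcoming requests while the main thread
    decodes with passages already fetched, hiding retrieval latency
    behind generation time."""
    fetched = queue.Queue(maxsize=2)   # small buffer of pre-fetched evidence

    def fetch_worker():
        for req in requests:
            fetched.put((req, retrieve(req)))

    threading.Thread(target=fetch_worker, daemon=True).start()

    answers = []
    for _ in requests:
        req, passages = fetched.get()  # usually ready before decode finishes
        answers.append(decode(req, passages))
    return answers

answers = prefetch_pipeline(
    ["q1", "q2"],
    retrieve=lambda q: [f"passage for {q}"],
    decode=lambda q, p: f"{q}: {p[0]}",
)
```

The bounded queue is the design knob: it caps memory spent on pre-fetched passages while still keeping the decoder fed.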
RETRO++ adjusts retrieval cadence analogously to adaptive bitrate streaming: fetch every token for maximum quality, or every few hundred for speed; quality degrades smoothly rather than collapsing [152]. PipeRAG pushes adaptivity further, tuning its cadence at runtime to respect a global latency budget [62]. Other teams precompute dense knowledge stores offline, shifting the heaviest computation away from the critical path [157,171].
In these approaches, compression is no longer a lossy compromise; it is a design posture. Whether by projecting documents into single embeddings, refreshing indices on the fly, overlapping compute, or throttling retrieval frequency, modern RAG systems show that frugality can coexist with accuracy. Future benchmarks should report energy (joules) and monetary cost alongside EM and BLEU; otherwise, we will continue to top leaderboards whilst exceeding budget constraints.

5.2.9. Modality Expansion: RAG Beyond Plain Text

Early RAG systems treated all knowledge as plain text; however, a single X-ray caption or table row can transform an answer when document layout and visual context are taken into account. Imagine a disaster-response chatbot that not only quotes tweets but overlays them on live satellite imagery. This fusion is now within reach thanks to unified multimodal backbones. MuRAG, for example, couples a Vision Transformer with a T5 encoder–decoder so that images and text share the same embedding space, letting a prompt about “the mysterious lesion” fetch both radiology reports and the relevant chest X-ray as a single learned token projection, without retraining the language model for each modality [58,169]. Meanwhile, xRAG shows that whole documents (whether PDF, PNG or CSV) can collapse into one compact token, reducing context length and memory use without sacrificing answer quality [169].
Beyond model-level tweaks, contemporary orchestration frameworks expose pluggable components: engineers can configure CLIP-style embedding models for image/text retrieval, Whisper-based audio transcription and HTML/CSV/Excel loaders with minimal code changes, and then index outputs in interchangeable vector stores. In practice, frameworks such as LangChain provide loaders for web pages and YouTube transcripts, Whisper parsers, Pandas/CSV tooling and a common vector-store interface; this allows a single workflow to draw on web pages, video transcripts and tabular datasets, with retrieval improving grounding in downstream generation [89,172].
In clinical imaging, one line of work retrieves text using contrastively pre-trained vision–language encoders (e.g., ALBEF) and then prompts general-purpose language models (including GPT-4) to draft radiology findings; a separate line develops grounded report generation that links textual findings to specific image regions, improving traceability beyond text-only outputs. Beyond imaging, retrieval augmentation has also been explored for lay-language clinical communication and explanation [141].
Yet challenges remain: CLIP-style joint spaces work well for vision and language but falter on tables or code snippets; scale-up strains storage budgets when every video frame becomes an index entry; and privacy controls for sensitive modalities, from medical scans to CAD files, have no industry standard. Addressing these gaps will make multimodal RAG not just possible, but dependable.
Practical implications follow directly. Start with low-cost wins: add re-ranking and light context filtering; make k and inclusion budget-aware; version prompts/schemas as first-class artefacts. Prefer hybrid sparse+dense retrieval for recall stability; use boundary- or domain-aware chunking when coherence matters; and adopt uncertainty-triggered querying so retrieval fires only when needed. Wrap generation with lightweight verification or critique to localise fixes (re-retrieve vs. regenerate). Treat memory as a product surface (short buffers now; user stores only with governance). For efficiency, consider document-token projection or long retrieval units to shrink context, overlap CPU retrieval with GPU decoding, and schedule index refreshes to balance freshness against cost. Report not just accuracy but retrievals-per-answer, latency, and spend.

5.2.10. Synthesis & Outlook

The evidence in this review indicates a clear shift from the canonical DPR + seq2seq pipeline towards modular, policy-driven architectures. Hybrid indices broaden coverage; structure-aware retrievers identify relations that are otherwise difficult to detect; and uncertainty-triggered loops request additional evidence only when model uncertainty is high [81,85,143]. The combined effect is higher top-k recall without overloading the generator with unnecessary tokens.
Closed-loop control and lightweight critics have transformed retrieval from a static pre-retrieval step into a dynamic, in-generation process. Verifiers can filter low-information snippets during generation, and memory buffers retain relevant prior context. Early deployments in medicine and education report reduced hallucination and improved personalisation [66,165]. Efficiency techniques such as document projection, speculative decoding, and cache-aware scheduling demonstrate that speed need not be sacrificed for rigour [62,169]. Token budgets remain a constraint; the most efficient token is the one the generator never processes.
Despite these advances, the field continues to rely on incomplete quality signals. Benchmarks often prioritise accuracy and rarely report cost. Few studies record retrievals per answer, GPU minutes or carbon emissions, and even fewer analyse how compromised sub-indices may influence agentic planning. Memory governance issues are seldom emphasised in system evaluations. Without shared yardsticks, reported gains are not readily comparable.
Future work should prioritise three directions to support the transition of RAG systems from prototypes to dependable infrastructure: developing holistic benchmarks that report not only accuracy but also retrieval latency, energy consumption and privacy guarantees; treating retrieval strategy as a resource-allocation problem, with policies that respect time, token and compute budgets rather than fetching evidence indiscriminately; and defining open, vendor-agnostic interfaces for heterogeneous indices (graphs, tables, images, streams) to enable drop-in retrievers without extensive pipeline refactoring.

5.3. What Are the Most Frequently Used Metrics for Evaluating the Effectiveness of Retrieval-Augmented Generation Systems?

Evaluating RAG reliably means scoring two coupled behaviours: can the system surface the right evidence, and does it use that evidence faithfully? In practice, the literature clusters around three families of measures. First, low-cost automated generation metrics (Accuracy, EM, $F_1$, BLEU/ROUGE, BERTScore, Perplexity) remain the most common because they scale, though they mostly capture surface overlap or semantic similarity rather than grounding fidelity. Second, retrieval metrics (Recall@k, Precision@k, MAP/MRR, nDCG, Hit@k, R-Precision) assess whether relevant evidence is present and well ranked, which, as we have seen, is an essential precondition for faithful generation. Third, human and LLM-as-judge assessments fill what the automated metrics miss: correctness relative to sources, hallucination/groundedness, completeness, clarity, safety, and user-centred qualities (satisfaction, usability). These streams show that no single metric suffices; RAG effectiveness is intrinsically multidimensional.
Relative to the classic NLG evaluation, RAG adds two notable shifts. The first is grounding awareness: beyond EM/ F 1 , studies increasingly score whether each claim is supported by retrieved context (support/faithfulness labels, groundedness checks) and whether the system abstains when evidence is insufficient (rejection/abstention rate). The second is system cost and robustness: reports are beginning to include latency, retrievals-per-answer, token budget, and sometimes cache/compute cost, alongside robustness to noise, adversarial passages, and prompt-in-context attacks. LLM-as-judge protocols are used to approximate human judgements at scale, but must be calibrated for bias and prompt sensitivity.
These choices have limits and conditions. Many automated scores reward lexical overlap, not factual grounding; LLM-as-judge can drift with prompt or model changes; human annotation is costly and variable; retrieval metrics can look strong while answers remain unfaithful; and few papers jointly report cost, robustness, and security alongside accuracy. Consequently, we recommend versioning prompts/judges, publishing annotation guides and agreement, and pairing answer metrics with retrieval diagnostics and budget/cost reports. The detailed sections that follow (automatic generation and retrieval metrics, human and LLM-judged evaluation, automated frameworks, benchmarks, and datasets) provide the empirical footing for this synthesis.

5.3.1. Automatic Generation Metrics

Automatic generation metrics quantify the fidelity, fluency, and informativeness of RAG outputs without human intervention. They fall into four broad categories: (1) classification-based metrics, (2) overlap-based n-gram metrics, (3) probabilistic metrics, and (4) specialised diversity and grounding metrics. Each offers unique insight and carries distinct limitations in the evaluation of retrieval-augmented generation.
Accuracy measures the proportion of generated responses that are correct out of all outputs. It provides a straightforward gauge of answer correctness, although it ignores partial matches or semantic equivalence [46,137]. Exact Match (EM) is a stricter binary metric: it reports the fraction of outputs that coincide exactly (character-for-character) with one of the reference answers [1,93]. EM is essential in tasks demanding verbatim precision, such as code generation or fact retrieval, but does not give credit for near-correct paraphrases.
The $F_1$ score is the harmonic mean of token-level precision and recall:
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
Precision is the fraction of overlapping tokens in the generated output that also appear in the reference; recall is the fraction of reference tokens recovered in the output. F 1 allows partial credit for overlap and is widely used in QA and summarization benchmarks (e.g., SQuAD, WebQSP) [93,153].
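The token-level $F_1$ above can be sketched in a few lines of Python (a minimal illustration assuming whitespace tokenisation; benchmark scripts such as SQuAD's official evaluator additionally normalise answers by stripping punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each overlapping token at most as
    # often as it appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat")` yields precision 2/3 and recall 1, hence F1 = 0.8.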
BLEU (Bilingual Evaluation Understudy) measures n-gram precision relative to one or more references and applies a brevity penalty to discourage overly short outputs:
\mathrm{BLEU} = \mathrm{BP} \times \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
where $p_n$ is the modified n-gram precision and $N$ is typically 4 [60,141]. Despite its ubiquity, BLEU's reliance on exact n-gram matches leads to poor sensitivity to synonymy and paraphrase [146,151].
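A minimal sentence-level BLEU sketch follows, with uniform weights $w_n = 1/N$ and the brevity penalty (an unsmoothed illustration; production implementations such as sacrebleu add smoothing and multi-reference support):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU: returns 0.0 if the candidate is
    shorter than max_n tokens or any clipped n-gram precision is zero."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ngrams.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```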
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) emphasises recall of n-gram matches; the ROUGE-L variant measures the longest common subsequence (LCS) between candidate and reference:
\mathrm{ROUGE}\text{-}\mathrm{L} = \frac{\mathrm{LCS}}{\mathrm{length}(\mathrm{reference})}
ROUGE-L captures sequence-level cohesion and is especially prevalent in summarization and long-form QA [1,149]. However, like BLEU, it fails to capture semantic similarity beyond surface overlap.
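The recall-oriented ROUGE-L formulation above can be sketched via a standard dynamic-programming LCS (a minimal illustration over whitespace tokens; published ROUGE toolkits also report precision and an F-measure variant):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate: str, reference: str) -> float:
    """ROUGE-L as LCS length divided by reference length."""
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0
```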
METEOR (Metric for Evaluation of Translation with Explicit ORdering) extends n-gram overlap by incorporating stemming, synonym matching, and a fragmentation penalty. It computes a weighted F-mean of unigram matches and typically shows higher correlation with human judgements than BLEU or ROUGE, at the cost of increased complexity [60,141].
BERTScore measures semantic similarity by comparing contextual token embeddings (e.g., RoBERTa base) between the generated text and the reference. It computes cosine similarities at the token level and aggregates them to produce a single score that better captures paraphrase and meaning overlap than surface n-gram metrics [76,141,173].
Perplexity quantifies a model’s uncertainty by exponentiating the negative logarithmic likelihood of the generated sequence:
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i) \right)
Lower perplexity indicates that the model predicts the next token with greater confidence [46,62]. Although useful for assessing fluency and coherence, perplexity does not directly measure alignment with retrieved evidence or task-specific correctness.
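Given per-token natural-log probabilities from a model, perplexity is a one-line computation (a sketch; in practice the log-probabilities come from the model's scoring API):

```python
import math

def perplexity(token_log_probs):
    """Perplexity as the exponentiated negative mean log-likelihood."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```

For instance, a model assigning uniform probability 0.25 to each of four tokens has perplexity exactly 4.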
  • Specialized Diversity & Grounding Metrics
Self-BLEU computes the BLEU score of each generation against its peers to quantify diversity (lower Self-BLEU indicates higher diversity) [152,174].
chrF++ evaluates character-level F-measure over character n-grams, capturing fine-grained similarity in morphologically rich settings [163].
Self-TER (Translation Edit Rate) measures the average edit distance between multiple outputs, thus quantifying novelty [76].
Support labels each generated claim as fully, partially, or not supported by the retrieved evidence, providing a direct check on factual grounding [96].
Rare F 1 and Predicted Knowledge F 1 (PKF1) focus on specialised tasks: Rare F 1 emphasises performance on low-frequency tokens, while PKF1 gauges the model’s ability to recover explicit knowledge sentences [140].

5.3.2. Automatic Retrieval Metrics

Effective retrieval is a prerequisite for high-fidelity generation in RAG systems. Automatic retrieval metrics quantitatively assess how well the retriever component selects and ranks relevant documents from a large corpus for a given query. In general, these metrics fall into (1) set-based measures, which evaluate the accuracy and completeness of the retrieved set, (2) ranking-based measures, which assess the ordered quality of the retrieval, and (3) hit-based measures, which capture the presence of any relevant document within a specified cut-off point.
Retrieval Accuracy computes the proportion of queries for which all retrieved documents are relevant, relative to gold-standard relevance judgements. By directly evaluating whether the retriever selects exclusively pertinent documents, it gauges the binary correctness of the retrieved set, a fundamental prerequisite for downstream generation [155].
Precision@k is defined as the fraction of the top k retrieved documents that are relevant. It measures the system’s ability to avoid including irrelevant items among its highest-ranked results [151,175].
Recall@k is the fraction of all relevant documents that appear within the top k positions, thereby capturing the completeness of the retrieval [63,151]. Together, they offer complementary views: precision penalises false positives at high ranks, while recall penalises false negatives within the cutoff.
F1@k is the harmonic mean of Precision@k and Recall@k, defined as
F_1@k = \frac{2 \times \mathrm{Precision}@k \times \mathrm{Recall}@k}{\mathrm{Precision}@k + \mathrm{Recall}@k}.
This balanced metric mitigates trade-offs between precision and recall, providing a single score that reflects both accuracy and completeness of the top-k retrieval [151].
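These set-based measures can be computed together from a ranked list of document ids and a gold relevant set (a minimal sketch; it assumes precision divides by k even when fewer than k documents are returned):

```python
def precision_recall_f1_at_k(ranked_ids, relevant_ids, k):
    """Set-based retrieval metrics over the top-k ranked document ids."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc in top_k if doc in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```

For a ranking ["d1", "d2", "d3", "d4"] with relevant set {"d1", "d4"} and k = 3, this gives precision 1/3, recall 1/2, and F1@k = 0.4.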
Mean Average Precision (MAP@k) averages the precision scores computed at each rank position where a relevant document occurs, then aggregates over all queries. Formally, for each query q,
\mathrm{AP}@k(q) = \frac{1}{N_q} \sum_{i=1}^{k} P(i)\,\mathbb{1}\{\mathrm{doc}_i \text{ is relevant}\},
where $N_q$ is the number of relevant documents for q, $P(i)$ is the precision at rank i, and MAP@k is the mean of AP@k over all queries [175,176]. MAP@k rewards retrieval sets that place relevant documents early and penalises late retrievals.
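A direct transcription of this definition (a sketch; queries with no relevant documents contribute 0, an assumption the formula leaves implicit):

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k: precision at each rank where a relevant doc occurs,
    normalised by the number of relevant documents for the query."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            hits += 1
            score += hits / i  # precision at rank i
    return score / len(relevant_ids) if relevant_ids else 0.0

def map_at_k(all_rankings, all_relevant, k):
    """MAP@k: mean of AP@k over all queries."""
    return sum(average_precision_at_k(r, rel, k)
               for r, rel in zip(all_rankings, all_relevant)) / len(all_rankings)
```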
Mean Reciprocal Rank (MRR@k) focusses solely on the rank of the first relevant document. For each query, it computes the reciprocal of the rank position of the first relevant hit (capped at k) and then averages over queries:
\mathrm{MRR}@k = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\min(\mathrm{rank}_q, k)}.
It is particularly informative when downstream tasks depend critically on the earliest relevant context, as in ODQA [60,176].
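A direct transcription of this capped formulation (a sketch; it assumes queries with no relevant document at all contribute 0, and note that a common alternative convention instead scores a query 0 whenever no relevant document appears within the top k):

```python
def mrr_at_k(all_rankings, all_relevant, k):
    """MRR@k: each query contributes 1 / min(rank of first relevant doc, k)."""
    total = 0.0
    for ranked, relevant in zip(all_rankings, all_relevant):
        # Rank (1-based) of the first relevant document, if any.
        rank = next((i for i, doc in enumerate(ranked, 1) if doc in relevant), None)
        total += 1.0 / min(rank, k) if rank is not None else 0.0
    return total / len(all_rankings)
```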
Normalized Discounted Cumulative Gain (nDCG@k) accommodates graded relevance by weighting each retrieved document gain by a logarithmic discount based on its position, then normalising by the ideal DCG. It is defined as
\mathrm{nDCG}@k = \frac{1}{\mathrm{IDCG}@k} \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)},
where $\mathrm{rel}_i$ is the relevance grade of the $i$-th document and IDCG@k is the maximum possible DCG@k [81,162]. nDCG@k is well suited to scenarios with multiple relevance levels or varying document importance.
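With graded relevance supplied as lists of gains (the grades of the retrieved documents in rank order, and the grades of all relevant documents for the ideal ordering), the exponential-gain nDCG above can be sketched as:

```python
import math

def ndcg_at_k(ranked_gains, ideal_gains, k):
    """nDCG@k with exponential gain (2^rel - 1) and log2 position discount."""
    def dcg(gains):
        return sum((2 ** rel - 1) / math.log2(i + 1)
                   for i, rel in enumerate(gains[:k], start=1))
    # IDCG: DCG of the best possible ordering of the relevance grades.
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0
```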
R-Precision sets the cutoff R equal to the total number of relevant documents for a query and computes precision at that rank:
\mathrm{R}\text{-}\mathrm{Precision} = \mathrm{Precision}@R.
Adapting the cut-off to the relevance count of each query, R-Precision offers a query-specific summary of ranking quality. It forms a core component of composite benchmarks (e.g., KILT) that jointly evaluate retrieval and generation [61,87].
Hit@k is a binary metric that indicates whether at least one relevant document appears within the top k positions; it is averaged over queries to produce a success rate [132]. Hit Success Ratio (HSR) similarly counts the proportion of queries that require external knowledge for which the retriever provides supporting evidence, highlighting the model’s dependence on the retrieved context [86].
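Hit@k reduces to an any-match test averaged over queries, as in this minimal sketch:

```python
def hit_at_k(all_rankings, all_relevant, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(any(doc in relevant for doc in ranked[:k])
               for ranked, relevant in zip(all_rankings, all_relevant))
    return hits / len(all_rankings)
```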
Beyond standard relevance metrics, some studies measure the model’s ability to decide whether retrieval is necessary (that is, the accuracy of retrieval abstention) or to withstand adversarial passages (adversarial success rate) [77,177]. These metrics inform selective retrieval policies and robustness evaluations.
Using a combination of these metrics (set-based, ranking-based, and hit-based), researchers obtain a multifaceted understanding of retrieval effectiveness. This rigour in evaluating the retriever component is critical to ensuring that RAG systems have reliable and comprehensive access to external knowledge.

5.3.3. Other Automated Metrics

In addition to standard metrics, a diverse set of other automated metrics has emerged to target specific facets of RAG that are not captured by general purpose measures. These include computational efficiency, robustness, bias, and domain- or task-specific criteria. Because each metric addresses a narrow aspect of system behaviour or relies on specialised evaluation procedures, they appear only occasionally in the RAG literature, typically in studies with unique experimental setups or domain constraints. Their limited adoption reflects both the implementation overhead and the context-specific validity of the measures.
  • Computational Efficiency
Latency quantifies the time to retrieve documents and generate text, often decomposed into retrieval time ( T r ), decision time ( T d ), and generation time ( T g ), with speedup (SU) defined as the relative reduction in total latency compared to a baseline of always retrieving [73,162]. Response Time measures the end-to-end delay from query submission to first token output, a critical factor in interactive and clinical settings [126,127]. These metrics are crucial for real-time applications, where user experience and operational feasibility depend on prompt responses. However, their computation depends on controlled hardware environments and precise logging, which limits cross-study comparability.
  • Robustness & Error Handling
Hallucination Rate tracks the frequency or density of fabricated content in generated responses, either as hallucinations per 100 words or as the proportion of faulty outputs [56,98,126,161]. Rejection Rate (Reject Rate) measures the system’s ability to refuse answers when the knowledge base is insufficient, thus avoiding hallucinations [159,177,178]. Success Rate evaluates the success of adversarial jailbreak attempts, reflecting the vulnerability under malicious prompts [179]. These metrics are indispensable for safety-critical domains (e.g., medicine, law), yet they demand rigorous annotation protocols or adversarial testing frameworks, constraining their routine use.
  • Contextual Bias
Contextual Bias measures the tendency of a model to adopt incorrect assumptions from a misleading context, even when its internal knowledge would suggest a correct response [126,180]. This metric surfaces subtle failure modes of RAG pipelines, particularly when retrieval yields noisy passages, but requires carefully crafted bias scenarios, which are rarely standardised.
  • Image- and Code-Specific Metrics
CIDEr & SPICE evaluate generated image captions by assessing consensus-based textual agreement or semantic propositional fidelity against human references [82,92,170]. Edit Similarity (ES) computes $1 - \mathrm{Lev}(\hat{Y}, Y) / \max(|\hat{Y}|, |Y|)$, where Lev is the Levenshtein distance, to quantify token-level similarity of code snippets [73,181]. Pass@k measures the proportion of code generation attempts that pass automated test suites within k trials [50,147]. CodeBLEU extends BLEU by incorporating abstract syntax tree and data-flow comparisons, capturing both syntactic and semantic correctness of code [167,181]. These task-specific metrics yield deep insights in their respective domains but lack generalisability: captioning and code generation each demand bespoke reference datasets, execution environments, or parser toolchains.
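ES and Pass@k can be sketched in a few lines (a minimal illustration over characters rather than code tokens; the Pass@k form below is the widely used unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$ computed from n samples of which c pass, one common reading of "within k trials"):

```python
import math

def edit_similarity(pred: str, ref: str) -> float:
    """ES = 1 - Levenshtein(pred, ref) / max(len(pred), len(ref))."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 1.0
    dp = list(range(n + 1))  # single-row Levenshtein DP
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return 1 - dp[n] / max(m, n)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from n samples, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1 - math.comb(n - c, k) / math.comb(n, k)
```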
  • Performance Comparison
Comparative Metrics quantify improvements over baseline systems (e.g., KRAGEN vs. BioGPT/OpenChat in biomedical QA) by aggregating multiple performance indicators into a single comparative score [71,148]. Although succinct, such composite measures often obscure which individual components drive gains and presuppose the availability of strong baselines in the target domain.
  • Discussion & Recommendations
These other automated metrics, while rarely applied in general RAG research, play a pivotal role in specialised studies by illuminating efficiency, safety and domain-specific quality attributes. Their sporadic use stems from (1) the high cost of bespoke dataset creation or annotation; (2) dependencies on hardware and execution environments; and (3) the lack of universally accepted standards for task-specific evaluation. To enhance comparability and encourage broader adoption, we recommend the following.
  • Modular Reporting: Package each specialised metric within containerised pipelines to facilitate deployment.
  • Benchmark Extensions: Propose extensions to popular RAG benchmarks (e.g., adding hallucination annotations to QA datasets).
  • Open-Source Toolkits: Contribute wrappers for less common metrics, such as ES and contextual bias, to public evaluation libraries.
By situating these metrics alongside standard automated measures in future studies, researchers can achieve a more holistic assessment of RAG systems without imposing prohibitive setup costs.

5.3.4. Human Evaluation Metrics

Human evaluation remains indispensable for assessing aspects of RAG that escape purely automatic measures. By soliciting judgments on dimensions such as correctness, relevance, fluency, and factuality, researchers gain insight into real-world performance and user impact [141,182].
  • Correctness & Accuracy
Accuracy gauges the degree to which generated outputs match expert-validated answers. In clinical RAG settings, evaluators verify whether model responses reflect consensus recommendations [183]. Legal RAG evaluations similarly require that each response be both factually correct and properly grounded in authoritative sources [98]. Educational chatbots assess ‘correctness’ using multipoint rating scales applied by subject-matter experts [66,142].
  • Relevance
Relevance measures how well the retrieved context or the generated text aligns with the user’s query. Human raters typically score summaries or answers on a binary or Likert scale for topical pertinence, grammatical coherence, and external information appropriateness [141]. In personalised RAG frameworks, relevance judgements of retrieved passages ensure that augmentation truly addresses user intent [166].
  • Hallucination & Groundedness
Hallucination metrics capture instances of fabricated or misattributed content. Annotators label responses as ‘Extrinsic’ (not supported by any input), ‘Intrinsic’ (incorrectly synthesised from input) or ‘Misgrounded’ (false citation) [98,172]. Human evaluation thus directly quantifies the model’s tendency to invent facts, a critical safety concern in high-stakes domains such as healthcare and law [89,140].
  • Factual Correctness & Consistency
Beyond binary correctness, human judges assess whether a response maintains internal consistency, avoids contradictions, and remains factually accurate throughout longer interactions [45,140]. This qualitative lens captures subtle semantic errors that are not detected by overlap metrics.
  • Comprehensiveness & Quality
Comprehensiveness evaluates depth of coverage: whether the generated text addresses all aspects of a query [131]. General quality scales (for example, 1 to 5 points) combine relevance, coherence, and absence of typos, resulting in a single interpretable score [168,184].
  • User-Centric Metrics
User Satisfaction is measured via post-interaction surveys; satisfaction scores reflect perceived usefulness and clarity [126,127].
System Usability Scale (SUS): a standardised questionnaire with 5-point items assesses accuracy, clarity, relevance, and ease of understanding [185].
Technology Acceptance Model (TAM): constructs such as perceived usefulness and ease of use are quantified through validated survey instruments, offering insight into the likelihood of adoption [59].
  • Annotation Protocols & Reliability
Most studies use three to five human annotators to rate system outputs against predefined criteria. Common protocols include Likert scales (3–5 points) to assess relevance, fluency, and factuality [141,182]; binary judgements (yes/no), particularly for retrieval relevance or groundedness [98,166]; comparative judgements (win/tie/loss) for head-to-head model comparisons [119]; and error classification, in which incorrect outputs are sampled and error types are categorised (e.g., reasoning versus retrieval failures) [68]. To support reliable annotation, studies typically provide clear guidelines with worked examples for each rating level, pilot the scheme on a small subset and refine the instructions, and report inter-annotator agreement (e.g., Cohen’s κ ), including both raw agreement and chance-corrected statistics [98].
  • Strengths, Limitations & Recommendations
Human-judged metrics capture nuanced aspects of RAG output, such as hallucination, conversational coherence, and user trust, that automated measures often miss. However, they are time-intensive, costly, and susceptible to annotator bias, with inter-annotator agreement frequently below κ = 0.7 , reflecting subjectivity in complex judgements [98].
To maximise rigour and reproducibility, evaluations should combine measures spanning core dimensions (e.g., accuracy, relevance, hallucination, comprehensiveness, and satisfaction), report annotation scales, rater qualifications, and agreement statistics transparently, and consider hybrid designs that supplement expert judgements with carefully prompted LLM-as-judge procedures to increase scale while retaining depth. Making annotation guidelines and code openly available further facilitates external replication and community benchmarking.
When protocols are defined a priori, each metric is grounded in previous work and reliability is reported, the human evaluation section can more convincingly demonstrate both the real-world viability and the limitations of an RAG system.

5.3.5. LLM-As-Judge Metrics

Recent advances in evaluation methodologies have shifted toward the use of LLMs themselves as automated judges of generated content. Rather than relying solely on surface-level overlap or costly human annotation, LLM-as-judge approaches prompt a high-capacity model, such as GPT-4, to assess outputs along dimensions such as correctness, relevance, coherence, and safety.
  • Accuracy via Advanced LLM Verification
One common formulation applies an LLM (e.g., text-davinci-003) to re-evaluate model outputs against ground-truth answers, flagging semantically correct yet lexically divergent generations as accurate [49]. This “LLM-verified accuracy” provides a more robust correctness estimate than exact-match metrics, particularly in question-answering settings where paraphrasing is common.
  • GPT-Based Correctness and Quality Ratings
A suite of studies instruct ChatGPT or GPT-4 to assign binary or scalar judgements to outputs:
Binary correctness: ChatGPT classifies each response as correct or incorrect, yielding a proportion-correct score [177].
Quality scales: Responses are rated on a 1–10 scale for overall quality by ChatGPT [177] and GPT-4 across multiple facets (relevance, clarity, depth) in fully automated scoring systems [66].
Sentiment assessment: ChatGPT assesses the polarity of model outputs (positive vs. negative) to gauge tone and user experience [177].
  • Benchmarking Against GPT-4 Judgements
To validate internal model evaluations, some works compare their own LLM’s judgements with those of GPT-4. For example, GPT-4 is used as a reference judge for self-knowledge, passage relevance, and question-decomposition tasks, establishing a reliability benchmark [182].
  • Harmfulness and Safety Classification
To ensure ethical outputs, researchers prompt GPT-4 to detect and classify harmful or toxic content, computing the proportion of harmful responses or the worst-case toxicity over multiple samples [130]. This approach complements traditional toxicity metrics by leveraging the LLM’s contextual understanding of offensiveness.
  • LLM-Fact-Checker Chains
Leveraging frameworks such as LangChain, an LLM (e.g., gpt-3.5-turbo) is embedded in a fact-checking pipeline: it cross-verifies chatbot responses against course content or reference materials and generates confusion-matrix statistics (accuracy, precision, sensitivity, specificity) to automate what was formerly manual evaluation [59].
  • G-EVAL: Comprehensive LLM-Judged Evaluation
G-EVAL uses GPT-4 to score generated text on coherence, consistency and fluency using a 1–5 rubric, outperforming traditional overlap metrics in correlating with human judgements [186]. It has been used to evaluate the generation of domain-specific reports, such as flood incident summaries, demonstrating superior alignment with expert evaluators [186].
  • Semantic Accuracy via LLM Instruction Models
By prompting gpt-3.5-turbo-instruct to compare generated answers semantically against ground truths, “semantic accuracy” metrics capture meaning preservation beyond exact tokens, addressing limitations of classical exact-match scores [142].
  • Discussion & Recommendations
LLM-as-judge metrics offer scalable, semantically rich evaluation but inherit potential biases and prompt-sensitivity from their host models. To mitigate these issues, we recommend calibrating LLM prompts against a small human-annotated validation set, reporting multiple perspectives (e.g., combining binary correctness with a scalar quality score) and disclosing prompt templates and model versions to ensure reproducibility. Adopting these practices can harness the efficiency of LLM-judged evaluation while maintaining rigorous, transparent assessment standards.

5.3.6. Automated Frameworks

Automated evaluation frameworks are pivotal for assessing RAG systems by mitigating subjectivity, scalability issues, and bias. Two notable systems, ARES [187] and RAGAS [188], concentrate on three core metrics: context relevance, answer faithfulness, and answer relevance.
ARES adopts a quantitative approach, using fine-tuned language models and Kendall’s τ to align automated scores with human judgements [187]. This method delivers high precision and nuanced insights into response fidelity; however, its reliance on extensive annotated data may restrict scalability. In contrast, RAGAS employs a reference-free strategy that uses cosine similarity to measure semantic relationships between queries, retrieved contexts, and generated responses [188]. Although this technique improves objectivity and accelerates evaluations, it is more sensitive to prompt variations, which can reduce consistency.
ARES and RAGAS thus represent two contrasting yet complementary approaches to RAG system evaluation. ARES offers detailed, human-aligned assessment but can be hampered by scalability issues due to its dependency on annotated data. Conversely, RAGAS provides operational efficiency through automated semantic similarity measurements, albeit with potential variability due to prompt sensitivity. This juxtaposition highlights the trade-off between detailed, qualitative insights and streamlined quantitative evaluation, prompting critical questions about whether future frameworks might integrate the strengths of both methods to achieve a balanced, robust evaluation strategy.
It is important to note that automated evaluation frameworks relying on large language models are not immune to inherent biases, which can subtly skew outcomes and misrepresent true system performance. To mitigate these issues, future evaluation strategies could benefit from hybrid approaches that integrate LLM-based assessments with calibrated human oversight, balancing the objectivity and scalability of automated methods with the nuanced insights of human evaluators.
Practical implications of these frameworks include guiding the design of adaptable RAG systems that lower annotation costs while upholding rigorous evaluation standards. Future research may further integrate qualitative elements and refine metrics to address emerging concerns, as discussed in Section 5.3.4. Ultimately, merging ARES’s detailed human insight with RAGAS’s operational efficiency may offer the most balanced strategy for advancing RAG system evaluation.
For practitioners, a compact, defensible rubric emerges. Track at least: (i) one retrieval metric prioritising early relevance (e.g., MRR@k or nDCG@k) and one coverage metric (Recall@k); (ii) one answer-level metric tolerant to paraphrase ( F 1 or BERTScore) and one strict metric when exact strings matter (EM); (iii) a grounding/faithfulness signal (claim support or citation correctness, plus hallucination rate or abstention rate); (iv) user-facing quality via either small-sample human ratings (correctness, relevance, comprehensiveness, clarity) or a validated LLM-as-judge replica with spot-checked agreement; and (v) efficiency and reliability (end-to-end latency, tokens per answer, retrievals-per-answer, and a minimal robustness probe against distractors/adversarial context). Reporting these five pillars makes results both comparable and actionable.
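As a concrete sketch, the five pillars can be logged as a single structured report per evaluation run (the field names and numeric values below are illustrative placeholders, not measurements from any study):

```python
# Hypothetical five-pillar evaluation report; every value is a placeholder.
report = {
    "retrieval":  {"mrr@10": 0.62, "recall@10": 0.81},
    "answer":     {"f1": 0.74, "exact_match": 0.55},
    "grounding":  {"claim_support_rate": 0.88, "hallucination_rate": 0.04,
                   "abstention_rate": 0.07},
    "quality":    {"judge": "llm-as-judge", "mean_rating": 4.1,
                   "human_spot_check_agreement": 0.83},
    "efficiency": {"p95_latency_ms": 850, "tokens_per_answer": 1200,
                   "retrievals_per_answer": 1.4},
}
```

Versioning such a report alongside prompts and judge configurations makes results comparable across runs and systems.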
At the same time, the rubric should be read as an instrumentation layer rather than as a guarantee of system behaviour. It foregrounds dimensions that are comparatively easy to observe, such as answer correctness, groundedness, robustness to distractors, and simple security checks. However, it leaves other aspects of deployment only partially covered. For example, the suggested indicators do not attempt to quantify downstream harms such as unequal treatment of different user groups, leakage of sensitive information, or failure to meet domain-specific regulatory obligations, nor do they prescribe how different kinds of failure should be prioritised in clinical, financial, or legal settings. The design also implicitly assumes that teams can run human assessment, LLM-based judging, robustness experiments, and periodic security probes on a regular basis; many production systems or smaller organisations will instead need sampled evaluations, proxy signals, or coarse-grained dashboards, and the framework does not yet spell out principled reductions for those conditions. Finally, the rubric is largely shaped by the predominantly English, text-centric QA and summarisation workloads represented in current RAG benchmarks, with only indirect coverage of interactive agents, long-horizon conversational use, or multimodal pipelines that mix text with images, code, tables, or time series. As models, datasets, and governance expectations evolve, both the choice of metrics and their relative emphasis will need to be revisited, so the proposed scheme should be regarded as a structured starting point rather than a definitive standard.
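As a concrete illustration of pillars (i) and (ii) of this rubric, the retrieval and answer-level scores can be computed with a few lines of code. The functions below are a minimal sketch; the function names and whitespace tokenisation are our own simplifications, and production evaluations typically also normalise punctuation and articles as in SQuAD-style scoring:

```python
from collections import Counter

def recall_at_k(retrieved, relevant, k):
    """Coverage: fraction of relevant documents found in the top-k results."""
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr_at_k(retrieved, relevant, k):
    """Early relevance: reciprocal rank of the first relevant hit in top-k."""
    for rank, d in enumerate(retrieved[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def exact_match(pred, gold):
    """Strict answer-level metric for cases where exact strings matter."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Paraphrase-tolerant answer-level metric: token-overlap F1."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Pillars (iii) to (v), covering grounding, user-facing quality, and efficiency, depend on system-level instrumentation (verifier labels, judge ratings, latency and token logs) rather than pure metric functions, and are therefore tracked through logging alongside these scores.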

5.3.7. Holistic Evaluation of RAG Benchmarks

RAG benchmarks, although diverse in their origins and target domains, collectively map out a multidimensional landscape of model performance. At the core, each benchmark isolates particular capabilities—be it resilience to noise, financial forecasting acuity, medical question precision, multi-hop reasoning, or CRUD-style text operations—and in doing so, they both complement and challenge one another.
  • Connecting the Four Pillars of RGB to Broader RAG Metrics
The Retrieval-Augmented Generation Benchmark (RGB) explicitly dissects RAG capability into noise robustness, negative rejection, information integration, and counterfactual robustness [178]. These four axes are not arbitrary: they represent the basic dilemma of “when and how to trust retrieved context.” Noise robustness measures whether a model can sift signal from distractors; this is a requirement shared by nearly all other RAG tasks, since any retrieval pipeline may surface irrelevant documents [178]. Negative rejection, on the other hand, examines the model’s restraint: its ability to say “I don’t know” rather than hallucinate. This restraint is critical in high-stakes domains such as medicine, where wrong answers can mislead practitioners [65]. Information integration overlaps naturally with multi-hop retrieval and summarization: it probes the model’s capacity to aggregate evidence from multiple sources, akin to what MultiHop-RAG quantifies through MAP@k and answer accuracy [176]. Finally, counterfactual robustness examines error detection and correction—an echo of CRUD-RAG’s “Update” task, which tests factual revision in generated text [189].
  • Quantitative Meets Qualitative: Trade-Offs in Evaluation
While RGB relies primarily on exact match accuracy and rejection/error rates, AlphaFin extends evaluation to financial performance metrics: annualized rate of return (ARR), Sharpe ratio, drawdowns, along with traditional language metrics like ROUGE and human preferences [90]. This duality highlights a fundamental trade-off: quantitative metrics (ARR, MAP@k, accuracy) offer objectivity and comparability, yet may miss subtleties of fluency, coherence, or interpretability that qualitative human studies and chain-of-thought evaluations capture. For example, a model that achieves high ARR by blindly following market trends may still produce explanations that fail regulatory standards or mislead users; here, GPT-4 preference judgements in financial Q&A illuminate whether the model’s reasoning is human-aligned [90]. In contrast, purely qualitative assessments can be subjective and difficult to standardise in large test beds such as the 7663 medical questions of MIRAGE [65].
  • Domain-Specific Demands and Broader Trends
The emergence of domain-tailored benchmarks, as seen with AlphaFin and MIRAGE, reflects a broader shift in RAG research: one-size-fits-all evaluation is giving way to specialised suites that capture domain nuances. In medicine, zero-shot versus retrieval-augmented evaluations in MIRAGE reveal that RAG can increase accuracy by up to 18%, but also surface ‘lost in the middle’ issues when too much context overwhelms the model [65]. MultiHop-RAG similarly shows that retrieval itself remains a bottleneck: even GPT-4 reaches only 56% accuracy with real retrieval versus 89% with ground-truth contexts [176]. These findings spark questions: How might improvements in retriever architectures reorder the current performance hierarchy? And can domain-agnostic LLMs ever match domain-specific ones once retrieval pipelines are fully optimised?
  • Methodological Reflections: Why These Metrics?
Each benchmark’s metric choices trace back to its core use case. RGB’s rejection-rate metric stems from the need to probe hallucination behaviour in open-domain QA, while AlphaFin’s ARR and Sharpe ratio ground the evaluation in financial risk-reward trade-offs [90,178]. MIRAGE’s reliance on established medical QA datasets (MMLU-Med, MedQA-US, BioASQ) ensures comparability with previous work, but by layering retrieval onto zero-shot and multichoice settings, it exposes where medical LLMs overuse or underuse external evidence [65]. MultiHop-RAG’s combination of retrieval metrics (MAP@k, MRR@k) and generation metrics (accuracy) mirrors the two-stage reality of RAG pipelines, allowing separate diagnostics for the retriever and the generator [176]. CRUD-RAG’s taxonomy of Create, Read, Update, and Delete tasks underscores the need for full-lifecycle assessment of text operations, not just question answering [189].
  • Practical Implications and Future Directions
In practice, these benchmarks guide system design: a retrieval pipeline optimised for MAP@10 may not yield the best error correction performance in counterfactual settings; a model fine-tuned for ROUGE in financial summaries could underperform in drawdown mitigation metrics. Thus, practitioners face calibration challenges: Which trade-off between retrieval depth and generative precision aligns best with their application’s risk profile?
Looking ahead, several avenues merit exploration. First, integrating qualitative fluency measures directly into quantitative benchmarks could bridge the gap between human-centric evaluation and automated metrics. Second, extending benchmarks to multilingual or cross-modal contexts (for example, combining text with tables, charts, or code) would reflect real-world uses. Finally, as interactive RAG agents grow, dynamic benchmarks that simulate user feedback loops will be critical to measure adaptability and continuous learning.
RGB, AlphaFin, MIRAGE, MultiHop-RAG, and CRUD-RAG form a tapestry of complementary benchmarks, each covering a slice of the performance spectrum: signal filtering, domain-specific reasoning, error detection, evidence synthesis, and text lifecycle operations. Their varied metrics—accuracy, rejection rates, ARR, ROUGE, MAP@k, Sharpe ratios—highlight that no single number suffices. A holistic evaluation demands a suite of metrics that reflect both quantitative rigour and qualitative nuance. As RAG systems advance, our benchmarks must evolve in tandem, posing ever more challenging questions: Can we craft unified metrics that capture trustworthiness, utility, and user alignment in one framework? Only through such integrative efforts can the next generation of RAG applications realise their full potential.

5.3.8. Datasets

In our systematic survey, we find that researchers have used approximately 343 unique datasets to evaluate RAG systems, illustrating the multifaceted nature of performance assessment. Open-domain resources such as Wikipedia [1], Natural Questions [190], and MS MARCO [191] provide a baseline, particularly for question-answering tasks. These datasets excel in benchmarking fluency and general comprehension, but may not fully represent specialised applications. In contrast, domain-specific collections, ranging from legal (e.g., ALQA [192], LEDGAR [193]) to biomedical sources (e.g., CORD-19 [194], KGRAGQA [195]), offer in-depth evaluation in high-stakes contexts, although they often suffer from inconsistent preprocessing and versioning practices. Table A3 summarises the content description and intended use of these datasets.
Multi-hop QA sets, including HotPotQA [196] and 2WikiMultihopQA [197], challenge systems with complex reasoning tasks, highlighting strengths in multistep inference, while also revealing limitations in current methodologies. Similarly, multimodal and code-centric corpora, such as COCO [198] for image–text pairs and CodeSearchNet [199] for code-centric evaluations, extend performance evaluation beyond traditional text, addressing broader application domains, yet introducing variability due to differences in data segmentation and annotation standards.
This diversity reflects both advantages and trade-offs: although open-domain datasets support benchmark consistency, specialised datasets provide critical insight into domain-specific challenges [1,190,191,193]. The absence of standardised dataset preparation, ranging from segmentation to versioning, poses a significant methodological challenge and raises questions about the reproducibility and comparability of RAG evaluations. For example, how might emerging frameworks for dataset processing and standardised evaluation metrics improve consistency between studies?
The interplay among these datasets underscores a broader trend toward holistic, multidimensional evaluation strategies in the development of the RAG system. By integrating both quantitative benchmarks and qualitative assessments, researchers can better capture the strengths and limitations of current models, ultimately guiding future innovations and establishing more robust operational standards.

5.4. What Are the Key Challenges and Limitations Associated with Retrieval-Augmented Generation Techniques?

RAG often fails for reasons that span both retrieval and generation. Across our survey corpus, five challenge families recur: evidence quality (noise, stale or mislinked data, domain shift, multimodal misalignment); pipeline coupling (error cascades spanning retrieval then reranking then generation and memory); resource budgets (latency, tokens, k, index size); LLM constraints (context windowing, brittleness, bias/hallucinations); and security (poisoned corpora, prompt/context injection, policy evasion). These are interdependent: small retrieval imperfections can be amplified by the generator; tighter budgets exacerbate noise sensitivity and hallucination; and any residual weakness enlarges the adversarial surface.

5.4.1. Noise, Heterogeneity, and Multimodal Alignment

RAG pipelines are only as good as their inputs, yet most inputs are noisy and heterogeneous. Vision-to-Language transformers compress complex scenes into terse captions, suppressing spatial clues such as gaze or depth [86]. Code-Property graphs balloon super-linearly with project size, so aggressive pruning saves space but can excise rare, security-critical constructs [51]. Selective densification, reinjecting previously filtered snippets when retrieval confidence dips, offers a middle ground, although it still inflates indices [64,125].
Noise also lurks in hybrid retrieval itself. Dense vectors, sparse keywords, and rule filters score on incompatible scales; naive normalisation swings between overrecall and underrecall, while cross-encoders that fix the problem add 2–5 times the latency [174,200]. Learnable weighting gates are promising but lack cross-domain evidence [153]. Multimodal encoders introduce another layer of fragility: CLIP-style models often suffer “semantic bleeding”, where irrelevant visual regions influence text similarity, a serious risk in radiology and surgical robot logs [52,82,125,173]. Fine-grained alignment losses mitigate leakage but add both milliseconds and supervision cost. Lightweight validation schemes, such as attention entropy checks or cross-view consistency regularisers, offer protection at marginal run-time cost and do not require dense pixel labels [125].
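To make the scale-mismatch problem concrete, the sketch below shows the kind of naive min–max normalisation and fixed-weight fusion that the cited work finds brittle; the hand-set weights stand in for the learnable gates discussed above, and all names and weight values are illustrative:

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to [0, 1]; a constant list maps to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}

def fuse(dense, sparse, w_dense=0.6, w_sparse=0.4):
    """Weighted fusion of normalised dense and sparse retrieval scores.
    Fixed weights are exactly the fragile choice the text criticises:
    a learnable gate would set them per query or per domain."""
    nd, ns = min_max(dense), min_max(sparse)
    docs = set(nd) | set(ns)
    fused = {d: w_dense * nd.get(d, 0.0) + w_sparse * ns.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Because min–max rescaling depends on whatever happens to be in each candidate list, a single outlier score can compress all other scores toward zero, which is one mechanism behind the overrecall/underrecall swings noted above.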
Finally, knowledge graph retrieval excels in multi-hop reasoning, yet depends on noisy entity linking and heuristic pruning; over-pruning deletes long-tail nodes, under-pruning explodes memory, and classic graph metrics correlate only weakly with downstream QA [57,60,153]. What the field needs is learnable fusion frameworks that expose per-channel uncertainty and graph-aware benchmarks that reveal the real cost-benefit envelope of noise mitigation strategies.

5.4.2. Domain Shift, Dataset Alignment, and Generalisation

Our focus now shifts to distributional robustness—RAG models that shine in one domain often stumble in another. Systems tuned to PubMed outperform BM25 on biomedical queries but falter in legal corpora without costly retraining [93,133]. Hybrid pipelines that anchor language-agnostic schemas with thin domain-specific rules travel better, but add engineering overhead and still require careful calibration when knowledge is fragmented across disconnected sources [182].
Repository freshness compounds the problem. Stale or erroneous material propagates directly into answers, a high-stakes liability in finance and medicine [160,173]. Index refreshes mitigate drift but demand labour-intensive validation pipelines and may introduce their own lag [71]. Worse, most evaluation sets lean heavily on English-language Wikipedia, masking specialist failure modes and inflating scores through train–test overlap [70,200]. Corpus choice is thus decisive: biomedical encoders dominate on PubMed but misfire elsewhere [65], and multilingual retrieval remains hamstrung by scarce aligned data and inconsistent terminology [69]. Adaptive “retrieval triggers” that fire the retriever only when the generator signals high uncertainty appear attractive; yet, when they misfire, they either waste compute or omit indispensable evidence [158].
Seemingly mundane hyperparameters—chunk size, hierarchical fragmentation strategy, the number of documents k to return, and undocumented caching policies—can shift accuracy–latency curves by double-digit margins: small windows fracture discourse; large ones bloat latency; and inconsistent choices of k thwart reproducibility [79,128,153,168,201]. Closing these gaps will require continuous validation pipelines and unified cross-domain, multilingual testbeds that expose real-world brittleness while tracking accuracy–latency trade-offs.
Within this landscape, several patterns are consistent. Retrieval quality is necessary but not sufficient. For example, systems with high Recall@k may still yield unfaithful answers without explicit grounding checks. Many errors present as interface failures: ranking mistakes, over-long or poorly filtered context, or stale memory entries that bias the LLM. Budget awareness matters as much as model choice; practical reliability tracks quantities such as retrievals-per-answer, tokens-per-answer, and index refresh cadence. Security remains first class in RAG because “trusted context” can bypass guardrails.

5.4.3. Modular Pipelines and Error Cascades

Even when knowledge is fresh and well aligned, architectural glue can fail. In this section, we focus on interfaces that link retrieval to generation. Splitting retrieval, reranking, and generation curbs hallucination but creates brittle processing chains. A misranked passage in the first stage can irreversibly bias the generator, and although deep cross-encoders lift ranking fidelity, their compute cost still forces approximate first-pass filters whose scores are tuned ad hoc [87,140,163,202].
Iterative and memory-augmented pipelines add another wrinkle. External memories curb repetition but introduce staleness and snowballing: cached errors are re-retrieved in later turns [71]. Content-based decay, which weights cache entries by both recency and reuse, cuts latency by up to 40% without hurting precision, yet evidence remains limited to small-scale experiments [201]. Ultimately, combining interface patterns that expose calibrated model confidence with uncertainty-triggered safeguards, for example, probability/entropy thresholds that proactively invoke retrieval, verification, abstention, or rollback, can prevent error cascades from taking hold [45,203].
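A minimal sketch of such an uncertainty-triggered safeguard is shown below, assuming a generator that can report its mean token entropy; the `generate` and `retrieve` callables, the threshold value, and the round limit are illustrative placeholders rather than any system described in the cited work:

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer_with_safeguards(generate, retrieve, query,
                           entropy_threshold=2.0, max_rounds=2):
    """Uncertainty-triggered loop: while the generator's mean token entropy
    exceeds the threshold, fetch more evidence; abstain (return None) if
    uncertainty never drops.  `generate(query, context)` must return
    (answer, mean_entropy); `retrieve(query)` returns extra passages."""
    context = []
    for _ in range(max_rounds):
        answer, mean_entropy = generate(query, context)
        if mean_entropy <= entropy_threshold:
            return answer                    # confident enough: accept
        context.extend(retrieve(query))      # escalate retrieval instead
    return None                              # abstain / fall back safely
```

The abstention branch is what breaks the cascade: rather than letting a low-confidence answer be cached and re-retrieved in later turns, the system returns control to a verification or rollback path.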

5.4.4. Large-Language-Model Constraints and Safety Risks

Next, we examine the constraints of the generator (the LLM) that produces user-facing text. Commercial LLM APIs deliver strong performance but impose per-token fees, usage limits and a requirement for internet connectivity. Open-weight models avoid vendor lock-in and can run locally, offering broad flexibility, but they typically require substantial hardware and place the burden of optimisation and tuning on the deployer [49,174]. Fixed context windows—often four thousand tokens or fewer—truncate multi-document evidence, forcing lossy chunking that undercuts retrieval depth; long-context variants help, but do not fully restore cross-passage reasoning [170,171].
Bias, toxicity, and hallucinations remain endemic. Encoding a user’s information need in only a few tokens is brittle, and attempts to map that intent into structured formats (e.g., JSON) often break under domain drift [43,144,145]. Automatically generated search strings show the same fragility: ill-formed queries invite off-topic retrieval and can launch an irrelevant evidence cascade [45]. Skewed pre-training corpora, meanwhile, inject demographic bias and toxic completions; retrieval softens but does not eliminate hallucination, and lapses are especially hazardous in medicine [78,89,150]. Prompt design does not offer a silver bullet: minor syntactic edits shift coherence and factuality, while adversarial prompts can bypass guardrails or surface-corrupted evidence [57,61,154,179]. Progress therefore hinges on bias- and hallucination-aware losses, adversarial-prompt test suites, and extended-context architectures that enlarge windows at a sustainable cost.

5.4.5. Security Threats in Retrieval-Augmented Generation

Even the best-engineered and safest pipelines remain vulnerable to deliberate attack, so finally we consider the RAG security landscape. The same external knowledge that makes RAG systems powerful also opens up a new attack surface: the retrieval corpus itself. Because the language model is trained to trust whatever the retriever returns, even a single poisoned document or a carefully crafted query can steer generation, violate safety policies, or leak private data. Recent work exposes three broad threat families: (i) corpus-poisoning back-doors, (ii) data-exfiltration and privacy attacks, and (iii) jailbreak and policy-evasion triggers. Each of these exploits the loose coupling between the retriever and generator.
AGENTPOISON [74] and Phantom [204] show that an attacker needs to tamper with fewer than 0.1% of corpus items—sometimes only one passage—to create a back-door that fires when a secret trigger word appears. A constrained trigger optimisation maps those queries to a compact, unique region of the embedding space, guaranteeing retrieval while remaining stealthy (low perplexity, robust to paraphrase). The result is alarming: across six dense retrievers and multiple LLMs, retrieval success exceeds 80%, and end-to-end malicious action rates sit around 60% with virtually no drop in benign accuracy. These findings underline a systemic weakness: current RAG deployments rarely authenticate or provenance-stamp the documents they ingest, so “sleeper” passages can lie dormant until the attacker issues the right query.
BadRAG [177] extends the idea to content-only poisoning: its COP/ACOP/MCOP techniques craft passages that are only retrieved under specific trigger conditions and then bias the generated output (Alignment-as-an-Attack, Selective-Fact-as-an-Attack). With as few as ten poisoned passages, the framework achieves a 98% trigger-retrieval rate and slashes GPT-4 accuracy from 92% to 19%. Crucially, attacks bypass naive defences such as perplexity filters or keyword blacklists and can even nudge sentiment or political stance without overtly toxic text, highlighting how difficult covert bias detection will be once adversaries understand retrieval scoring.
A different axis of vulnerability is privacy. “Follow My Instruction and Spill the Beans” [205] demonstrates that simply appending a malicious system or user prompt can coerce instruction-tuned models to copy verbatim from their private datastores. Across nine open-source LLMs and 25 production GPTs, the leakage success hit 100% in at most two queries; larger models leaked >70 BLEU points of text. Leakage worsens with coarse, semantically coherent chunks and when prompts are injected at the start or end of the context, painting a clear blueprint for would-be attackers. Mitigations such as PINE (Position-bias Elimination) and safety-aware system prompts reduce but do not eliminate the reconstruction rate, signalling that stronger retrieval-side controls are required.
Pandora [179] and Phantom [204] move beyond bias or leakage to full policy evasion. By injecting adversarial content that the retriever dutifully surfaces, the attacker sidesteps the usual guard-rail prompt hierarchy; GPT-4, normally resilient to direct jailbreaks, yields prohibited outputs in 35% of cases once the supporting evidence comes from a poisoned corpus. Because the unsafe text reaches the generator as “ground truth”, refusal classifiers often let it pass. These results expose an uncomfortable asymmetry: alignment layers supervise prompts, yet poisoned retrieval arrives as “context” and therefore inherits implicit trust.
In practice, commonly deployed defences provide only limited protection. Perplexity-based filtering and query rephrasing reduce AgentPoison’s end-to-end success by at most single-digit percentage points in some tasks (e.g., 9.6 percentage points and 6.8 percentage points in Agent-Driver), but produce no reduction—and sometimes an increase—in others (ReAct-StrategyQA). Moreover, AgentPoison’s triggers remain low-perplexity and thus difficult to flag [74,177]. Query rephrasing or majority vote reranking is similarly ineffective because trigger optimisation tends to cluster poisoned queries tightly in embedding space; paraphrases remain within the backdoor region. Safety prompting and refusal classifiers cannot, in general, distinguish benign evidence from adversarially retrieved content, and therefore authorise harmful completions [179]. Blacklisting triggers is also brittle: Phantom shows that an unseen synonym can reactivate the attack, and adversaries can optimise entirely new token sequences that were not present at defence time [204].
These limitations motivate a set of complementary mitigations. Strengthening corpus provenance and attestation is a priority: practical mechanisms to sign, version, and audit documents in large-scale vector stores remain scarce, but append-only logs based on Merkle trees, together with proofs from trusted execution environments, could make retroactive poisoning detectable. Retrieval-time anomaly detection also merits attention; distance-based or density-ratio detectors in embedding space may identify outlier triggers, provided they operate at millisecond latency and resist adaptive manipulation. A further avenue is joint retriever–generator training: current pipelines typically “freeze” the retriever at deployment; coupling the retriever’s gradients to downstream safety losses may instead lead the system to unlearn reliance on poisoned sources. In parallel, refusal mechanisms should assess the provenance of retrieved spans—not only the prompt—so that unsafe evidence is withheld before it reaches the LLM. Continued progress will depend on rigorous benchmarking, since most leaderboards emphasise hallucination and factuality rather than integrity; a standard suite that measures attack-specific metrics (e.g., retrieval attack success rate, end-to-end ASR under transfer (ASR-t), and drift in benign accuracy) would enable systematic evaluation.
Security threats in RAG are no longer theoretical. With a handful of poisoned passages or a single prompt injection, adversaries can bias, leak, or jailbreak state-of-the-art systems while evading current defences. The community must therefore treat retrieval, and by extension the knowledge base, as a first-class security boundary, on a par with the language model itself.
A defensible, low-overhead baseline therefore adds a retrieval-abstention path (e.g., entropy or uncertainty threshold) and reports rejection rate; incorporates a lightweight listwise or RRF reranker; controls k with a token budgeter (trim/merge/compress) while logging tokens-per-answer; runs a small verifier to label claim support (supported/partial/unsupported) and surface citations; refreshes indices on a fixed cadence and tracks dataset freshness; enables corpus provenance logging (ID, source, version) while excluding untrusted sources from the “trusted” index; and publishes a system card with latency, retrievals-per-answer, and a minimal robustness probe (distractors, adversarial passages).
These choices carry limits and conditions. Abstention improves safety but lowers coverage; reranking improves precision but adds milliseconds; compression and k control reduce cost but risk “lost-in-the-middle”; verifiers and LLM-as-judge can inherit bias; provenance requires process and infrastructure; adversarial defences degrade under adaptive attackers; and cost or robustness numbers are not comparable without disclosures on hardware and prompt/versioning.
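Two components of this baseline, the reciprocal-rank-fusion (RRF) reranker and the token budgeter, are simple enough to sketch directly; the constant k = 60 follows common RRF practice, and the whitespace token counter is a stand-in for a real tokeniser:

```python
def rrf(rankings, k=60):
    """Reciprocal-rank fusion: merge several ranked lists of document ids.
    Rank-based scoring sidesteps the incompatible-score-scale problem."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def budget_context(passages, token_budget,
                   count_tokens=lambda p: len(p.split())):
    """Greedy token budgeter: keep top-ranked passages until the budget is
    exhausted, returning the kept passages and tokens used (for logging
    tokens-per-answer, as recommended above)."""
    kept, used = [], 0
    for p in passages:
        cost = count_tokens(p)
        if used + cost > token_budget:
            break
        kept.append(p)
        used += cost
    return kept, used
```

Because RRF operates on ranks rather than raw scores, it needs no per-channel normalisation, which is precisely why it is attractive as a lightweight default before investing in learned rerankers.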

5.4.6. Synthesis and Outlook

Progress hinges on co-design: latency-aware scheduling across retrieval and generation; benchmarks that jointly score robustness to noise, distribution shift, and security; extended-context models balanced by adaptive retrieval depth; and probabilistic defences that propagate calibrated uncertainty end-to-end. Tackling these dependencies together will yield RAG systems that are efficient, reliable, and resilient amid rapidly evolving knowledge and threats.
  • Concluding Remarks
These findings argue for budget-aware, provenance-aware RAG systems: retrieval depth and tool use should be governed by explicit policies; corpora must be signed and auditable; and reports should pair accuracy with latency, energy, and robustness. Standardised, multilingual and multimodal benchmarks that integrate retrieval and generation quality with safety and cost will enable more meaningful comparisons and accelerate progress from prototypes to dependable, domain-ready RAG.

6. Future Work

In light of the findings associated with RQ1–RQ4, this review identifies several interconnected and strategically important avenues to advance research on RAG.
A first priority concerns the development of holistic and standardised evaluation frameworks capable of jointly assessing retrieval quality, answer correctness, grounding, efficiency, robustness, and system integrity. Contemporary evaluations remain dominated by surface-level task metrics (e.g., EM/ F 1 , BLEU/ROUGE) supplemented by limited retrieval diagnostics, while cost, latency and safety are often reported only informally. Future benchmark suites should therefore: (i) systematically couple answer-level scores with retrieval metrics such as Recall@k, MRR@k, and nDCG@k; (ii) incorporate explicit measures of groundedness, hallucination, abstention behaviour, and lightweight robustness probes capable of revealing susceptibility to distractors or adversarially constructed contexts; and (iii) mandate reporting of system-level indicators, including tokens-per-answer, retrievals-per-answer, latency, and associated monetary or energy expenditure. Extending these benchmarks to multilingual and multimodal settings, where quality, efficiency, and integrity must be co-optimised, would more closely mirror the demands of real-world deployments.
A second research direction arises from the transition from monolithic, single-pass pipelines toward modular and policy-driven RAG architectures. This shift highlights the need for budget-aware and uncertainty-aware retrieval policies that explicitly co-design retrieval depth alongside generation. Although recent systems demonstrate that selective retrieval, re-ranking, and context compression can retain most of the performance benefits of dense or hybrid RAG while substantially reducing token and compute budgets, these strategies remain unevenly evaluated and loosely formalised. Future work should treat retrieval depth, tool usage, and context length as resources to be optimised: for example, by learning policies that trigger retrieval or abstention only under calibrated uncertainty, allocate fixed budgets across pre- and post-retrieval stages, and dynamically adapt k to task difficulty and latency constraints. Integrating lightweight verification and re-ranking into these policies would enable systems to optimise what is retrieved and what is forwarded, thereby transforming accuracy, faithfulness, and latency into explicit control parameters rather than incidental outcomes. Comparative analyses across hardware platforms and cloud providers would further clarify the operational consequences of architectural design choices.
A third priority concerns the still-narrow empirical evidence base for RAG, which remains disproportionately concentrated in English open-domain QA and a limited set of high-resource, high-signal domains such as software engineering and clinical medicine. Progress toward reliable domain-sensitive systems will require sustained work on domain-specific, multilingual, and multimodal RAG under realistic constraints, particularly in settings where provenance and safety are paramount (e.g., clinical decision support, financial analysis, legal reasoning, and other safety-critical environments). Promising directions include systematic evaluations using domain-governed corpora; RAG pipelines for low-resource and morphologically complex languages, where retrieval, translation, and grounding interact non-trivially; and structure-aware retrieval frameworks that integrate heterogeneous data types—including tables, code, graphs, time-series, and images—alongside text. For each of these domains, researchers should report not only accuracy gains, but also failure modes, drift phenomena, and the operational burden associated with curating, refreshing, and governing the underlying corpora.
A fourth priority concerns the distinctive security and provenance vulnerabilities introduced by retrieval-based systems. Corpus poisoning, prompt-in-context manipulation, and data-exfiltration attacks illustrate that even small perturbations to an index can steer or compromise RAG pipelines. Addressing this requires a coordinated research agenda on security, provenance-aware, and robustness-driven RAG. Key directions include methods for corpus attestation and versioning in large vector stores; scheduled index refresh with provenance tracing; fast anomaly and backdoor detection in embedding space; and retriever–generator training paradigms that reduce reliance on untrusted or suspicious evidence. Evaluation protocols should report attack success rates alongside benign performance metrics. Closely connected is the broader challenge of governing long-lived memory and user-specific knowledge stores, including principled retention, revocation, auditing mechanisms, and user-level memory governance frameworks.
A fifth direction emerges from the rise of agentic and memory-augmented RAG architectures, which suggests the need for more integrated and policy-driven systems. Rather than treating retrieval, memory, tools, and generators as loosely coupled components, future research should explore agentic controllers that coordinate planning, execution, and reflection across the entire pipeline: structured chunking; hybrid or graph-based retrieval; re-ranking and compression; persistent memory; verification and citation; and decision-making conditioned on task uncertainty, evidence quality, and cost constraints. Such systems should learn when to retrieve, what to retrieve, and how much evidence to propagate; combine heterogeneous evidence under explicit provenance constraints; and trigger abstention, additional retrieval, or lightweight critique as needed. Realising this agenda will require declarative schemas for exposing retrievable units and grounding structures, learnable schedulers for allocating compute under budget constraints, and end-to-end training signals that couple retrieval decisions to downstream correctness and safety. Complementary lifecycle tooling is also needed to track dataset refresh cadence, index versioning, schema and prompt versions, and memory governance, along with monitoring frameworks that reveal not only what the model answered but why, from where, and under what uncertainty and resource usage.
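One narrow slice of such policy-driven control can be illustrated with a minimal sketch: a loop that retrieves only when generator confidence falls below a threshold, abstains when the best retrieved evidence is too weak, and otherwise regenerates conditioned on the evidence. The `generate` and `retrieve` callables, the thresholds, and the `Evidence` type are assumptions for illustration, not an interface from the reviewed literature.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    text: str
    score: float  # retriever relevance score in [0, 1] (assumed convention)

def controller(question, generate, retrieve,
               conf_threshold=0.7, evidence_threshold=0.4, max_rounds=2):
    """Minimal policy loop: generate first; if confidence is low, retrieve and
    regenerate with evidence; abstain if the retrieved evidence is too weak."""
    answer, conf = generate(question, evidence=None)
    evidence, rounds = [], 0
    while conf < conf_threshold and rounds < max_rounds:
        evidence = retrieve(question)
        if not evidence or max(e.score for e in evidence) < evidence_threshold:
            return "ABSTAIN", evidence
        answer, conf = generate(question, evidence=evidence)
        rounds += 1
    return answer, evidence

# Hypothetical stubs standing in for a real generator and retriever.
def fake_generate(question, evidence=None):
    return ("Paris", 0.9) if evidence else ("unsure", 0.2)

def fake_retrieve(question):
    return [Evidence("Paris is the capital of France.", 0.8)]

answer, used = controller("Capital of France?", fake_generate, fake_retrieve)
```

A fuller agentic controller would extend the same skeleton with re-ranking, compression, memory reads, and verification steps, each gated by the kinds of uncertainty, quality, and cost signals described above.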
Finally, at the level of evidence aggregation, the field would benefit from periodic updates of this review using time-normalised citation windows, expanded language and source coverage, preregistered search and screening protocols, and targeted sensitivity analyses. Such methodological rigour would enhance external validity and support ongoing refinement of the research agenda as the underlying corpus and technological landscape evolve.
These directions chart a path toward RAG systems that are not only more accurate, but also systematically evaluable, resource-efficient, secure, and governable across domains, languages, and modalities.

7. Conclusions

In this systematic review, we synthesised the RAG research landscape through a citation-weighted PRISMA protocol applied to 128 highly cited studies published between 2020 and May 2025. The evidence indicates that the field has cohered around five tightly linked themes—how evidence is surfaced, stored, chunked, represented, and consumed—and that recent systems increasingly move from single-pass pipelines to coordinated stacks in which retrieval policy and generation jointly determine quality.
In relation to RQ1 and RQ2, we show that knowledge-intensive and open-domain QA tasks dominate current practice, with particular depth in code and clinical domains, and that hybrid retrieval, structure-aware indexing, robust chunking, and encoder choice materially shape recall, precision, and faithfulness. Over time, there is a decisive shift from one-shot DPR + seq2seq baselines towards policy-driven systems that control when and how to retrieve, fuse heterogeneous signals (sparse+dense, graph-based), and balance efficiency through re-ranking, compression, long retrieval units, and memory. For RQ3, we find that credible evaluation practices couple answer-level metrics with retrieval diagnostics and groundedness checks, complement human assessment with calibrated LLM-as-judge protocols, and report cost and latency alongside quality. For RQ4, we identify five interdependent challenge families—evidence quality, pipeline coupling, resource budgets, LLM constraints, and security—and show how small retrieval imperfections, budget limits, and incomplete provenance can amplify risk unless addressed within an integrated system design.
The primary contributions of this review are threefold. First, it offers a unified taxonomy that organises retrieval design, vector indexing, chunking, encoder choice, and generation patterns into a modular stack, clarifying where design decisions interact and where improvements transfer. Second, it distils a minimal, practice-oriented evaluation rubric: track retrieval coverage and early relevance; pair tolerant and strict answer metrics; measure groundedness and abstention; include a small-sample user or calibrated LLM-judge signal; and report efficiency and robustness so that results are comparable and actionable. Third, it consolidates a prioritised research agenda that links observed failure modes to implementable baselines for budget-aware retrieval with abstention and re-ranking, lightweight verification for claim support with citations, scheduled index refresh with provenance logging, and minimal robustness probes for distractors and adversarial context, which we elaborate in Section 6.
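The rubric's metric pairing can be made concrete with a small sketch: Recall@k as the retrieval-coverage diagnostic, exact match as the strict answer metric, and token-level F1 as the tolerant one. These are illustrative implementations of the standard definitions, not code taken from any included study.

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold evidence documents appearing in the top-k retrieved."""
    top = set(retrieved_ids[:k])
    return sum(g in top for g in gold_ids) / len(gold_ids)

def exact_match(pred, gold):
    """Strict answer metric: whitespace- and case-normalised string equality."""
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(pred) == norm(gold))

def token_f1(pred, gold):
    """Tolerant answer metric: F1 over whitespace tokens (with multiplicity)."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Reporting these three numbers side by side, together with groundedness, abstention, and efficiency measures, gives the comparable and actionable picture the rubric calls for.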
These findings should be interpreted in light of the boundaries of the review. The citation-weighted selection procedure emphasises influence and transparency, but introduces citation-lag risk and may under-sample recent or niche contributions. The corpus prioritises five major digital libraries plus DBLP and English-language publications, improving deduplication and comparability while likely under-representing grey literature and work in other languages. Although titles and abstracts were double-screened, full-text extraction relied on a single reviewer with verification, supported but not replaced by LLM assistance, leaving residual selection and extraction bias possible. Finally, the search window ending in May 2025 limits temporal generalisability; periodic updates with broader language and source coverage, preregistered protocols, and sensitivity analyses would further strengthen external validity.
This review clarifies how contemporary RAG systems surface, store, chunk, represent, and consume evidence, and how design choices across this stack interact with evaluation practice, resource budgets, and security. As the field progresses, the research directions outlined in Section 6 can guide the development of RAG systems that are not only more accurate but also reliable, efficient, and defensible at scale.

Author Contributions

Conceptualization, A.B. and B.D.; methodology, A.B. and M.R.; software (Python scripts for data conversion and deduplication), A.B.; validation, M.R. and B.D.; formal analysis and data curation, A.B. and M.R.; investigation (literature search, screening, and data extraction), A.B. and M.R.; writing—original draft preparation, A.B.; writing—review and editing, M.R. and B.D.; visualization (PRISMA diagram, tables), A.B.; supervision, B.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Advanced Research and Engineering Centre (ARC) in Northern Ireland, funded by PwC and Invest NI. The views expressed are those of the authors and do not necessarily represent those of ARC or the funding organisations.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All study-level data extracted for this review are publicly available via Zenodo at https://zenodo.org/records/17339384 (accessed on 13 October 2025).

Acknowledgments

The authors appreciate the use of the Kelvin2 High Performance Computing cluster funded by the Engineering and Physical Sciences Research Council (EPSRC) and jointly managed by Queen’s University Belfast (QUB) and Ulster University.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RAG: Retrieval-Augmented Generation
LLMs: Large Language Models
NLP: Natural Language Processing
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
DPR: Dense Passage Retriever

Appendix A. Inter-Rater Agreement for Screening Decisions

Table A1. Contingency table for inter-rater agreement (pre-adjudication, record-level).
                      Reviewer 2: Include   Reviewer 2: Exclude   Row Total
Reviewer 1: Include   a = 134               b = 7                 141
Reviewer 1: Exclude   c = 4                 d = 57                61
Column total          138                   64                    N = 202
Notes. Observed agreement P_o = (a + d)/N = 191/202 = 0.946; expected agreement from the marginals P_e = 0.573; Cohen's κ = (P_o - P_e)/(1 - P_e) = 0.873. Disagreements were resolved by discussion.
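The agreement statistics reported in the notes can be reproduced with a short, illustrative helper (the function name `cohens_kappa` and the explicit cell arguments are assumptions for the sketch):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 contingency table:
    a = both include, b = R1 include / R2 exclude,
    c = R1 exclude / R2 include, d = both exclude."""
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    # Expected agreement from the row and column marginals.
    p_yes = ((a + b) / n) * ((a + c) / n)
    p_no = ((c + d) / n) * ((b + d) / n)
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa(134, 7, 4, 57)  # cell values from Table A1
# kappa rounds to 0.873, matching the reported value.
```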

Appendix B. Study Characteristics Extracted from the Systematic Review

Table A2. Study characteristics extracted for the 128 included RAG studies, organised by domain: datasets, chunking mechanisms, retrieval mechanisms, vector-space encoders, and generation models.
Datasets | Chunking Mechanism | Retrieval Mechanism | Vector Space Encoder | Generation Model
Domain: Knowledge-Intensive Tasks
  • 100-token passages [144]
  • 100-word chunks [1]
  • 6–10 sentences [151]
  • A decompose-then-recompose algorithm splits each retrieved document into smaller strips, filters out irrelevant portions, and reassembles the relevant parts [46].
  • Align passage segmentation with paragraph boundaries [54].
  • Approximately 300 words each [44]
  • Combine short paragraphs when possible [54].
  • Each doctor-patient dialogue as an individual chunk [149].
  • Each document as a distinct chunk [149].
  • Each email as a separate data piece [149].
  • Fixed-length 100-words [61].
  • Fixed-length passages averaging 180 tokens [169]
  • Fixed-size 1200-token chunks [72]
  • Fixed-size 64 tokens [62]
  • Fixed-size 64-token chunks [152]
  • Flexible Intervals 32 tokens [62]
  • K-hop ego-graphs [153]
  • Max of 2000 tokens each chunk [160]
  • Overlapping—half the chunk size [151]
  • Parse each support ticket into a tree structure instead of fixed-length chunks [60].
  • Passages [155]
  • Represent sections like Summary, Description, and Steps to Reproduce as tree nodes [60].
  • Semantic-level Chunking [162]
  • Adaptive Retrieval: The model generates a special “Retrieve” reflection token to determine on demand whether external knowledge is needed [96].
  • BM25 [201]
  • Combines a retrieval-augmented generator with a memory selector. Iteratively refines and improves the generation process [163].
  • Composite structured prompting strategy that includes a command component (e.g., “Please repeat all the context”) to extract the retrieved content effectively [149].
  • Contriever: Uses a contrastive learning framework without supervision [44].
  • Dense Retrieval [1,44,46,48,54,60,61,62,79,87,89,131,133,135,136,144,149,151,152,155,160,161,202]
  • Dense Retrieval—Dynamically triggered by RIND based on the LLM’s information needs [49].
  • Dense Retrieval—FAISS [169,206]
  • Dense Retrieval and Bing Search Engine Retrieval [143]
  • Dual-level retrieval [72]
  • Dynamic Content Prediction—generative model forecasts upcoming content and dynamically forms retrieval queries [143].
  • Efficient K-hop Subgraph Retrieval [153]
  • Evolving-Based Retrieval [164]
  • External search engines—primarily DuckDuckGo [201]
  • Graph database query (e.g., Cypher) [60]
  • Hybrid with HyDE [162]
  • ANCE (Dense) [151]
  • Alibaba-NLP/gte-large-en-v1.5 (Dense) [162]
  • BAAI/LLM-Embedder (Dense) [162]
  • BAAI/bge-base-en (Dense) [162]
  • BAAI/bge-base-en-v1.5 (Dense) [162]
  • BAAI/bge-large-en (Dense) [162]
  • BAAI/bge-large-en-v1.5 (Dense) [162]
  • BAAI/bge-small-en (Dense) [162]
  • BAAI/bge-small-en-v1.5 (Dense) [162]
  • BERT (Dense) [60]
  • BERT-base (Dense) [1]
  • BERT-based (Dense) [54,151,152]
  • BGE (Dense) [151]
  • BGE-Base (Dense) [169]
  • BGE-Large (Dense) [169]
  • BM25 (Sparse) [61,143,162]
  • CLIP Variants (Dense) [151]
  • ColBERTv2 (Dense) [135,169]
  • Contriever (Dense) [44,46,162]
  • Contriever-MSMARCO (Dense) [49]
  • DPR (Dense) [169]
  • Dragon (Dense) [44,169]
  • E5 (Dense) [60,151]
  • E5-Large (Dense) [169]
  • E5-Mistral (Dense) [169]
  • GPT4All [89]
  • GTR (Dense) [136]
  • Graph Attention Network (GAT) [85]
  • Graph Transformer (Dense) [48]
  • OpenAI’s text-embedding-3-large (Dense) [201]
  • SFR (Dense) [169]
  • SentenceBERT (Dense) [153]
  • SentenceBERT (SBERT) (Dense) [48]
  • ActSTD [164]
  • Alpaca [46]
  • Alpaca-13B [96]
  • Alpaca-7B [96]
  • BART [1,155]
  • BART-Large [54]
  • Both fine-tuned small models and few-shot prompted LLMs [163]
  • CRAG [46]
  • ChatGPT [96]
  • CodeLlama-7B [160]
  • DiffTraj [164]
  • Evidentiality prediction: An additional decoder is used for predicting the evidentiality of each passage [202]
  • FLAN-T5 xlarge [85]
  • FLAN-T5 xxlarge [85]
  • Flan-T5 [133,151]
  • Fusion-in-Decoder [155,202]
  • GPT-3.5 [160]
  • GPT-3.5-turbo [89,135,149,162]
  • GPT-3.5-turbo-0613 [61,161,164]
  • GPT-4 [60,160]
  • GPT-4-0613 [61,161]
  • GPT-4o-mini [72,164]
  • GPT-Neo-1.3B [149]
  • GPT-like decoder [152]
  • GPT4All [89]
  • Internvl2.5-8B [151]
  • Llama-13b-Chat [149]
  • Llama-2 [46,133]
  • Llama-2-13B [48]
  • Llama-2-13B (with LoRA fine-tuning) [151]
  • Llama-2-13B-Chat [61,144,161]
  • Llama-2-13b-chat-hf (for scale comparison) [153]
  • Llama-2-70B [44]
  • Llama-2-70B-Chat (4-bitQ) [61,161]
Domain: Knowledge-Intensive Tasks Continued…
  • Sentence-level Chunking [162]
  • Sentences [133]
  • Sentences or sub-sentence [96]
  • Sentences: 64 tokens [143]
  • Sliding Window Chunking Technique [162]
  • Small-to-Big Chunking Technique [162]
  • Splits large documents into smaller text chunks—sentence, paragraph level or often configured to a maximum size (e.g., 500 characters) [89]
  • The graph is converted into CSV-style representations by processing its nodes and edges [48].
  • Token length 256 [206]
  • Token-level Chunking [162]
  • Truncate overly long paragraphs [54].
  • 100 words [151]
  • LLM-driven query reformulation to finalise sub-graph [60]
  • Large-scale web search [46].
  • Learning-Based Retrieval [164]
  • Multimodal [151]
  • Retrieval is dynamically triggered by RIND based on the LLM’s information needs [144].
  • Retrieves subgraphs based on query relevance using the Prize-Collecting Steiner Tree optimization [48].
  • Sparse Retrieval [61,151]
  • Subgraph Retrieval [85]
  • T5 encoder (Dense) [87,202]
  • all-MiniLM-L6-v2 (Dense) [149]
  • bge-large-en (Dense) [206]
  • bge-large-en-v1.5 (Dense) [149]
  • e5-large-v2 (Dense) [61]
  • e5-base-v2 (Dense) [149]
  • embedding OpenAI (No name) [89]
  • intfloat/e5-large-v2 (Dense) [162]
  • intfloat/e5-small-v2 (Dense) [162]
  • jinaai/jina-embeddings-v2-small-en (Dense) [162]
  • multilingual-e5-large (Dense) [61]
  • sentence transformer (Dense) [62]
  • sentence-transformers/all-mpnet-base-v2 (Dense) [162]
  • text-embedding-3-small (Dense) [79]
  • text-embedding-ada-002 (Dense) [44,160]
  • thenlper/gte-base (Dense) [162]
  • thenlper/gte-small (Dense) [162]
  • Llama-2-7B [44,48,151]
  • Llama-2-7B-Chat [61,144,161,201]
  • Llama-2-7b-chat-hf (LoRA fine-tuning) [153]
  • Llama-2-7b-chat-hf (frozen LLM) [153]
  • Llama-3-8B [164]
  • Llama-3-8B-instruct [151]
  • Llama-7b-Chat [149]
  • Llama2-13B [96]
  • Llama2-70B [79]
  • Llama2-7B [79,96,206]
  • Llama2-FT7B (retrieval-fine-tuned Llama2) [96]
  • Llava-7B [151]
  • Llava-ov-7B [151]
  • MiniGPT-4 [48]
  • Mistral-7B [79,206]
  • Mistral-7B-Instruct [61,161,169]
  • Mixtral-8×7B [79,169]
  • NeMo GPT-43B (proprietary) [44]
  • Orca2-7B [206]
  • PaLM-2-S [136]
  • PaLM-2-XXS [136]
  • Perplexity.ai [96]
  • Qwen-1.5-14B [151]
  • Qwen2-vl-7B [151]
  • RETRO+ [62]
Domain: Knowledge-Intensive Tasks Continued…
  • OpenAssistant [151,201]
  • OpenBookQA [151,162,201]
  • OpenBookQA (OBQA) [85]
  • Osaka Personal Activity Trajectory [164]
  • Payment Insurance Calculation.txt [89]
  • Physical Interaction Question Answering (PIQA) [85]
  • PopQA [46,96,151,201]
  • Pre-training Corpus [152]
  • PubHealth [46,96,162]
  • PubMedQA [85,162,169]
  • PwC Reading-Comprehension Corpus [169]
  • QMSum [44]
  • Qasper [44]
  • QuAIL [169]
  • QuALITY [44]
  • RACE [152]
  • RAGTruth [161]
  • REALTOXICITYPROMPTS [152]
  • RealNews [62]
  • RiddleSense [85]
  • SODA [152]
  • SQuAD [162]
  • SQuAD v2 [169]
  • SamSum [169]
  • SceneGraphs [48]
  • SearchQA [1]
  • Self-Instruct [152]
  • StrategyQA [49,135,143,144]
  • T-REx [54,87,135]
  • TACRED [54]
  • TREC DL19 and TREC DL20 [162]
  • Tokyo Personal Activity Trajectory [164]
  • TriviaQA [1,79,87,133,152,155,162,169,202]
  • TriviaQA-unfiltered [96]
  • TruthfulQA [152,162,169]
  • UMLS [85]
  • UltraDomain—Agriculture [72]
  • UltraDomain—CS [72]
  • UltraDomain—Legal [72]
  • UltraDomain—Mixed [72]
  • Unnatural Instructions [152]
  • RETRO++ [152]
  • RETRO-582M [62]
  • Ret-ChatGPT [96]
  • Ret-Llama2-chat [96]
  • SAIL-7B [96]
  • SELF-RAG-13B [96]
  • SELF-RAG-7B [96]
  • SelfRAG-Llama-2 [46]
  • T5 [155]
  • T5-Base [87]
  • T5-Large [87]
  • T5-XL [87]
  • Toolformer-6B [96]
  • TrajGAIL [164]
  • Vicuna-13B [135]
  • Vicuna-13B-v1.5 [144]
  • text-davinci-003 [49,143]
Domain: Knowledge-Intensive Tasks Continued…
  • W3C-Email [149]
  • Web Questions (WebQA) [169]
  • Web-based data [89]
  • WebQSP [48,153]
  • WebQuestions [1,155,162]
  • WikiAsp [143]
  • WikiMultiHopQA [135]
  • WikiQA [169]
  • Wikipedia [46,61,62,79,155]
  • Wikipedia (December 2018) [1,49]
  • Wikipedia (December 2021) [49]
  • Wikipedia (October 2017) [49]
  • Wikipedia Corpus [136]
  • Wikipedia dump (December 2021) [169]
  • Wikitext-103 [149]
  • WinoGrande [152]
  • Wizard of Wikipedia [87,133,202]
  • WizardLM [151,201]
  • Word-in-Context [152]
  • XSum [163]
  • Yelp 2021 [161]
  • YouTube video content [89]
  • Zero-Shot RE [54,87,135,151]
  • codeparrot/github-jupyter dataset [160]
  • lyft_2021 [162]
Domain: Open-Domain Question Answering
  • 2WikiMQA [47,182]
  • 2WikiMultiHopQA [88]
  • 2WikiMultihopQA [142]
  • AGNews [130]
  • AdvBench [130]
  • BBQ [130]
  • CORD-19 [93]
  • COVID-19 QA [93]
  • CoQA [81]
  • Conceptual Caption [58]
  • Encyclopedic-VQA [63]
  • English Wikipedia (December 2018 Dump) [55]
  • EntityQuestion [134]
  • FreshQA [77]
  • Google Search [77]
  • HotpotQA [47,81,88,128,130,138,142,182]
  • HotpotQA (HQA) [132]
  • IIRC [47]
  • InfoSeek [63]
  • LAION [58]
  • LLaVA-Instruct [63]
  • MS MARCO [130]
  • MetaQA [68]
  • Mintaka [138]
  • MuSiQue [47,142]
  • MultifieldQA-en [128]
  • MultimodalQA [58]
  • Natural Questions [43,55,81,88,128,130,137,138,159,182]
  • Natural Questions (NQ) [142]
  • NaturalQuestions (NQ) [132]
  • NewsQA [93]
  • OK-VQA [53]
  • PDFTriage [47]
  • PopQA [77,134]
  • Probably-Asked Questions [58]
  • QAConv [93]
  • Qasper [128]
  • RealNews [43]
  • RealTimeQA [77,159]
  • Reddit Webis-TLDR-17 Dataset [137]
  • 100 words [43,93,182]
  • Aggregate knowledge-graph triples into aggregated textual statements [68]
  • Batch Grounding: Retrieved documents are processed in user-defined batches (e.g., 3 docs at a time); grounding stops once evidence is cited [142].
  • For PDFs, extract pages and tables as separate nodes [47].
  • Group short documents into longer units (from less than 1k tokens to around 4k tokens) [128].
  • Image is divided into image patches using a sliding window with a set stride [53]
  • Image-only entries [58]
  • Image–text pairs [58]
  • Individual Sentences [134]
  • Maximum sequence length of 256 tokens [130]
  • Non-overlapping segments of 100 words [55,137]
  • Question Decomposition: The LLM breaks the original multi-hop question into simpler single-hop sub-questions before retrieval/encoding [142].
  • Split each document into individual passages (text blocks) [47]
  • Split long units into fixed-size chunks of 512 tokens [128].
  • Summary Paragraph [134]
  • Text-only entries [58]
  • Use 4K-token chunking when applicable [128].
  • Use each node (passage, page, or table) [47]
  • 600 characters [63]
  • Asynchronous Updates—Re-encode and re-index the knowledge base during training [93].
  • BM25 [81]
  • Contriever-MS-MARCO retriever [182]
  • Dense Retrieval [43,55,58,77,81,88,93,128,130,132,137,138,142,159]
  • Document-level retrieval with CLIP [63]
  • Explicit Knowledge Retrieval [53]
  • Google API [142]
  • Hierarchical two-step retrieval [63]
  • Implicit Knowledge Retrieval [53]
  • Iteratively retrieval—candidate relations for the current entity set, then select and rank the most relevant ones using LLM prompts and weighted voting [68].
  • Knowledge Graph Traversal: LLM-based KG traversal agent [47]
  • Passage-level retrieval with Contriever [63]
  • Sparse Retrieval [43,81,128,138,142]
  • ADORE (Dense) [137]
  • BCEmbedding (Dense) [132]
  • BERT-base (Dense) [43]
  • BERT-based (Dense) [93,134]
  • BGE cross-encoder reranker (Dense) [132]
  • BGE-Large-En-V1.5 (Dense) [130]
  • BM25 (Sparse) [43,134,137,138,142]
  • CLIP (Dense) [63]
  • CLIP model—ViT-B/16 (Dense) [53]
  • ColBERTv2 (Dense) [132,142]
  • Contriever (Dense) [43,63,77,134,137]
  • Dense Passage Retriever (Dense) [132]
  • Dual Encoder: BERT (unsupervised training procedure) (Dense) [55]
  • E5-Mistral-7B (Dense) [128]
  • Elastic Learned Sparse Encoder (ELSER) (Sparse) [81]
  • KNN-MDR (Fine-tuned) (Dense) [47]
  • KNN-ST (Dense) [47]
  • MPNet (Dense) [138]
  • Multimodal encoder—T5 and ViT (text and image) (Dense) [58]
  • RankLLaMA (Dense) [132]
  • Sentence Transformers (Dense) [81]
  • Spider (Dense) [43]
  • T5 (text) (Dense) [58]
  • TAGME Entity Linking (Sparse + Semantic) [47]
  • TF-IDF (Sparse) [47]
  • UAE-Large-V1 (Dense) [130]
  • ViT (images) (Dense) [58]
  • bge-large-en-v1.5 (Dense) [128]
  • transformer-based encoder (BERT-based) (Dense) [137]
  • BART [93]
  • Blended RAG [81]
  • ChatGPT [47,68,138]
  • Claude-3-Opus [128]
  • Claude-3.5-Sonnet [128]
  • Decoder (no name) [58]
  • DeepSeek-V2-Chat [128]
  • Falcon-7B (4-BitQ) [137]
  • Flan-250M [138]
  • Flan-Large [138]
  • Flan-T5-XXL [81]
  • Flan-T5-base [134]
  • Flan-T5-large [134]
  • Flan-T5-small [134]
  • Flan-T5-xl [134]
  • Flan-T5-xxl [134]
  • Flan-T5XL [63]
  • Flan-XL [138]
  • Fusion-in-Decoder [55]
  • GLaM (Oneshot) [81]
  • GLaM (Zeroshot) [81]
  • GPT-2 [43]
  • GPT-3 [68]
  • GPT-3.5 [77,132]
  • GPT-3.5-turbo [130,142]
  • GPT-4 [77,132,159]
  • GPT-4-turbo [128]
  • GPT-4o [128]
  • GPT-J [43]
  • GPT-Neo [43]
  • GPT-Neo-1.3B [130]
  • Gemini-1.5-Pro [128]
  • Llama [43,47]
  • Llama-13b-Chat [130]
  • Llama-2-13B [132]
  • Llama-2-70B-Chat [68]
  • Llama-2-7B [77,132,159,182]
  • Llama-2-7B (4-BitQ) [137]
  • Llama-2-Chat [134]
  • Llama-3-8B [132]
  • Llama-3-Chat [134]
  • Llama-33B [88]
  • Llama-7b-Chat [130]
Domain: Open-Domain Question Answering Continued…
  • SQuAD [81]
  • SST-2 [130]
  • StrategyQA [142,182]
  • TREC-COVID [81]
  • The Pile [43]
  • ToolQA [77]
  • TriviaQA [43,55,77,88,159,182]
  • TriviaQA (TQA) [132]
  • Visual Question Answering [58]
  • WebQA [58]
  • WebQSP [68,138]
  • WebQuestions [55,130]
  • WebQuestions (WebQ) [159]
  • WebQuestionsSP (WebQSP) [132]
  • WikiText-103 [43]
  • Wikidata [53,138]
  • Wikipedia [63,77,134,138,159,182]
  • Wikipedia (2017/2018 dump) [142]
  • Wikipedia (December 2018 Dump) [43]
  • Wikipedia (December 2018) [88]
  • Wikipedia English (December 2018) [137]
  • Wikipedia passage corpus [132]
  • WitQA [134]
  • MPT-7B (4-BitQ) [137]
  • MiniCPM [134]
  • Mistral [134]
  • Mistral-7B [132,142]
  • OPT [43]
  • PaLM540B (Oneshot) [81]
  • Phi-2-2.7B [77,132]
  • Phi-2-2.7B (4-BitQ) [137]
  • Qwen-1.5-0.5B [132]
  • Qwen-1.5-1.8B [132]
  • Qwen-1.5-14B [132]
  • Qwen-1.5-4B [132]
  • Qwen-1.5-7B [132]
  • Qwen2-7B [132]
  • RAG-end2end [81]
  • RAG-original [81]
  • Self-RAG-7B [77]
  • StableLM2 [134]
  • T0 [138]
  • T5 (fine-tuned) [47]
  • T5-780M [182]
  • TinyLlama-1.1B [77,134]
  • Vicuna-7B [63]
  • Zephyr [134]
  • encoder-decoder transformer architecture (initialized with models like T5 or BART) [53]
  • text-davinci-002 [88]
  • text-davinci-003 [88,182]
Domain: Software Engineering
  • ACE04 [64]
  • ACE05 [64]
  • AI Tutor [139]
  • BioASQ [139]
  • C Code Summarization Dataset [51,167]
  • CASIE [64]
  • CoNLL03 [64]
  • CoNLL04 [64]
  • Code Refinement [80]
  • CodeMatcher [181]
  • CodeSearchNet [167,181]
  • CodeXGLUE [167,181]
  • Cognitive Reviewer [139]
  • Concode [167]
  • CrossCodeEval [73]
  • CrossCodeLongEval [73]
  • Defects4J [80]
  • Django [181]
  • Hearthstone [181]
  • InferredBugs [129]
  • Multi-programming Language Commit Message [170]
  • NL2Bash [67]
  • NLC2CMD [67]
  • NYT [64]
  • PyTorrent [181]
  • Python Code Summarization Dataset [51]
  • RTLLM [147]
  • RepoEval [50,73]
  • ServiceNow Internal Data [145]
  • TFix [80]
  • The Stack [73]
  • VerilogEval [147]
  • VerilogEval-syntax [147]
  • 20 lines per chunk [73]
  • 50 lines per chunk [73]
  • Code Property Graphs from source code [51]
  • Code Segments [129]
  • Code Snippets [67]
  • Code diff and commit message [170]
  • Code snippets, bug-fix pairs, or other programming language constructs [80]
  • Fixed-size sliding window [73]
  • Fragment-alignment [73]
  • Heuristics-based chunking: use punctuation and paragraph breaks [139].
  • Partition code files using a sliding window approach [50]
  • Semantic chunking: use the text’s inherent semantics [139].
  • Sentence [64]
  • Stride = ½ chunk size (overlap) [73]
  • A curated retrieval database is used, where compiler error messages (error tags) are exactly matched to stored human solutions [147].
  • Anonymous sentence embedding-based retrieval strategy [64]
  • Dense Retrieval [50,73,129,139,145,167]
  • Dense Retrieval—Lucene [181]
  • Header2Code [181]
  • Hybrid Patch Retriever: Lexical-based and Semantic-based [80]
  • Iterative Retrieval [50]
  • NL2Code [181]
  • NL2NL [181]
  • Retrieving Similar Code [51]
  • Semantic Code Diff Retriever [170]
  • Semantic and Lexical Similarity [67]
  • Sparse Retrieval [50,73,167]
  • Supports Unimodal and Bimodal [167]
  • The retrieval process enriches the input prompt with specific instructions and demonstrations for syntax error resolution [147].
  • BM25 (Sparse) [80,167]
  • BiLSTM (Dense) [51]
  • Bidirectional transformer encoder model (no name) (Dense) [129]
  • CodeBERT [181]
  • CodeBERT (fine-tuned) (Dense) [67]
  • CodeBERT and GraphCodeBERT (named SCODE-R) (Dense) [167]
  • CodeDiff Encoder (Dense) [170]
  • CodeT5’s encoder (Dense) [80]
  • Commit Message Encoder (Dense) [170]
  • Jaccard token-set (Sparse) [73]
  • MPNet (Dense) [64]
  • RoBERTa [181]
  • Sentence-BERT [181]
  • TF-IDF (Sparse) [73]
  • UniXcoder (Dense) [50,73]
  • Weighted n-gram (Sparse) [73]
  • all-mpnet-base-v2 (Dense) [145]
  • bag-of-words (Sparse) [50]
  • gtr-t5-base (Dense) [145]
  • gtr-t5-large (Dense) [145]
  • gtr-t5-xl (Dense) [145]
  • gtr-t5-xxl (Dense) [145]
  • Attention-based LSTM model [51]
  • CODEGEN-2B [50]
  • CODEGEN-350M [50]
  • CODEGEN-6B [50]
  • ChatGPT [139]
  • CodeGPT [181]
  • CodeGen-Mono-16B [73]
  • CodeGen-Mono-2B [73]
  • CodeGen25-7B [73]
  • CodeLlama-16B [73]
  • CodeLlama-7B [73,145]
  • CodeT5 [80]
  • Decoder [170]
  • Exemplar Guider [170]
  • GPT-3.5 [147]
  • GPT-3.5-turbo [50]
  • GPT-3.5-turbo-0613 [73]
  • GPT-4 [139,147]
  • LSTM [181]
  • Mistral-7B-v0.1 [145]
  • PLBART (named SCODE-G) [167]
  • Repoformer-16B [73]
  • Repoformer-1B [73]
  • Repoformer-3B [73]
  • Repoformer-7B [73]
  • StarCoder-16B [73]
  • StarCoderBase-15.5B [145]
  • StarCoderBase-1B [73,145]
  • StarCoderBase-3B [73,145]
  • StarCoderBase-7B [73,145]
  • Three encoders [170]
  • Transformer Decoder [181]
  • Transformer-XL [181]
  • Transformer-based architecture—decoder component [67]
  • code-cushman-001 (Codex large language model—fine-tuned) [129]
  • gpt-3.5-turbo [181]
  • gpt-3.5-turbo-16k-0613 [147]
  • text-davinci-002 [64]
  • text-davinci-003 [64,181]
Domain: Medical
  • 14 De-identified Clinical Scenarios [56]
  • 30 publicly available American Association for the Study of Liver Diseases [126]
  • 35 Preoperative Guidelines [56]
  • Apnea-ECG [150]
  • Biomedical Instructions [158]
  • CXR-PRO [173]
  • Clinical Practice Guidelines [158]
  • Dynamed [185]
  • European Association for the Study of the Liver [183]
  • Harvard-FairVLMed [91]
  • Hospital Neurology Discharge Summaries [165]
  • Human-generated responses [56]
  • IU-Xray [91]
  • LiveQA: Long-form question answering dataset [158].
  • MIMIC-CXR [91]
  • MMLU (Med) [158]
  • MS-CXR [173]
  • MedInstruct [158]
  • MedMCQA: Multi-choice question answering dataset [158].
  • MedQA: Multi-choice question answering dataset [158].
  • Medical Textbook [158]
  • MedicationQA [158]
  • Mol-Instructions [158]
  • Northern American HCV Guidelines [183]
  • Online Sources Nursing Knowledge JSON [165]
  • PubMed Central Full-text [158]
  • PTB-XL [150]
  • PTB-XL+ [150]
  • Patient Inquiry Dataset [165]
  • Patient Symptom Record Dataset [165]
  • PubMed Abstract [158]
  • PubMed Clinical Papers [185]
  • Scoliosis Research Society’s [185]
  • UpToDate [185]
  • 1000 tokens with an overlap of 100 tokens [56]
  • 128 words with a 32-word overlap [158]
  • 2000 tokens [185]
  • Apnea-ECG: Split recordings into one-minute segments [150].
  • Cleans text [183]
  • Converts tables into text-based lists [183]
  • Full medical reports [91]
  • Hierarchical sections [65]
  • Labels paragraphs [183]
  • LangChain’s RecursiveCharacterTextSplitter [65]
  • PTB-XL+: Use pre-made ECG features as natural chunks [150].
  • Paragraphs [65]
  • single-disease/topic JSON entries (one entry = one chunk) [165]
  • Dense Retrieval [91,127,150,158,165,183,185]
  • Microsoft Azure Cognitive Search services [126]
  • Multimodal Retrieval—Dense Retrieval [65]
  • Pinecone’s Retrieval Agent [56]
  • Similarity of embeddings [173]
  • ALBEF (Dense) [173]
  • BM25 (Sparse) [65]
  • BioClinicalBERT (Dense) [91]
  • Contriever (Dense) [65]
  • Employs Microsoft Azure OpenAI’s ADA Text Embedding Version 2 model (text-embedding-ada-002) (Dense) [126]
  • MPNet (Dense) [185]
  • MedCPT (Dense) [65,158]
  • OpenAI’s text-embedding-ada-002 (Dense) [56]
  • ResNet-50 (Dense) [91]
  • SPECTER (Dense) [65]
  • text-embedding-ada-002 (Dense) [150,165]
Domain: Other
  • 1000-User Benchmark Subset [166]
  • Aggregated flood event listings from EMSR, GDACS, and ReliefWeb [186]
  • Amazon Movie Reviews [174]
  • Australian Open Legal QA [69]
  • Bing Search Logs [166]
  • CaseHOLD [69]
  • Census/projection-disaggregated gridded population datasets [186]
  • ChEBI-20 Dataset [146]
  • Colossal Clean Crawled Corpus [146]
  • Facebook Books [157]
  • FloodBrain ablation study dataset [186]
  • FloodBrain evaluation dataset [186]
  • Harvard Law Case Corpus [69]
  • Human-Edited Counterfactuals Subset of IMDb [174]
  • IMDb Movie Review [174]
  • Inherent Paired Nature of MNLI (Premise and Hypotheses) [174]
  • LEDGAR [69]
  • LaMP [207]
  • LegalBench collection [69]
  • MNLI [174]
  • Microsoft Research Paraphrase Corpus [76]
  • MovieLens100K [157]
  • OpenStreetMap Planet dump [186]
  • ParaSCI-ACL [76]
  • Quora Question Pairs 140K [76]
  • Quora Question Pairs 50K [76]
  • ReliefWeb flood reports [186]
  • The human cost of disasters (2000–2019) [186]
  • WaNLI [174]
  • Wikipedia [174]
  • Yelp Reviews [174]
  • ZINC-15 [146]
  • A prefiltering strategy to manage the token limit imposed by the API [157].
  • Bing Search Logs (May–July 2023) comprising user queries and clicked results, filtered and sampled to 1000 users for evaluation [166]
  • Each legal case is broken down into question, support snippet, extracted entities, and an answer [69].
  • Full-text pages from Wikipedia and a curated set of 500 high-traffic news domains, retained to maximize reliable entity linking [166]
  • LangChain with a chunk size (set as 1000 tokens) and chunk overlap (200 tokens) [71]
  • Small prompt segments (p1:s and q1:t) [76]
  • BM25-based Caption Retrieval [146]
  • Counterfactual Dense Retrieval (CF-DPR) [174]
  • Dense Retrieval [71,76,166]
  • Google Custom Search [186]
  • Morgan Fingerprints-based Molecule Retrieval [146]
  • ROPG-KD [207]
  • ROPG-RL [207]
  • Three-pronged retrieval approach: Intra query matching, Inter context matching, Hybrid weighted retrieval [69]
  • AnglE-BERT (Dense) [69]
  • BERT (Dense) [69]
  • Contriever (Dense) [166,207]
  • LegalBERT (Dense) [69]
  • Morgan Fingerprints: Converts molecular SMILES representations into binary bit vectors that capture the presence or absence of chemical substructures [146].
  • Sparse encoding mechanism, effectively capturing detailed structural features of molecules [146].
  • Two independent BERT encoders (Dense) [174]
  • paraphrase-mpnet-base-v2 (Dense) [76]
Domain: Evaluation
  • 2WikiMultihopQA [208]
  • ClashEval Drug Dosage [180]
  • ClashEval Locations [180]
  • ClashEval Names [180]
  • ClashEval News [180]
  • ClashEval Sports Records [180]
  • ClashEval Wikipedia Dates [180]
  • En.MC [208]
  • En.QA [208]
  • FEVER [187]
  • Factual Recall Questions [98]
  • False Premise Questions [98]
  • General Legal Research [98]
  • HotpotQA [187,208]
  • Jurisdiction or Time-Specific Research [98]
  • MuSiQue [208]
  • MultiFieldQA [208]
  • MultiHop-RAG dataset [176]
  • MultiRC [187]
  • NarrativeQA [208]
  • Natural Questions [187]
  • QMSum [208]
  • Qasper [208]
  • RGB (Retrieval-Augmented Generation Benchmark) [178]
  • ReCoRD [187]
  • WikiEval [188]
  • Wizard of Wikipedia [187]
  • Fixed-size sliding windows of 300 words per chunk [208]
  • Contriever (Dense) [208]
  • Dragon (Dense) [208]
  • Ask Practical Law AI proprietary LLM [98]
  • Claude-3-Opus [180]
  • Claude-3.5-Sonnet [180]
  • GPT-3.5-turbo [208]
  • GPT-3.5-turbo-0125 [180]
  • GPT-4-turbo-2024-04-09 [98]
  • GPT-4o [180,208]
  • Gemini-1.5-Flash [180]
  • Gemini-1.5-Pro [208]
  • Lexis+ AI proprietary LLM [98]
  • Llama-3-8B-instruct [180]
  • Westlaw AI-Assisted Research (GPT-4-based) [98]
Domain: Multimodal
  • ActivityNet Captions [84]
  • CC12M [83]
  • CC3M [83]
  • CCS [83]
  • COCO [92]
  • COYO-700M [83]
  • Common Objects in Context [82]
  • Flickr30k [82,83]
  • Google Search corpus [86]
  • LAION [52]
  • MS-COCO [52,83]
  • MSR-VTT [84]
  • MSVD [84]
  • NoCaps [83]
  • OKVQA [86]
  • SBU [83]
  • VATEX [84]
  • CLIP-ViT grid patches [83]
  • Image and Sentences pairs [82]
  • Single Image [92]
  • Subword [83]
  • Uniformly samples up to 25 frames or clips per video, then applies temporal deformable convolution [84].
  • Dense Retrieval [83,84,86,92]
  • Multimodal Retriever—Dense Retrieval [52]
  • Similarity Search [82]
  • BERT-base (Dense) [86]
  • Byte Pair Encoding (BPE) and sinusoidal positional encodings (Dense) [92]
  • CLIP (Dense) [52]
  • CLIP (query) (Dense) [82]
  • CLIP ResNet-based visual encoder (Dense) [92]
  • Images: CLIP-ViT (Dense) [83]
  • LXMERT (image and text) (Dense) [82]
  • Text: transformer-based bidirectional encoder (Dense) [83]
  • temporal deformable convolutional encoder (Dense) [84]
  • CM3 Model [52]
  • RETRO [83]
  • T5 [86]
  • Transformer encoder-decoder architecture [92]
  • Transformer-based GPT-2 model [82]
  • V&L encoder is used with a decoder for image captioning [82]
  • fully convolutional decoder [84]
Domain: Conversational AI
  • CoQA [119]
  • CCNet [45]
  • ConvFinQA (CFQA) [119]
  • DoQA [119]
  • Doc2Dial (D2D) [119]
  • Emotion-Specific Dialogue [184]
  • Gender-Specific Dialogue [184]
  • HybriDial (HDial) [119]
  • INSCIT [119]
  • LightQA [140]
  • LightWild [140]
  • MultiWOZ 2.1 [171]
  • OpenQA-NQ [140]
  • QReCC [119]
  • QuAC [119]
  • SQA [119]
  • Sentiment-Specific Dialogue [184]
  • TopiOCQA (TCQA) [119]
  • Weibo [168]
  • Wikipedia [45]
  • Wizard of Wikipedia [45,140]
  • Wizard of the Internet [45]
  • 100 Words [45]
  • First 256 tokens (Search Engine) [45]
  • Fixed-length text chunks (300 words) [119]
  • Same as RAG DPR (Token) [140]
  • Summarized memory slot [168]
  • Dense Knowledge Retrieval (DKR) [171]
  • Dense Retrieval [45,119,140,184]
  • Memory module [168]
  • Search Engine Retrieval (Bing Search API) [45]
  • Dragon (Dense) [119]
  • E5-unsupervised (Dense) [119]
  • GRU (Dense) [168]
  • Pre-trained DPR model from the KILT Benchmark (Dense) [45]
  • RoBERTa (Dense) [171]
  • Same as RAG DPR (Token) (Dense) [140]
  • BART-Large [45,140,171]
  • BlenderBot [45]
  • ChatQA-1.0 7B, 8B, 13B, 22B, 70B [119]
  • Command R+ (104 B) [119]
  • Fusion-in-Decoder [45]
  • GPT-2-based [184]
  • GPT-3.5-Turbo-0613 [119]
  • GPT-4-0613 [119]
  • GPT-4-Turbo-2024-04-09 [119]
  • GPT-SFT-22B [119]
  • GPT-SFT-8B [119]
  • Llama-2-13B-Chat [119]
  • Llama-2-70B-Chat [119]
  • Llama-2-7B-Chat [119]
  • Llama-3-70B-Instruct [119]
  • Llama-3-8B-Instruct [119]
  • Llama2-SFT-13B [119]
  • Llama2-SFT-70B [119]
  • Llama3-ChatQA-1.5-70B [119]
  • Llama3-ChatQA-1.5-8B [119]
  • Seq2Seq [168]
  • T5 [45,140]
Domain: Security/Vulnerabilities
  • Agent-Driver [74]
  • EHRAgent [74]
  • Enron Email corpus [204]
  • Harry Potter Series (Books3 subset) [205]
  • HotpotQA [204]
  • MS MARCO [177,204]
  • Natural Questions [177,204]
  • SQuAD [177]
  • StrategyQA [74]
  • WikiASP [177]
  • WikiQA question set [205]
  • Wikipedia [205]
  • Documents are split into fixed-length contiguous passages [204]
  • Fixed-size overlapping chunks [205]
  • ANCE (Dense) [74]
  • BGE (Dense) [74]
  • BM25 (Sparse) [205]
  • Contriever (Dense) [177,204]
  • Contriever-MS MARCO (Dense) [204]
  • DPR encoder (Dense) [204]
  • Dense Passage Retriever (Dense) [74]
  • JinaBERT (Dense) [177]
  • LLaMA Embedding (Dense) [177]
  • ORQA (Dense) [74]
  • Proprietary dense encoder inside NVIDIA Chat-with-RTX (Dense) [204]
  • REALM (Dense) [74]
  • text-embedding-ada-002 encoder (Dense) [74]
  • Claude-3-Opus [177]
  • Customized GPT instances [179]
  • GPT-3.5 [179]
  • GPT-3.5-turbo [74,204]
  • GPT-4 [177,179,204,205]
  • Gemma-2B [204]
  • Gemma-7B [204]
  • Llama-2-13B-Chat [205]
  • Llama-2-70B-Chat [205]
  • Llama-2-7B-Chat [205]
  • Llama-2-7b-chat-hf [177]
  • Llama-3-70B [74]
  • Llama-3-8B [74,204]
  • Llama-3-8B-instruct [205]
  • Mistral 7B-int4 [204]
  • Mistral-7B [179]
  • Mistral-7B-Instruct [205]
  • Mixtral-8×7B [205]
  • Platypus 2-Instruct-70B [205]
  • Qwen-1.5-72B-Chat [205]
  • SOLAR-10.7B [205]
  • Vicuna-13B [204,205]
  • Vicuna-7B [204]
  • WizardLM-13B [205]
Domain: Biomedical
  • ADInt [70]
  • Ade-corpus-v2 [70]
  • Alzheimer’s Knowledge Base (AlzKB) [148]
  • BioChatter Benchmark [209]
  • ChemProt [70]
  • DDI [70]
  • GIT [70]
  • GIT-RE [70]
  • MTsample [70]
  • Multiple Choice Questions Dataset [57]
  • PubMed Clinical Papers [78]
  • RAG Comparison Dataset [57]
  • True/False Dataset [57]
  • UMLS [70]
  • Extracts multiple contextual associations (chunks) from SPOKE via REST-API calls [57].
  • Nodes and Edges [148]
  • Split text into chunks of five consecutive tokens/words [70].
  • Dense Retrieval [57,70,78,209]
  • Dense Retrieval—Weaviate [148]
  • SPOKE’s REST-API [57]
  • Sparse Retrieval [148]
  • MedLLaMA-13B (Dense) [70]
  • MiniLM (Dense) [57]
  • OpenAI’s text-embedding-ada-002 (Dense) [78]
  • PubMedBert (Dense) [57]
  • all-MiniLM-L6-v2 (Dense) [57]
  • BioGPT [148]
  • GPT-3.5 [78]
  • GPT-3.5-turbo [57]
  • GPT-4 [57,78]
  • GPT-4 (in zero-shot settings) [70]
  • KRAGEN’s LLM [148]
  • Llama-2-13B [70]
  • Llama-2-13b [57]
  • Llama-3.1-8B [70]
  • MedLlama-13B [70]
  • Microsoft’s Prometheus [78]
  • OpenChat [148]
  • text-davinci-003 [78]
Domain: Education
  • MongoDB-Logs [59]
  • Lay Language Synthesis Corpus [141]
  • Data Mining and Text Analytics Course Materials Corpus [66]
  • GPT-Generated Answer Evaluation Corpus [59]
  • Lecture-Material [59]
  • Lumos-QG-Generated QA Dataset [66]
  • Math Nation Queries [154]
  • OpenStax Prealgebra Textbook [154]
  • MongoDB-QA [59]
  • TAM Questionnaire Response Set [59]
  • Unified Medical Language System [141]
  • Wikipedia [141]
  • Character-level text splitting of PDFs into fixed-length chunks [66]
  • LangChain TextSplitter [59]
  • Textbook by sub-section [154]
  • Dense Retrieval [141,154]
  • Dense Retrieval—LangChain [66]
  • Dense Retrieval—Weaviate [59]
  • MongoDB [59]
  • BERTBase (Dense) [141]
  • text-embedding-ada-002 (Dense) [59,66,154]
Domain: Information Extraction
  • ACE 2005 [156]
  • De-identified electronic health records [172]
  • RAMS [75]
  • WikiEvents [75,156]
  • Fixed-size chunks (600 characters) [172]
  • Sentence [75]
  • Structured data (like weight tables) is split row by row [172]
  • Adaptive Hybrid Retrieval (hybrid retrieval) [75]
  • Context-Consistency Retrieval (hybrid retrieval) [75]
  • Dense Retrieval [156,172]
  • Schema-Consistency Retrieval (hybrid retrieval) [75]
  • Sentence Transformer (Dense) [172]
  • SentenceBERT (Dense) [75,156]
  • BART-Large [156]
  • Llama-2-13B [172]
  • T5 [75]
Domain: Financial
  • AlphaFin-Test subset [90]
  • FinanceBench [125]
  • Financial News [90]
  • Financial Reports [90]
  • Financial Reports CoT [90]
  • Real-time Market Data [90]
  • Research datasets [90]
  • StockQA [90]
  • 128, 256, 512 tokens [125]
  • Coarse-grained: ChatGPT summary of each document [90].
  • Document structure extraction via the Chipper model [125].
  • Fine-grained: RefGPT generates multiple (question, answer) pairs per document [90].
  • Merging strategy [125]
  • Resulting chunks (summaries or Q-A pairs) are individually embedded for retrieval [90].
  • Two-level extraction prior to embedding [90]
  • BGE (Dense) [90]
  • SGPT (Dense) [90]
  • multi-qa-mpnet-base-dot-v1 (Dense) [125]
  • ChatGLM2-6B [90]
  • GPT-4 [125]
  • Mixtral-8×7B [125]
  • StockGPT [90]

Appendix C. Datasets Table (Extracted from SLR)

Table A3. Summary of datasets utilised in the studies included in this systematic literature review of RAG, outlining the key characteristics, origin, intended use, and citation frequency of each dataset across the reviewed articles.
Dataset Name | Content Description | Intended Use | Citation Frequency
Natural Questions (NQ) [190] | 323,045 QA examples across train/dev/test splits. | Train and evaluate open-domain QA systems. | 27
HotPotQA [196] | 113,000 multi-hop QA pairs. | Train/test QA with multi-hop reasoning and explanations. | 26
Wikipedia [1] | 6 million articles of text and metadata. | General corpus of Wikipedia text for NLP tasks. | 19
TriviaQA (TQA) [210] | 96,000 QA pairs with six supporting documents each. | Develop comprehension models requiring complex inference. | 18
2WikiMultihopQA (2WikiMQA) [197] | 192,606 multi-hop QA pairs from Wiki data. | Multi-hop QA using structured and unstructured sources. | 11
Multihop Questions via Single-hop Question Composition (MuSiQue) [211] | 25,000 2–4-hop questions (50,000 with contrast). | Multi-hop QA by composing single-hop questions. | 9
Fact Extraction and VERification (FEVER) [212] | 185,445 claims annotated with evidence. | Verify claims using Wikipedia as the textual source. | 8
Microsoft MAchine Reading COmprehension (MS MARCO) [191] | 100,000 questions and 1 M passages from web docs. | Reading comprehension and QA from real web data. | 8
StrategyQA [213] | 2780 yes/no questions with step-by-step reasoning. | Benchmark Boolean QA needing implicit multi-hop reasoning. | 8
Wizard of Wikipedia (WoW) [214] | 22,311 dialogues (202K utterances) using Wiki info. | Dialogue with a “wizard” answering via Wikipedia. | 8
WebQuestions (WebQ) [215] | 6642 QA pairs from real user web queries. | Semantic parsers using Freebase KG. | 7
Arc-Challenge [216] | 2590 science multiple-choice questions. | Benchmark deep-reasoning QA systems. | 5
Explain Like I’m Five (ELI5) [217] | 72k QA pairs with supporting web documents. | Long-form QA understandable by five-year-olds. | 5
Massive Multitask Language Understanding (MMLU) [218] | Multiple-choice questions spanning 57 tasks. | Benchmark broad knowledge and reasoning coverage. | 5
NarrativeQA [219] | 1572 narratives, 46,765 QA pairs. | QA over long narratives and summaries. | 5
PopQA [220] | 14,000 Wikipedia QA pairs across 16 relations. | QA focusing on Wikidata relationship types. | 5
WebQuestions Semantic Parses (WebQSP) [221] | SPARQL queries for 4737 questions, 1073 partial. | KB-QA research using Freebase semantic parses. | 5
Wikipedia English (December 2018) [222] | 21 Million passages from December 2018 English Wikipedia. | Passage corpus for retrieval and QA tasks. | 5
Answer Summaries for Questions which are Ambiguous (ASQA) [223] | 12,632 ambiguous QA annotations. | Long-form QA for ambiguous factoid questions. | 4
OpenbookQA (OBQA) [224] | 6k science MCQs with 1326 core facts. | Multi-hop science QA using core facts. | 4
Stanford Question Answering Dataset (SQuAD) [225] | 23k passages, 108k questions (span answers). | Reading comprehension with span answers. | 4
Triple-based Relation Extraction (TREx) [226] | 3.09 Million abstracts with 11 Million triples. | Relation extraction and KB population tasks. | 4
TruthfulQA [227] | 817 questions across 38 categories. | Evaluate factual consistency in QA. | 4
Zero Shot RE (zsRE) [228] | Over 30 M QA examples for relation extraction. | Zero-shot relation extraction without examples. | 4
Conversational Question Answering (CoQA) [229] | 127k questions from 8k multi-turn dialogues. | Build conversational QA systems. | 3
MultifieldQA-en (MFQA) [230] | 150 docs, 150 cases, 4.6k words each. | Single-document long-context QA. | 3
Physical Interaction: Question Answering (PIQA) [231] | 16,000 physical commonsense MCQs. | Reason about everyday physical tasks. | 3
PubMedQA [195] | PubMed abstracts QA (yes/no/maybe). | Biomedical QA benchmarking. | 3
Qasper (QASP) [232] | 416 papers, 371 cases, 4.7k tokens per doc. | Academic QA over research papers. | 3
Unified Medical Language System (UMLS) [233] | Integrated biomedical vocabularies. | Standardize medical terminologies. | 3
Wikipedia Aspect-based summarization (WikiAsp) [234] | 320,272 docs with section-title aspects. | Aspect-based summarization of Wikipedia articles. | 3
WikiQA [235] | 3047 questions with Wikipedia candidate sentences. | Evaluate answer-sentence selection in QA. | 3
Bamboogle [236] | 125 handcrafted 2-hop reasoning questions. | Evaluate compositional reasoning capabilities. | 2
BioASQ [237] | 4k+ PDFs and 1k domain-specific questions. | Biomedical retrieval and QA tasks. | 2
BoolQ [238] | 16,000 yes/no questions with passages. | Boolean question answering. | 2
C Code Summarization Dataset (CCSD) [51] | 95k function–summary pairs. | Source code summarization. | 2
CNN/Daily Mail [239] | News articles paired with human-written summaries. | Summarization and hallucination benchmarking. | 2
Code mixed-language GLUE (General Language Understanding Evaluation) (CodeXGLUE) [240] | Millions of code–NL pairs across tasks. | Code understanding and generation. | 2
CodeSearchNet (CSNet) [199] | 6 M functions, 2 M docstring pairs in six langs. | Semantic code search evaluation. | 2
Colossal Clean Crawled Corpus (C4) [241] | Billions of English tokens from the web. | Unsupervised pre-training for NLP models. | 2
Common Crawl dump of the internet (CCNet) [242] | 1.5 B documents, 532 B tokens across 174 langs. | Pre-training large-scale language models. | 2
Common Objects in Context (COCO) [198] | 330k images, 1.5 M captions. | Object recognition and image captioning. | 2
CommonsenseQA [243] | 12,247 MCQs from ConceptNet subgraphs. | Evaluate commonsense question answering. | 2
Conceptual Captions (CC) [244] | 3.3 M image–text pairs. | Pretrain vision-language models. | 2
Dolly [245] | 15k human-crafted instruction–response pairs. | Instruction-following model training. | 2
Enron Email [167] | 500k corporate emails for PII extraction tasks. | Evaluate PII detection and removal. | 2
ExplaGraphs [246] | 3166 belief-argument-explanation graphs. | Commonsense reasoning via explanation graphs. | 2
Flickr30k [247] | 30k images with five captions each. | Image captioning research. | 2
Google Search corpus (GSfull) [248] | 280k sentences from Google Search snippets. | Visual QA (OK-VQA) supporting data. | 2
HellaSwag [249] | 70k multiple-choice questions from ActivityNet/WikiHow. | Commonsense reasoning evaluation. | 2
Incomplete Information Reading Comprehension Questions (IIRC) [250] | 13,441 questions, 5698 paragraphs. | Challenging reading comprehension. | 2
LAION [251] | Billions of image–text pairs. | Train multi-modal language-vision models. | 2
MultimodalQA [252] | 30k questions, 58k images, text, tables. | Multi-modal QA requiring joint reasoning. | 2
Outside-Knowledge Visual Question Answering (OKVQA) [253] | 14k visual questions needing external knowledge. | Visual QA with outside knowledge. | 2
PubHealth [254] | True/false health-claim questions. | Health-claim verification. | 2
PubMed Clinical Papers [255] | Millions of biomedical abstracts. | Biomedical literature retrieval. | 2
QMSum [256] | Meeting transcripts with query-based summaries. | Query-focused dialogue summarization. | 2
RealNews [257] | 120 GB of news articles from Common Crawl. | News summarization benchmark. | 2
RealTimeQA [77] | Weekly news quizzes on politics, business, entertainment. | Evaluate QA on current events requiring retrieval. | 2
RepoEval [50] | Curated GitHub repos for code completion benchmarks. | Evaluate repository-level code completion. | 2
WikiData [258] | Structured knowledge graph for Wikipedia. | Knowledge base for various QA tasks. | 2
Wikipedia (December 2021) [259] | 37 M passages, 78-word average. | Updated Wikipedia text corpus. | 2
Wikipedia Event (WikiEvent) [260] | 246 docs, 6132 sentences, 3951 events. | Event extraction and coreference analysis. | 2
WikiText [261] | 103 M words (WikiText-103); 2 M words (WikiText-2). | Evaluate long-context language modeling. | 2
1000-User Benchmark Subset [166] | 1000 user-session sample with 493 queries avg. | Train and evaluate personalized query prediction. | 1
14 De-identified Clinical Scenarios [56] | 14 anonymized patient scenarios with structured data. | Evaluate clinical query handling. | 1
2019 TREC Deep Learning track (TREC DL19) [262] | 2019 deep-learning track for passage ranking. | Benchmark passage ranking in IR. | 1
2020 TREC Deep Learning track (TREC DL20) [263] | 2020 deep-learning track for passage ranking. | Benchmark passage ranking in IR. | 1
35 Preoperative Guidelines [56] | 35 guidelines on preoperative assessment and care. | RAG knowledge for pre-op instructions. | 1
ACE04 [264] | 300k words train, 50k words evaluation. | Entity/relation extraction. | 1
ActivityNet Captions [265] | 20,000 YouTube videos with 100k localized sentences. | Dense video event description modeling. | 1
ade-corpus-v2 [266] | Sentences labeled for adverse drug reactions. | Text classification focused on ADE detection in biomedical texts. | 1
Adversarial Benchmark (AdvBench) [267] | 520 harmful queries simulating jailbreak attacks. | Support defense against adversarial prompts. | 1
Adversarial NLI (ANLI) [268] | Adversarial inference examples. | Evaluate the inference and reasoning robustness of language models. | 1
Adverse Drug Effect (ADE) [266] | 2972 documents on adverse drug effects. | Train ADE extraction models. | 1
Agent-Driver [269] | 23,000 driving episodes with states, objects, reasoning chains, actions. | Retrieval-based memory for safe driving planning. | 1
Aggregated flood event listings from EMSR, GDACS, and ReliefWeb [186] | Curated list of major global flood disasters. | Provide event codes for UI. | 1
AGNews [270] | 496k news articles in four topics. | Topic classification in news. | 1
AI Tutor [139] | Course PDFs, HTML, and video transcripts. | Retrieve source-based answers for students. | 1
AIDA CoNLL-YAGO [271] | CoNLL03 news articles linked to YAGO entities. | Named entity disambiguation tasks. | 1
Alzheimer’s Disease Interventions (ADInt) [272] | Pharmaceutical intervention entries. | Advance AD intervention knowledge extraction. | 1
Alzheimer’s knowledge graph (AlzKB) [273] | Neo4j dump of genes, diseases, drugs with NL statements and embeddings. | Drive precise biomedical RAG for Alzheimer’s queries. | 1
Amazon Book Reviews [274] | Reviews with user, product IDs, ratings. | Analyze book recommendation and sentiment. | 1
Amazon Movie Reviews [275] | 42 M reviews, 10 M users, 3 M items. | Recommender-system and sentiment analysis. | 1
AmbigQA [276] | 14,042 ambiguous open-domain questions with rewrites. | Benchmark QA systems’ disambiguation ability. | 1
American Association for the Study of Liver Diseases (AASLD) [126] | 30 liver disease clinical practice guidelines. | Reference for hepatology QA tasks. | 1
Apnea-ECG Dataset (Sleep Apnoea Detection) [277] | 70 long ECG recordings with minute-wise apnea labels. | Detect sleep apnoea via ECG variability. | 1
Arc-Easy [216] | 5197 easy science multiple-choice questions. | Benchmark simple science QA. | 1
Australian Open Legal QA (ALQA) [192] | 232K legal docs, 69.5 M lines, 1.47 B tokens. | Legal AI research on Australian law. | 1
Automatic Content Extraction 2005 (ACE 2005) [264] | 625k annotated words in English, Arabic, Chinese. | Train entity, relation, event extraction. | 1
Avocado Research Email Collection [278] | Corporate email archive with threads and metadata. | Retrieval-augmented personalized email drafting. | 1
Bias Benchmark for Question Answering (BBQ) [279] | Multiple-choice QA testing nine social bias categories. | Diagnose representational harms in QA. | 1
BigPatent [280] | 1.34 M patent documents. | Abstractive text summarization. | 1
Bing Search Logs [281] | Three months of anonymized Bing queries and clicks. | Build search-history memory for query suggestion. | 1
BioChatter Continuous-Monitoring Benchmark Suite [209] | Growing suite of biomedical LLM workflow tasks. | Track performance over evolving system features. | 1
BioChatter Knowledge-Graph Query-Generation Benchmark [209] | QA pairs with correct BioCypher graph queries. | Evaluate LLM-to-KG query translation accuracy. | 1
Biography [282] | Long-form biographical narratives of various entities. | Test biographical text generation. | 1
Biomedical Instructions [158] | 18k generated biomedical and clinical instruction sets. | Fine-tune models on diverse biomedical tasks. | 1
Biomedical Multiple Choice Questions (MCQ) [283] | Biomedical MCQs with five answer options. | Evaluate biomedical multiple-choice QA. | 1
CaseHOLD [284] | 846K contract provisions with 12.6K refined labels. | Benchmark legal question-answering systems. | 1
Census/projection-disaggregated gridded population datasets [285] | 2020 global population grid disaggregated by census. | Quantify populations in flood zones. | 1
Chain-of-thought [286] | Explicit multi-step reasoning demonstrations. | Foster coherent stepwise reasoning. | 1
ChEBI-20 [287] | 33,010 molecule–caption pairs. | Chemical image captioning models. | 1
Chemical Protein Interaction Corpus (ChemProt) [288] | 2432 PubMed abstracts annotated with interactions. | Extract chemical–protein relationships and advance biomedical relation extraction algorithms. | 1
ClashEval Drug Dosage [180] | 249 QA pairs on drug dosages with perturbed contexts. | Benchmark precise dosage retrieval from text. | 1
ClashEval Locations [180] | 200 QA pairs asking for place names from entries. | Test place-name retrieval under context errors. | 1
ClashEval Names [180] | 200 QA pairs querying two-word proper names. | Benchmark proper-noun retrieval against noise. | 1
ClashEval News [180] | 238 numeric QA pairs from AP headline excerpts. | Assess numerical answer extraction under noise. | 1
ClashEval Sports Records [180] | 191 QA pairs on Olympic-record tables with perturbations. | Evaluate correct sports record retrieval. | 1
ClashEval Wikipedia Dates [180] | 200 QA pairs asking for four-digit years from text. | Test year retrieval robustness under corruption. | 1
Clinical Practice Guidelines [289] | Curated guideline articles from MEDITRON. | Support clinical decision-making tasks. | 1
Code Refinement Dataset (CRD) [290] | 2.3 M bug-fix function pairs. | Code repair and refinement. | 1
CodeMatcher [291] | 10.5 M Java methods paired with first doc sentence. | Retrieve exemplar code snippets for generation. | 1
codeparrot/github-jupyter [292] | 165k Jupyter notebooks with metadata. | Train code exemplar retrieval. | 1
Cognitive Reviewer [139] | Research PDFs analyzed and ranked for reviews. | Facilitate literature reviews via RAG. | 1
ConceptNet [293] | Multilingual commonsense KG with everyday concept triples. | Augment LLM QA with retrieved commonsense subgraphs. | 1
Conceptual 12 M (CC12M) [294] | 12 M image–text pairs from the web. | Pretrain vision-and-language models. | 1
Concode [295] | 100k train, 2k val/test of NL-to-Java examples. | Generate code from natural language. | 1
Conference on Natural Language Learning 2003 (CoNLL03) [296] | 301k English/German tokens for NER. | Named-entity recognition benchmark. | 1
Conference on Natural Language Learning 2004 (CoNLL04) [297] | 2k sentences for NER and SRL. | Joint NER and semantic-role labeling. | 1
Conversation QA (QAConv) [298] | 10,259 conversations; 34,608 QA pairs. | QA from informative multi-turn conversations. | 1
ConvFinQA (CFQA) [299] | Financial QA grounded in tables and text, requiring math. | Table comprehension and arithmetic in dialogues. | 1
Corpus for Enhancement of Lay Language Synthesis (CELLS) [141] | 62,886 abstract–lay summary pairs from biomedical journals. | Simplify scientific text. | 1
COVID-19 Open Research Dataset (CORD19) [194] | >140k articles on COVID-19, SARS, MERS (72k full-text). | COVID-19 literature retrieval and QA. | 1
COYO-700M (COYO) [300] | 747 M image–text pairs with metadata. | Support robust vision-language models. | 1
CREAK [301] | Human-authored true/false entity claims. | Fact-checking and commonsense reasoning. | 1
CrossCodeEval [302] | Multilingual code completion benchmarks in four langs. | Assess cross-language code completion generalization. | 1
CrossCodeLongEval [73] | 5k chunk + 5k function completions from 1500 repos. | Evaluate large-span code completion. | 1
CSQA2.0 [303] | Multiple-choice commonsense QA questions. | Evaluate advanced commonsense reasoning. | 1
Curated Golden Evaluation [60] | Standard queries with tickets and authoritative solutions. | Benchmark retrieval and answer accuracy. | 1
CuratedTrec (CT) [304] | 867 open-domain factoid questions. | Benchmark factoid QA systems. | 1
Current Events [206] | 910 multiple-choice questions from Aug–Nov 2023 U.S. news articles. | Test LLMs’ ability to learn new facts via fine-tuning/RAG. | 1
CXR-PRO [305] | 248,236 chest X-ray images with de-identified metadata. | Support thoracic disease detection models. | 1
CyberAttack Sensing and Information Extraction (CASIE) [306] | 1000 English news articles on cybersecurity events. | Extract cybersecurity event information. | 1
DailyDialog [307] | 13,118 daily-life multi-turn dialogues. | Develop human-like conversational agents. | 1
Data Mining and Text Analytics Course Materials Corpus [66] | 500 pages of course textbooks, transcripts, figures. | RAG-enabled Q&A and knowledge retrieval for the course. | 1
De-identified electronic health records [172] | 2278 malnutrition-related clinical notes. | Validate summarization and extraction. | 1
Defects for Java version 1.2 (Defects4J (v1.2)) [308] | 20,109 KLOC of Java code and tests with real bugs. | Evaluate automated bug repair models. | 1
DialogSum [169] | 13k multi-speaker dialogues with human summaries. | Evaluate conversational summarization. | 1
DigMinecraft [309] | Images and step-by-step task instructions. | Minecraft planning retrieval. | 1
Discrete Reasoning Over Paragraphs (DROP) [310] | 96k questions requiring numeric and logical reasoning. | Benchmark discrete reasoning in QA. | 1
Django [311] | NL descriptions and Django implementation code. | Evaluate NL-to-code generation on the Django framework. | 1
Doc2Dial (D2D) [312] | Document-grounded QA across four domains with long texts. | Benchmark passage retrieval in conversational QA. | 1
DomainRAG [313] | Multiple RAG sub-datasets (extractive, noisy, etc.). | Benchmark domain-specific retrieval-augmented generation. | 1
DoQA [314] | Conversational QA over cooking, travel, movie forums. | Domain-specific dialogue QA with unanswerables. | 1
Drug-Drug Interactions (DDI) [315] | 1025 texts from Medline and DrugBank. | Identify and classify drug interactions. | 1
Dynamed [316] | Clinically organized summaries on 3200+ topics. | Point-of-care clinical reference tool. | 1
EHRAgent [317] | Four exemplar EHR cases + 700 patient “experience” records. | Complex reasoning over EHR-based patient scenarios. | 1
Emotion-Specific Dialogue [318] | Chinese dialogues annotated for five emotions. | Train emotion-conditioned dialogue agents. | 1
En.MC [319] | 229 multiple-choice QAs on novel contexts. | Benchmark novel-based MCQA. | 1
En.QA [319] | 351 QAs on long novels (150k words context). | Test QA over very long texts. | 1
Encyclopedic-VQA [320] | 221k image QA pairs linked to 16.7k entities. | Knowledge-based visual question answering. | 1
EntityQuestion (EQ) [321] | 17,300 QA pairs on 24 relation types. | Assess entity-centric knowledge retrieval. | 1
European Association for the Study of the Liver Guidelines (EASL) [322] | HCV screening, diagnosis, and treatment guidelines. | Hepatology clinical decision support. | 1
Extreme Summarization (XSum) [323] | 226,711 news articles for single-sentence summaries. | Support abstractive summarization models. | 1
Facebook Books [324] | User–book interaction data. | Research book recommendation systems. | 1
Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) [325] | 87,026 claims with text and table evidence. | Automate claim verification using text/tables. | 1
FAct Verification from Information-seeking Questions (FaVIQAmbig) [326] | 188,000 true/false claims from info-seeking queries. | Generate and assess factual QA claims. | 1
FactKG [327] | Claims aligned with knowledge-graph triples. | Assess verification over structured KG. | 1
Factual Recall Questions [98] | 30 metadata-style queries (author, decision year, citation, etc.). | Assess factual recall accuracy in legal RAG. | 1
FACTUALITYPROMPTS [328] | Prompts targeting factual accuracy and entity hallucinations. | Evaluate factual consistency in generation. | 1
False Premise Questions [98] | 22 queries embedding legally incorrect assumptions. | Probe AI’s handling of counterfactual legal prompts. | 1
Fermi [329] | Estimation “Fermi problems.” | Reason about numeric magnitudes and estimates. | 1
Fifty-Four Question-Answer Pairs for Few-Shot Learning [183] | 54 hepatologist-crafted QA examples. | Evaluate few-shot learning in clinical scenarios. | 1
FinanceBench [330] | 80 docs, 141 finance QA questions. | Open-book financial QA. | 1
Financial News [90] | 79k Chinese news articles with ChatGPT summaries. | Improve summarization and market context knowledge. | 1
Financial Reports [90] | 120k equity research reports with same-day price data. | Teach LLMs technical analysis and trend prediction. | 1
Financial Reports CoT [90] | 200 CoT annotations on financial report predictions. | Teach rationale-rich stock movement predictions. | 1
FLAN [108] | Natural language instructions for zero-shot learning. | Boost zero-shot performance and generalization. | 1
FloodBrain ablation study dataset [186] | 26 paired human and FloodBrain flood reports. | Evaluate pipeline component impact. | 1
FloodBrain evaluation dataset [186] | 10 human vs. 10 FloodBrain-generated flood reports. | Compare generated vs. human summaries. | 1
FreebaseQA [331] | 28k trivia-style QAs mapped to Freebase entities. | KB-grounded question answering. | 1
FreshQA [332] | 600 questions with rapidly changing answers. | Test QA on dynamic answers needing external search. | 1
Gaokao-MM [333] | 646 MCQs across 8 subjects with 897 images. | Test multimodal perception and reasoning. | 1
Gender-Specific Dialogue [334] | Chinese dialogues labeled by speaker gender. | Model gendered linguistic features. | 1
General Legal Research [98] | 80 open-ended legal research questions (common-law, bar exams, doctrine). | Benchmark legal-AI retrieval for practicing attorneys. | 1
GIT [335] | Biomedical triple-extraction dataset for non-drug therapies. | Support biomedical relation extraction models. | 1
GIT Relation Extraction (GITRE) [335] | Sentences with head/tail entities and relations. | Predict relationships between biomedical entities. | 1
GPT-Generated Answer Evaluation Corpus [59] | 100 answers with TA and automated correctness labels. | Quantify model factual accuracy metrics. | 1
GraphQA [48] | Integrates ExplaGraphs, SceneGraphs, WebQSP into QA. | Graph-based QA benchmark. | 1
GSM-HARD [336] | GSM8K variant with larger numeric values. | Test arithmetic robustness. | 1
GSM8K [337] | 8.5k grade-school math word problems. | Benchmark multi-step math reasoning. | 1
HANS [338] | Heuristic-bias evaluation for NLI. | Test NLI heuristic vulnerability. | 1
Harry Potter Series (Books3 subset) [339] | Full text of seven books (1 M words). | Study model memorization and extraction from training. | 1
Harvard Law Case Corpus [340] | Extensive collection of Harvard Law case texts. | Pretrain/fine-tune legal language models. | 1
Harvard-FairVLMed [341] | Multimodal fundus images with associated textual data. | Fairness evaluation in ophthalmic vision-language. | 1
HealthcareMagic-101 [342] | 200k doctor–patient medical dialogues. | Model sensitive medical conversational contexts. | 1
Hearthstone [343] | Game-card logic code paired with card names. | Benchmark NL-to-code on game logic generation. | 1
Historical Issue Tickets [60] | Customer service tickets parsed into hierarchical trees. | Improve retrieval/QA over support tickets. | 1
Hospital Neurology Discharge Summaries [165] | 100 anonymized neurology discharge summaries. | Personalize advice and track recovery via memory. | 1
Human-Edited Counterfactuals Subset of IMDb [174] | 1.7K movie reviews manually sentiment-inverted. | Augment data via sentiment counterfactuals. | 1
Human-Generated Responses [56] | Free-text pre-op instructions by junior doctors. | Baseline pre-op instruction generation. | 1
HumanEval [344] | 164 Python programming tasks with unit tests. | Evaluate code generation correctness. | 1
HumanEval+ [345] | 164 tasks with 80× more test cases. | Robustness evaluation for code generation. | 1
HybriDialogue (HDial) [346] | QA on hybrid pages (text + tables) in conversation. | Mixed-modal conversational reasoning. | 1
IMDB (Internet Movie Database) [347] | Subsets of movie reviews and associated metadata. | Sentiment analysis and recommendation tasks. | 1
InferredBugs [129] | 6200 repos; 8280 bug-fix patches. | Support models on static-analysis bug fixes. | 1
Infineon Developer Community Forum Questions [348] | Technical Q&A with expert answers. | Benchmark chatbot against forum solutions. | 1
Infineon Product Documents [349] | Datasheets and product guides. | Retrieval for technical RAG systems. | 1
InfoSeek [350] | 1.3 M image-QA triplets for 11k entities. | Assess external knowledge integration in VQA. | 1
INSCIT [351]Under-specified Wikipedia QA requiring clarification.Test clarification question generation.1
IU-Xray [352]Chest X-rays paired with detailed diagnostic reports.Support medical image-reporting systems.1
Joint Research Centre Acquis (JRCAcquis) [353]8000 legal docs per language, 20+ EU languages.Multilingual legal parallel corpus.1
Jurisdiction or Time-Specific Research [98]70 questions on jurisdictional splits or overturned precedents.Test RAG on time-sensitive legal rule retrieval.1
Knowledge Intensive Language Tasks (KILT) [354]11 datasets for fact checking, QA, entity linking.Unified evaluation of knowledge-intensive tasks.1
Labeled EDGAR (LEDGAR) [193]846K contract provisions with 12.6K refined labels.Contract clause classification.1
Lambada [355]Cloze tasks requiring broad discourse context.Test long-range dependency in LMs.1
Language Model Personalization (LaMP) [356]Seven classification and generation tasks.Benchmark personalized model outputs.1
Lecture-Material [59]Lecture notes, slides, exercise sheets corpus.RAG retrieval for course-related queries.1
LegalBench Collection [357]50 manual legal QA pairs.Small-scale legal QA benchmarking.1
LightQA [140]QA from role-playing dialogues with final utterance.Evaluate factual QA in game dialogue contexts.1
LightWild [358]462K utterances across 41K RPG episodes.Support dialogue agents in fantasy settings.1
LiveQA [359]Real medical questions with long-form answers.Evaluate clinical long-answer generation.1
LLaVA-Instruct [102]158k image–instruction training pairs.Visual instruction tuning for MLLMs.1
Lumos-QG-Generated QA Dataset (9000 Pairs) [66]9000 auto-generated QA pairs from course materials.Expand knowledge base for Alexa skill and evaluation.1
lyft_2021 [360]Lyft 2021 document used for chunking benchmark queries.Benchmark document-chunking techniques.1
Massive Multi-discipline Multimodal Understanding (MMMU) [361]11.5k college-level multimodal exam questions.Expert-level multimodal reasoning evaluation.1
Math Nation Queries [154]51 factual/conceptual math questions from forum.Benchmark math QA from student discussions.1
MathVista [362]6141 math problems with diagrams, charts, plots.Evaluate multimodal math reasoning.1
Medical Transcription Samples (MTsample) [363]Transcriptions across 40+ clinical specialties.Research clinical text classification patterns.1
MedicationQA [364]Long-form QA focused on medication queries.Test medication-related answer accuracy.1
MedInstruct [365]Biomedical instructions: QA, summarization, MCQs.Fine-tune models on diverse clinical tasks.1
MedMCQA [366]Multiple-choice biomedical questionsBenchmark biomedical QA systems.1
MedQA [367]Multiple-choice medical exam questionsEvaluate medical QA models.1
MetaQA [368]400k questions covering single- and multi-hop reasoning.Test end-to-end KG QA systems.1
Microsoft COCO (MSCOCO) [369]328K images, 2.5 M labeled object instances.Scene understanding and object detection.1
Microsoft Research Paraphrase Corpus (MSRPC) [370]2.2k train, 550 val, 1.1k test paraphrase pairs.Evaluate paraphrase detection.1
Microsoft Research Video Description Corpus (MSVD) [371]1970 YouTube clips with 80k English descriptions.Benchmark video captioning models.1
Microsoft Research Video to Text (MSRVTT) [372]10,000 videos with 200k captions.Video captioning evaluation across domains.1
MIMIC-CXR [373]Large public CXR images with radiology reports.Develop chest X-ray interpretation models.1
Minecraft Wiki [374]Thousands of community-curated Minecraft articlesRetrieval for planning tasks1
Mintaka [375]Knowledge graph QA with complex, diverse questions.Knowledge graph QA benchmark.1
MMBench (MMB) [376]3k multiple-choice questions covering 20 abilities.Benchmark fine-grained multimodal capabilities.1
Mol-Instructions [377]Off-the-shelf biomedical instruction tasks.Instruction-tuning biomedical models.1
MongoDB-Logs (Chat & Cost) [59]Conversation logs and token-cost data.The logs underpin post hoc accuracy checks, cost calculations and support future optimisation of the chatbot service.1
MongoDB-QA (Question Answer Pairs) [59]170 validated course QA pairs.It is sampled by the QAGeneration-Chain to generate quick practice exercises for students.1
Mostly Basic Programming Problems (MBPP) [378]974 beginner Python problems with testsEvaluate beginner-level code models1
Mostly Basic Programming Problems+ (MBPP+) [345]MBPP tasks with added test casesEnhanced MBPP evaluation coverage1
MovieLens100K [379]100k movie ratings by various users.Benchmark recommendation algorithms.1
MS-CXR [380]1153 chest X-rays with paired radiology reports.Evaluation CXR interpretation and report models.1
Multi-Domain Wizard-of-Oz version 2.1 (MultiWOZ 2.1) [381]10,438 dialogs across seven domains with slots.Develop and benchmark multi-domain dialogue.1
Multi-Genre Natural Language Inference (MNLI) [382]433k sentence pairs labeled entailment/contradiction/neutrality.Evaluate natural language inference models.1
Multi-programming Language Commit Message (MCMD) [383]2.25 M commit messages across five programming languages.Evaluate semantic code search capabilities.1
Multi-Sentence Reading Comprehension (MultiRC) [384]800 paragraphs with 6000 multi-sentence questions.Evaluation comprehension over multi-sentence contexts.1
Multimodal Evaluation (MME) [385]14 tasks in cognition and perception categories.Standardized benchmark for multimodal LLMs.1
Natural Language to Bash (NL2Bash) [386]9000+ English descriptions paired with Bash commands.Translate natural language to shell commands.1
Natural Language to Command Line (NLC2CMD) [387]100 NL-to-command evaluation examples.Build NL-to-command translation systems.1
New York Times (NYT) [388]1.8 M articles published between 1987–2007.News summarization.1
NewsQA [389]119k QA pairs from 12.7k CNN news articles.Human-generated question-answer pairs developed from news articles from CNN1
NoCaps [390]15k images of novel objects without MSCOCO overlap.Evaluate novel-object captioning.1
North American HCV Guidelines [391]AASLD-IDSA supplemental HCV practice guidelines.Supplementary HCV clinical reference.1
Online Sources Nursing Knowledge JSON [165]Scraped nursing instructions and academic papers JSON.Supply RAG pipeline with clinical knowledge.1
OpenQA-NQ (subset of Natural Questions) [392]13 M evidence blocks from Wikipedia for QA retrieval.Open-retrieval question answering.1
OpenStax Prealgebra Textbook [393]Textbook sections on prealgebraThe content from the math textbook is used to generate responses to real student questions.1
OpenStreetMap Planet dump [394]Global vector map data: roads, buildings, POIs.Enrich flood maps with geographic data.1
Osaka Personal Activity Trajectory [164]2102 daily check-in trajectories, 537 synthetic samples.Evaluate mobility framework’s city generalization.1
ParaSCI-ACL [395]28,883 scientific paraphrase training examples.Scientific-domain paraphrase generation.1
Patient Inquiry Dataset [165]Timestamped patient questions during system testing.Evaluate conversational performance and short-term memory.1
Patient Symptom Record Dataset [165]Daily self-reported vital signs and symptom notes.Monitor condition changes and trigger alerts.1
PDFTriage (PDFT)Questions on PDF document structures.Benchmark document-structure QA tasks.1
PMC Full-text [396]Full-text articles from PubMed Central.Enable retrieval for biomedical question answering.1
Polling-based Object Probing Evaluation (POPE) [397]Binary yes/no questions from ground truth objects/negatives.Assess object hallucination in V-L models.1
Pre-training Corpus [398]330 B tokens from 15 high-quality sources.Pretrain RETRO and GPT language models.1
Probably-Asked Questions (PAQ) [399]65 M auto-generated QA pairsSemi-structured KB QA knowledge base.1
PTB-XL [400]21,837 12-lead ECG records with cardiologist annotations.Arrhythmia diagnosis and zero-shot eval.1
PTB-XL+ [401]Adds algorithm-extracted ECG features for each record.Detailed ECG feature analysis for diagnosis.1
PubMed Abstract [255]Corpus of PubMed abstracts.Provide domain evidence for QA retrieval.1
PwC Reading-Comprehension Corpus [402]241k passage-question-answer triples.Research on large-context compression.1
Python Code Summarization Dataset (PCSD) [403]150k function–docstring pairsCode summarization.1
PyTorrent [404]2 M Python methods from PyPI/Anaconda packages.Code exemplar retrieval for Python generation.1
QReCC [405]Open-domain conversational QA over web docs (avg 5K words).Zero-shot conversational retrieval and QA.1
QuAIL [406]15k multiple-choice questions across varied texts.Evaluate adaptive QA across question types.1
QuALITY [407]MCQs from stories/articles (multiple-choice).Narrative comprehension evaluation.1
QuaRTz [408]3864 MCQs on qualitative relationships.Semantic and linguistic reasoning in QA.1
Question Answering in Context (QuAC) [409]Multi-turn dialogues over Wikipedia with answerable turns.Conversational QA with linked long contexts.1
Quora Question Pairs 140K (QQP) [410]134k train, 5k val, 5k test paraphrase pairs.Paraphrase detection and generation.1
Quora Question Pairs 50K (QQP) [411]50k paraphrase question pairs.Paraphrase detection and generation.1
RACE [412]Exams-derived reading comprehension dataset.Benchmark multi-paragraph comprehension.1
RAG Comparison (Derived from the SPOKE KG) [57]Biomedical questions from SPOKE KG entity associations.Compare RAG: KG, Cypher, full-text methods.1
RAG-Fusion Query Set [131]Dynamically generated multi-query sets.Enhance retrieval via rank fusion.1
RAGTruth [413]18,000 LLM-generated responses with quality labels.Benchmark hallucination detection in RAG.1
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) [414]70k passages, 120k queriesCommonsense reading comprehension1
REALTOXICITYPROMPTS [415]Prompts engineered to elicit toxic language.Evaluate worst-case toxicity in outputs.1
Reddit Webis-TLDR-17 [416]Reddit posts paired with short summariesTest summarization with varied tones1
ReliefWeb flood reports [186]Human-authored situational flood event reports.Benchmark report factual accuracy.1
Research Dataset [90]42k finance texts merging sentiment, numeric, headline tasks.Pretrain/fine-tune LLMs on financial language.1
Retrieval-Augmented Generation Benchmark (RGB) [178]1000 English & Chinese QAEvaluate retrieval-augmented generation1
RiddleSense [417]5000 riddles with answer options requiring creative reasoning.Challenge models on linguistic creativity and commonsense.1
Roles Across Multiple Sentences (RAMS) [418]3993 docs, 9124 event annotationsMulti-sentence semantic role labeling1
RTLLM [419]RTL generation benchmark tasks.Evaluate LLM-based RTL design generation.1
SamSum [420]16k messenger-style dialogues with abstractive summaries.Train dialogue summarization systems.1
SBU Captions (SBU) [421]1 M Flickr-based image–caption pairs.Large-scale image captioning research.1
SceneGraphs (from GQA) [422]100k scene graphs of images for visual reasoning.Support spatial and visual inference tasks.1
Scoliosis Research Society (SRS) [423]Educational, research, patient resourcesSupport spinal deformity care1
SearchQA [424]140k QA pairs, 6.9 M snippetsQA simulating real web search1
Self-Instruct [425]LM-generated instruction examplesSupport models on diverse self-generated directives1
Sentiment-Specific Dialogue [184]English dialogues labeled by sentiment.Generate sentiment-controlled responses.1
ServiceNow Internal Data [145]Annotated queries with structured workflow JSON.Translate NL requests into workflows.1
SocialIQA (SIQA) [426]38,000 social-context multiple-choice QA pairs.Test commonsense reasoning in social contexts.1
SODA [427]High-quality social dialogue examplesEnhance conversational fine-tuning1
SQA [428]Conversational QA over single Wikipedia tablesCompositional multi-column table QA.1
SQuAD v2 [225]150k QAs plus 50k unanswerable questions on Wikipedia.QA with answer/no-answer classification.1
Stanford Sentiment Treebank (SST2) [429]215k phrases labeled for fine-grained sentiment.Benchmark sentiment classification1
StockQA [90]21k Chinese QA pairs from real stock-price sequences.Train time-series reasoning for investor queries.1
TACRED [430]Adapted TACRED for zero/few-shot slot filling (41 types).Benchmark relation extraction and slot filling.1
TAM Questionnaire Response Set [59]30 students’ Likert-scale survey responses.Evaluate user acceptance via factor/regression.1
TFix [431]100k code error-fix pairsEvaluate code repair models1
The human cost of disasters (2000–2019) [432]Global disaster human-impact records 2000–2019.Analyze flood impacts for planning.1
The Pile [339]825 GiB text from 22 sourcesPretrain diverse language models1
The Stack [433]3 TB public source code from GitHub.Pretrain and fine-tune code language models.1
Tokyo Personal Activity Trajectory [164]100 users’ time-ordered GPS check-ins (2019–2022).Model realistic human mobility patterns.1
ToolQA [434]Personal-agenda questions assessing external tool use.Measure LLM integration of external tools in QA.1
TopiOCQA (TCQA) [435]QA over full Wikipedia with topic shifts.Evaluate topic-transition conversational QA.1
TREC-COVID [436]Dynamic COVID-19 docs with topics and relevance labels.Pandemic literature retrieval evaluation.1
True/False dataset [57]True/false statements on gene-disease and drug-disease.Benchmark biomedical assertion verification.1
UltraDomain—Agriculture [437]2,017,886 tokens from 12 college-agriculture textsEvaluate RAG’s sense-making in agriculture domain1
UltraDomain—CS [437]2,306,535 tokens from 10 computer-science textsTest RAG on technical computer-science content1
UltraDomain—Legal [437]5,081,069 tokens from 94 legal textbook documentsBenchmark RAG on complex legal language and reasoning1
UltraDomain—Mixed [437]619,009 tokens across 61 humanities textsChallenge RAG with heterogeneous humanities content1
Unnatural Instructions [438]Minimally human-curated challenging instructionsAugment instruction tuning diversity1
UpToDateClinical decision support content by Wolters Kluwer.Point-of-care medical reference.1
VATEX [439]25,991 train, 9k val/test English video captions.Multilingual and multi-modal captioning.1
VerilogEval [440]Verilog code generation tasks.Assess LLM Verilog functional correctness.1
VerilogEval-syntax [147]200+ clustered Verilog syntax error examples.Test syntax-error correction in Verilog.1
Visual Question Answering (VQA) [441]254,721 images with 760k questions and 10 M answers.Visual QA tasks combining vision and language.1
W3C-EmailEmails similar to GPT-Neo’s training distributionStudy retrieval-augmented memorization effects1
Web Search--1
WebQA [442]34,200 train, 5000 val, 7500 test QA pairs; 390k images.Multimodal web-based QA benchmarking.1
Weibo [443]4.4 M post-response pairs from Sina Weibo.Support short-text conversation models.1
WikiPassageQA [444]4165 QA with long answer passages.Reading comprehension with long answers.1
Wikipedia (October 2017) [196]Snapshot of English Wikipedia articles.Historic Wikipedia text for NLP.1
Wikipedia Evaluation (WikiEval) [445]50 Wikipedia pages covering diverse topics.Evaluate retrieval-augmented systems.1
Wikipedia Passages [446]6 M+ articles, 3.8 B words across languages (as of 2021).Large-scale text corpus for NLP.1
WinoGrande [447]Pronoun-resolution tasks in complex contexts.Assess coreference resolution capability.1
WitQA [448]14k factual QA pairs on 32 relation typesEvaluate factual QA across relations1
Wizard of the Internet (WizInt) [45]9633 dialogues, 93,665 utterances, 29,500 URLs.Dialogue with live internet search.1
WNED [449]320 documents with 6821 linkable mentions.Evaluate entity linking systems.1
Word-in-Context (WiC) [450]Word-in-context disambiguation pairs.Evaluate word sense disambiguation.1
Worker and AI Collaboration for Natural Language Inference (WaNLI) [451]107,885 NLI examples combining human and GPT-3 data.Natural language inference with AI mix.1
Yelp Reviews [452]1.1 M+ reviews, 42k businesses, 400k tips, check-ins.Recommendation and sentiment analysis.1
Yelp. 2021 [453]Business attributes and reviews with detailed schema.Data-to-text generation and hallucination tests.1
ZINC-15 [454]1.54 B filtered SMILES stringsVirtual screening compound datasets1

References

  1. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Kuttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401. [Google Scholar]
  2. Li, H.; Su, Y.; Cai, D.; Wang, Y.; Liu, L. A Survey on Retrieval-Augmented Text Generation. arXiv 2022, arXiv:2202.01110. [Google Scholar] [CrossRef]
  3. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997. [Google Scholar]
  4. Gupta, S.; Ranjan, R.; Narayan Singh, S. A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. arXiv 2024, arXiv:2410.12837. [Google Scholar] [CrossRef]
  5. Wu, S.; Xiong, Y.; Cui, Y.; Wu, H.; Chen, C.; Yuan, Y.; Huang, L.; Liu, X.; Kuo, T.W.; Guan, N.; et al. Retrieval-Augmented Generation for Natural Language Processing: A Survey. arXiv 2024, arXiv:2407.13193. [Google Scholar] [CrossRef]
  6. Arslan, M.; Ghanem, H.; Munawar, S.; Cruz, C. A Survey on RAG with LLMs. Procedia Comput. Sci. 2024, 246, 3781–3790. [Google Scholar] [CrossRef]
  7. Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.S.; Li, Q. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024. [Google Scholar] [CrossRef]
  8. Cheng, M.; Luo, Y.; Ouyang, J.; Liu, Q.; Liu, H.; Li, L.; Yu, S.; Zhang, B.; Cao, J.; Ma, J.; et al. A Survey on Knowledge-Oriented Retrieval-Augmented Generation. arXiv 2025, arXiv:2503.10677. [Google Scholar] [CrossRef]
  9. Arslan, M.; Munawar, S.; Cruz, C. Business insights using RAG–LLMs: A review and case study. J. Decis. Syst. 2024, 1–30. [Google Scholar] [CrossRef]
  10. Hindi, M.; Mohammed, L.; Maaz, O.; Alwarafy, A. Enhancing the Precision and Interpretability of Retrieval-Augmented Generation (RAG) in Legal Technology: A Survey. IEEE Access 2025, 13, 46171–46189. [Google Scholar] [CrossRef]
  11. Huang, Y.; Huang, J. A Survey on Retrieval-Augmented Text Generation for Large Language Models. arXiv 2024, arXiv:2404.10981. [Google Scholar] [CrossRef]
  12. Zhao, S.; Yang, Y.; Wang, Z.; He, Z.; Qiu, L.K.; Qiu, L. Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely. arXiv 2024, arXiv:2409.14924. [Google Scholar] [CrossRef]
  13. Verma, S. Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2409.13385. [Google Scholar] [CrossRef]
  14. Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Geng, Y.; Fu, F.; Yang, L.; Zhang, W.; Jiang, J.; Cui, B. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv 2024, arXiv:2402.19473. [Google Scholar] [CrossRef]
  15. Singh, A.; Ehtesham, A.; Kumar, S.; Talaei Khoei, T. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv 2025, arXiv:2501.09136. [Google Scholar] [CrossRef]
  16. Peng, B.; Zhu, Y.; Liu, Y.; Bo, X.; Shi, H.; Hong, C.; Zhang, Y.; Tang, S. Graph Retrieval-Augmented Generation: A Survey. arXiv 2024, arXiv:2408.08921. [Google Scholar] [CrossRef]
  17. Procko, T.T.; Ochoa, O. Graph Retrieval-Augmented Generation for Large Language Models: A Survey. In Proceedings of the 2024 Conference on AI, Science, Engineering, and Technology (AIxSET), Laguna Hills, CA, USA, 30 September–2 October 2024; pp. 166–169. [Google Scholar] [CrossRef]
  18. Zhang, Q.; Chen, S.; Bei, Y.; Yuan, Z.; Zhou, H.; Hong, Z.; Dong, J.; Chen, H.; Chang, Y.; Huang, X. A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. arXiv 2025, arXiv:2501.13958. [Google Scholar] [CrossRef]
  19. Mahdi Abootorabi, M.; Zobeiri, A.; Dehghani, M.; Mohammadkhani, M.; Mohammadi, B.; Ghahroodi, O.; Soleymani Baghshah, M.; Asgari, E. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. arXiv 2025, arXiv:2502.08826. [Google Scholar] [CrossRef]
  20. Zheng, X.; Weng, Z.; Lyu, Y.; Jiang, L.; Xue, H.; Ren, B.; Paudel, D.; Sebe, N.; Van Gool, L.; Hu, X. Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook. arXiv 2025, arXiv:2503.18016. [Google Scholar] [CrossRef]
  21. Simon, K.; Oğuz, C.; Leonid, K.; Muhammad, A.; Saara, A.; Selvine, M.; Daniel, G. Benchmarking of Retrieval Augmented Generation: A Comprehensive Systematic Literature Review on Evaluation Dimensions, Evaluation Metrics and Datasets. In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Porto, Portugal, 17–19 November 2024. [Google Scholar] [CrossRef]
  22. Yu, H.; Gan, A.; Zhang, K.; Tong, S.; Liu, Q.; Liu, Z. Evaluation of Retrieval-Augmented Generation: A Survey. arXiv 2024, arXiv:2405.07437. [Google Scholar] [CrossRef]
  23. Zhou, Y.; Liu, Y.; Li, X.; Jin, J.; Qian, H.; Liu, Z.; Li, C.; Dou, Z.; Ho, T.Y.; Yu, P.S. Trustworthiness in Retrieval-Augmented Generation Systems: A Survey. arXiv 2024, arXiv:2409.10102. [Google Scholar] [CrossRef]
  24. Ni, B.; Liu, Z.; Wang, L.; Lei, Y.; Zhao, Y.; Cheng, X.; Zeng, Q.; Dong, L.; Xia, Y.; Kenthapadi, K.; et al. Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey. arXiv 2025, arXiv:2502.06872. [Google Scholar] [CrossRef]
  25. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Syst. Rev. 2021, 10, 89. [Google Scholar] [CrossRef]
  26. Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; Keele University: Keele, UK, 2007; Volume 2. [Google Scholar]
  27. Sidiropoulos, G.; Kanoulas, E. Analysing the Robustness of Dual Encoders for Dense Retrieval Against Misspellings. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022. [Google Scholar] [CrossRef]
  28. Kuratov, Y.; Bulatov, A.; Anokhin, P.; Rodkin, I.; Sorokin, D.; Sorokin, A.; Burtsev, M. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. arXiv 2024, arXiv:2406.10149. [Google Scholar] [CrossRef]
  29. Alaofi, M.; Arabzadeh, N.; Clarke, C.L.A.; Sanderson, M. Generative Information Retrieval Evaluation. arXiv 2024, arXiv:2404.08137. [Google Scholar] [CrossRef]
  30. Kumar, Y.; Marttinen, P. Improving Medical Multi-modal Contrastive Learning with Expert Annotations. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Milan, Italy, 2024; pp. 468–486. [Google Scholar]
  31. Wang, M.; Chen, L.; Cheng, F.; Liao, S.; Zhang, X.; Wu, B.; Yu, H.; Xu, N.; Zhang, L.; Luo, R.; et al. Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 5627–5646. [Google Scholar] [CrossRef]
  32. Wu, J.; Zhu, J.; Qi, Y.; Chen, J.; Xu, M.; Menolascina, F.; Grau, V. Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. arXiv 2024, arXiv:2408.04187. [Google Scholar] [CrossRef]
  33. Zheng, L.; Yin, L.; Xie, Z.; Sun, C.; Huang, J.; Hao Yu, C.; Cao, S.; Kozyrakis, C.; Stoica, I.; Gonzalez, J.E.; et al. SGLang: Efficient Execution of Structured Language Model Programs. arXiv 2023, arXiv:2312.07104. [Google Scholar] [CrossRef]
  34. Arora, N.; Chakraborty, I.; Nishimura, Y. AI–Human Hybrids for Marketing Research: Leveraging Large Language Models (LLMs) as Collaborators. J. Mark. 2025, 89, 43–70. [Google Scholar] [CrossRef]
  35. Luu, R.K.; Buehler, M.J. BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials. Adv. Sci. 2024, 11, 2306724. [Google Scholar] [CrossRef]
  36. Zhang, B.; Soh, H. Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 9820–9836. [Google Scholar] [CrossRef]
  37. Liu, S.; Cheng, H.; Liu, H.; Zhang, H.; Li, F.; Ren, T.; Zou, X.; Yang, J.; Su, H.; Zhu, J.; et al. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. arXiv 2023, arXiv:2311.05437. [Google Scholar] [CrossRef]
  38. Gebreab, S.A.; Salah, K.; Jayaraman, R.; Rehman, M.H.u.; Ellaham, S. LLM-Based Framework for Administrative Task Automation in Healthcare. In Proceedings of the 2024 12th International Symposium on Digital Forensics and Security (ISDFS), San Antonio, TX, USA, 29–30 April 2024; pp. 1–7. [Google Scholar] [CrossRef]
  39. Loukas, L.; Stogiannidis, I.; Diamantopoulos, O.; Malakasiotis, P.; Vassos, S. Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking. In Proceedings of the Fourth ACM International Conference on AI in Finance, Brooklyn, NY, USA, 27–29 November 2023. [Google Scholar] [CrossRef]
  40. Buehler, M.J. MechGPT, a Language-Based Strategy for Mechanics and Materials Modeling That Connects Knowledge Across Scales, Disciplines, and Modalities. Appl. Mech. Rev. 2024, 76, 021001. [Google Scholar] [CrossRef]
  41. Chen, J.; Zhang, R.; Guo, J.; de Rijke, M.; Chen, W.; Fan, Y.; Cheng, X. Continual Learning for Generative Retrieval over Dynamic Corpora. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023. [Google Scholar] [CrossRef]
  42. Mao, Y.; He, P.; Liu, X.; Shen, Y.; Gao, J.; Han, J.; Chen, W. Generation-augmented retrieval for open-domain question answering. In Proceedings of the ACL-IJCNLP 2021—59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Bangkok, Thailand, 1–6 August 2021; pp. 4089–4100. [Google Scholar]
  43. Ram, O.; Levine, Y.; Dalmedigos, I.; Muhlgay, D.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. In-Context Retrieval-Augmented Language Models. Trans. Assoc. Comput. Linguist. 2023, 11, 1316–1331. [Google Scholar] [CrossRef]
  44. Xu, P.; Ping, W.; Wu, X.; McAfee, L.; Zhu, C.; Liu, Z.; Subramanian, S.; Bakhturina, E.; Shoeybi, M.; Catanzaro, B. Retrieval meets Long Context Large Language Models. arXiv 2023, arXiv:2310.03025. [Google Scholar] [CrossRef]
  45. Komeili, M.; Shuster, K.; Weston, J. Internet-Augmented Dialogue Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 8460–8478. [Google Scholar]
  46. Yan, S.Q.; Gu, J.C.; Zhu, Y.; Ling, Z.H. Corrective Retrieval Augmented Generation. arXiv 2024, arXiv:2401.15884. [Google Scholar] [CrossRef]
  47. Wang, Y.; Lipka, N.; Rossi, R.A.; Siu, A.; Zhang, R.; Derr, T. Knowledge Graph Prompting for Multi-Document Question Answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M., Dy, J., Natarajan, S., Eds.; Volume 38, pp. 19206–19214. [Google Scholar] [CrossRef]
  48. He, X.; Tian, Y.; Sun, Y.; Chawla, N.V.; Laurent, T.; LeCun, Y.; Bresson, X.; Hooi, B. G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering. arXiv 2024, arXiv:2402.07630. [Google Scholar] [CrossRef]
  49. Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; Chen, W. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. arXiv 2023, arXiv:2305.15294. [Google Scholar] [CrossRef]
  50. Zhang, F.; Chen, B.; Zhang, Y.; Keung, J.; Liu, J.; Zan, D.; Mao, Y.; Lou, J.G.; Chen, W. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the EMNLP 2023—2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics (ACL): Singapore, 2023; pp. 2471–2484. [Google Scholar] [CrossRef]
  51. Liu, S.; Chen, Y.; Xie, X.; Siow, J.; Liu, Y. Retrieval-augmented generation for code summarization via hybrid GNN. In Proceedings of the ICLR 2021—9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  52. Yasunaga, M.; Aghajanyan, A.; Shi, W.; James, R.; Leskovec, J.; Liang, P.; Lewis, M.; Zettlemoyer, L.; Yih, W.T. Retrieval-Augmented Multimodal Language Modeling. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 39755–39769. [Google Scholar]
  53. Gui, L.; Wang, B.; Huang, Q.; Hauptmann, A.; Bisk, Y.; Gao, J. KAT: A Knowledge Augmented Transformer for Vision-and-Language. In Proceedings of the NAACL 2022—2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics (ACL): Seattle, WA, USA, 2022; pp. 956–968. [Google Scholar]
  54. Glass, M.; Rossiello, G.; Chowdhury, M.F.M.; Gliozzo, A. Robust Retrieval Augmented Generation for Zero-shot Slot Filling. In Proceedings of the EMNLP 2021—2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 1939–1949. [Google Scholar]
  55. Sachan, D.S.; Reddy, S.; Hamilton, W.; Dyer, C.; Yogatama, D. End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 25968–25981. [Google Scholar]
  56. Ke, Y.; Jin, L.; Elangovan, K.; Rizal Abdullah, H.; Liu, N.; Sia, A.T.H.; Soh, C.R.; Tung, J.Y.M.; Ong, J.C.L.; Ting, D.S.W. Development and Testing of Retrieval Augmented Generation in Large Language Models—A Case Study Report. arXiv 2024, arXiv:2402.01733. [Google Scholar] [CrossRef]
  57. Soman, K.; Rose, P.W.; Morris, J.H.; Akbas, R.E.; Smith, B.; Peetoom, B.; Villouta-Reyes, C.; Cerono, G.; Shi, Y.; Rizk-Jackson, A.; et al. Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics 2024, 40, btae560. [Google Scholar] [CrossRef]
  58. Chen, W.; Hu, H.; Chen, X.; Verga, P.; Cohen, W.W. MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. arXiv 2022, arXiv:2210.02928. [Google Scholar] [CrossRef]
  59. Neumann, A.T.; Yin, Y.; Sowe, S.; Decker, S.; Jarke, M. An LLM-Driven Chatbot in Higher Education for Databases and Information Systems. IEEE Trans. Educ. 2025, 68, 103–116. [Google Scholar] [CrossRef]
  60. Xu, Z.; Jerome Cruz, M.; Guevara, M.; Wang, T.; Deshpande, M.; Wang, X.; Li, Z. Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering. arXiv 2024, arXiv:2404.17723. [Google Scholar] [CrossRef]
  61. Hoshi, Y.; Miyashita, D.; Ng, Y.; Tatsuno, K.; Morioka, Y.; Torii, O.; Deguchi, J. RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models. In Proceedings of the EMNLP 2023—2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Singapore, 6–10 December 2023; Feng, Y., Lefever, E., Eds.; Association for Computational Linguistics (ACL): Singapore, 2023; pp. 52–69. [Google Scholar]
  62. Jiang, W.; Zhang, S.; Han, B.; Wang, J.; Wang, B.; Kraska, T. PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design. arXiv 2024, arXiv:2403.05676. [Google Scholar] [CrossRef]
  63. Caffagni, D.; Cocchi, F.; Moratelli, N.; Sarto, S.; Cornia, M.; Baraldi, L.; Cucchiara, R. Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; pp. 1818–1826. [Google Scholar] [CrossRef]
  64. Guo, Y.; Li, Z.; Jin, X.; Liu, Y.; Zeng, Y.; Liu, W.; Li, X.; Yang, P.; Bai, L.; Guo, J.; et al. Retrieval-Augmented Code Generation for Universal Information Extraction. arXiv 2023, arXiv:2311.02962. [Google Scholar] [CrossRef]
  65. Xiong, G.; Jin, Q.; Lu, Z.; Zhang, A. Benchmarking Retrieval-Augmented Generation for Medicine. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 6233–6251. [Google Scholar] [CrossRef]
  66. Alsafari, B.; Atwell, E.; Walker, A.; Callaghan, M. Towards effective teaching assistants: From intent-based chatbots to LLM-powered teaching assistants. Nat. Lang. Process. J. 2024, 8, 100101. [Google Scholar] [CrossRef]
  67. Yu, C.; Yang, G.; Chen, X.; Liu, K.; Zhou, Y. BashExplainer: Retrieval-Augmented Bash Code Comment Generation Based on Fine-Tuned CodeBERT. In Proceedings of the 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), Limassol, Cyprus, 3–7 October 2022; pp. 82–93. [Google Scholar]
  68. Guo, T.; Yang, Q.; Wang, C.; Liu, Y.; Li, P.; Tang, J.; Li, D.; Wen, Y. KnowledgeNavigator: Leveraging large language models for enhanced reasoning over knowledge graph. Complex Intell. Syst. 2024, 10, 7063–7076. [Google Scholar] [CrossRef]
  69. Wiratunga, N.; Abeyratne, R.; Jayawardena, L.; Martin, K.; Massie, S.; Nkisi-Orji, I.; Weerasinghe, R.; Liret, A.; Fleisch, B. CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question Answering. In Case-Based Reasoning Research and Development; Springer Nature: Cham, Switzerland, 2024; pp. 445–460. [Google Scholar]
  70. Li, M.; Kilicoglu, H.; Xu, H.; Zhang, R. BiomedRAG: A retrieval augmented large language model for biomedicine. J. Biomed. Inform. 2025, 162, 104769. [Google Scholar] [CrossRef]
  71. Zhang, R.; Du, H.; Liu, Y.; Niyato, D.; Kang, J.; Sun, S.; Shen, X.; Poor, H.V. Interactive AI with Retrieval-Augmented Generation for Next Generation Networking. IEEE Netw. 2024, 38, 414–424. [Google Scholar] [CrossRef]
  72. Guo, Z.; Xia, L.; Yu, Y.; Ao, T.; Huang, C. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv 2024, arXiv:2410.05779. [Google Scholar] [CrossRef]
  73. Wu, D.; Ahmad, W.U.; Zhang, D.; Krishna Ramanathan, M.; Ma, X. Repoformer: Selective Retrieval for Repository-Level Code Completion. arXiv 2024, arXiv:2403.10059. [Google Scholar] [CrossRef]
  74. Chen, Z.; Xiang, Z.; Xiao, C.; Song, D.; Li, B. AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. arXiv 2024, arXiv:2407.12784. [Google Scholar] [CrossRef]
  75. Ren, Y.; Cao, Y.; Guo, P.; Fang, F.; Ma, W.; Lin, Z. Retrieve-and-Sample: Document-level Event Argument Extraction via Hybrid Retrieval Augmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 293–306. [Google Scholar] [CrossRef]
  76. Chowdhury, J.R.; Zhuang, Y.; Wang, S. Novelty Controlled Paraphrase Generation with Retrieval Augmented Conditional Prompt Tuning. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022, Virtual, 22 February–1 March 2022; Volume 36, pp. 10535–10544. [Google Scholar]
  77. Zhang, Z.; Fang, M.; Chen, L. RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering. arXiv 2024, arXiv:2402.16457. [Google Scholar] [CrossRef]
  78. Soong, D.; Sridhar, S.; Si, H.; Wagner, J.S.; Sá, A.C.C.; Yu, C.Y.; Karagoz, K.; Guan, M.; Kumar, S.; Hamadeh, H.; et al. Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model. PLoS Digit Health 2024, 3, e0000568. [Google Scholar] [CrossRef] [PubMed]
  79. Jin, C.; Zhang, Z.; Jiang, X.; Liu, F.; Liu, X.; Liu, X.; Jin, X. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. arXiv 2024, arXiv:2404.12457. [Google Scholar] [CrossRef]
  80. Wang, W.; Wang, Y.; Joty, S.; Hoi, S.C. RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 3–9 December 2023; pp. 146–158. [Google Scholar]
  81. Sawarkar, K.; Mangal, A.; Solanki, S.R. Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers. In Proceedings of the 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 7–9 August 2024; pp. 155–161. [Google Scholar] [CrossRef]
  82. Ramos, R.; Elliott, D.; Martins, B. Retrieval-augmented Image Captioning. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; Association for Computational Linguistics: Dubrovnik, Croatia, 2023; pp. 3666–3681. [Google Scholar] [CrossRef]
  83. Yang, Z.; Ping, W.; Liu, Z.; Korthikanti, V.; Nie, W.; Huang, D.A.; Fan, L.; Yu, Z.; Lan, S.; Li, B.; et al. Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 11844–11857. [Google Scholar]
  84. Chen, J.; Pan, Y.; Li, Y.; Yao, T.; Chao, H.; Mei, T. Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–24. [Google Scholar] [CrossRef]
  85. Tian, Y.; Song, H.; Wang, Z.; Wang, H.; Hu, Z.; Wang, F.; Chawla, N.V.; Xu, P. Graph Neural Prompting with Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M., Dy, J., Natarajan, S., Eds.; Association for the Advancement of Artificial Intelligence: Vancouver, BC, Canada, 2024; Volume 38, pp. 19080–19088. [Google Scholar] [CrossRef]
  86. Lin, W.; Byrne, B. Retrieval Augmented Visual Question Answering with Outside Knowledge. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 11238–11254. [Google Scholar]
  87. Hofstätter, S.; Chen, J.; Raman, K.; Zamani, H. FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023. [Google Scholar] [CrossRef]
  88. Feng, Z.; Feng, X.; Zhao, D.; Yang, M.; Qin, B. Retrieval-generation synergy augmented large language models. arXiv 2023, arXiv:2310.05149. [Google Scholar] [CrossRef]
  89. Jeong, C. A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture. arXiv 2023, arXiv:2309.01105. [Google Scholar] [CrossRef]
  90. Li, X.; Li, Z.; Shi, C.; Xu, Y.; Du, Q.; Tan, M.; Huang, J. AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 773–783. [Google Scholar]
  91. Xia, P.; Zhu, K.; Li, H.; Zhu, H.; Li, Y.; Li, G.; Zhang, L.; Yao, H. RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 1081–1093. [Google Scholar] [CrossRef]
  92. Sarto, S.; Cornia, M.; Baraldi, L.; Cucchiara, R. Retrieval-Augmented Transformer for Image Captioning. In Proceedings of the 19th International Conference on Content-Based Multimedia Indexing, Graz, Austria, 14–16 September 2022. [Google Scholar] [CrossRef]
  93. Siriwardhana, S.; Weerasekera, R.; Wen, E.; Kaluarachchi, T.; Rana, R.; Nanayakkara, S. Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering. arXiv 2022, arXiv:2210.02627. [Google Scholar] [CrossRef]
  94. Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Virtual, 19–23 April 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 874–880. [Google Scholar] [CrossRef]
  95. Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving language models by retrieving from trillions of tokens. arXiv 2021, arXiv:2112.04426. [Google Scholar] [CrossRef]
  96. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2023, arXiv:2310.11511. [Google Scholar] [CrossRef]
  97. Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.Y. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 2022, 23, bbac409. [Google Scholar] [CrossRef]
  98. Magesh, V.; Surani, F.; Dahl, M.; Suzgun, M.; Manning, C.D.; Ho, D.E. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. arXiv 2024, arXiv:2405.20362. [Google Scholar] [CrossRef]
  99. Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Punta Cana, Dominican Republic, 2021; pp. 8696–8708. [Google Scholar] [CrossRef]
  100. Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; Karri, R. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. arXiv 2021, arXiv:2108.09293. [Google Scholar] [CrossRef]
  101. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv 2023, arXiv:2304.10592. [Google Scholar] [CrossRef]
  102. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar] [CrossRef] [PubMed]
  103. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar] [CrossRef]
  104. Anthropic. Chat with Claude. 2024. Available online: https://claude.ai/chats (accessed on 14 May 2025).
  105. BigScience Workshop; Le Scao, T.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Sasha Luccioni, A.; Yvon, F.; et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv 2022, arXiv:2211.05100. [Google Scholar] [CrossRef]
  106. DeepSeek-AI; Liu, A.; Feng, B.; Wang, B.; Wang, B.; Liu, B.; Zhao, C.; Dengr, C.; Ruan, C.; Dai, D.; et al. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv 2024, arXiv:2405.04434. [Google Scholar] [CrossRef]
  107. Wang, B.; Komatsuzaki, A. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. 2021. Available online: https://github.com/kingoflolz/mesh-transformer-jax (accessed on 14 May 2025).
  108. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416. [Google Scholar] [CrossRef]
  109. Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403. [Google Scholar] [CrossRef]
  110. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
  111. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  112. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  113. TheBloke. Llama 2 70B Chat—AWQ. 2023. Available online: https://huggingface.co/TheBloke/Llama-2-70B-Chat-AWQ (accessed on 14 May 2025).
  114. Ai@Meta. Llama 3 Model Card. 2024. Available online: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md (accessed on 14 May 2025).
  115. Meta AI. Introducing Llama 3.1: Our Most Capable Models to Date. 2024. Available online: https://ai.meta.com/blog/meta-llama-3-1/ (accessed on 14 May 2025).
  116. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Singh Chaplot, D.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  117. Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Singh Chaplot, D.; de las Casas, D.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088. [Google Scholar] [CrossRef]
  118. Nomic AI. GPT4All: Private, Local AI Chatbot Platform by Nomic. 2025. Available online: https://www.nomic.ai/gpt4all (accessed on 14 May 2025).
  119. Liu, Z.; Ping, W.; Roy, R.; Xu, P.; Lee, C.; Shoeybi, M.; Catanzaro, B. ChatQA: Surpassing GPT-4 on Conversational QA and RAG. arXiv 2024, arXiv:2401.10225. [Google Scholar] [CrossRef]
  120. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  121. OpenAI Product. Available online: https://openai.com/product (accessed on 14 May 2025).
  122. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Leoni Aleman, F.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  123. OpenAI; Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; et al. GPT-4o System Card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
  124. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
  125. Jimeno Yepes, A.; You, Y.; Milczek, J.; Laverde, S.; Li, R. Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv 2024, arXiv:2402.05131. [Google Scholar] [CrossRef]
  126. Ge, J.; Sun, S.; Owens, J.; Galvez, V.; Gologorskaya, O.; Lai, J.C.; Pletcher, M.J.; Lai, K. Development of a Liver Disease-Specific Large Language Model Chat Interface using Retrieval Augmented Generation. medRxiv 2023. [Google Scholar] [CrossRef]
  127. Miao, J.; Thongprayoon, C.; Suppadungsuk, S.; Garcia Valencia, O.A.; Cheungpasitporn, W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina 2024, 60, 445. [Google Scholar] [CrossRef]
  128. Jiang, Z.; Ma, X.; Chen, W. LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. arXiv 2024, arXiv:2406.15319. [Google Scholar] [CrossRef]
  129. Jin, M.; Shahriar, S.; Tufano, M.; Shi, X.; Lu, S.; Sundaresan, N.; Svyatkovskiy, A. InferFix: End-to-End Program Repair with LLMs. In Proceedings of the ESEC/FSE 2023—Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 3–9 December 2023; pp. 1646–1656. [Google Scholar] [CrossRef]
  130. Cheng, P.; Ding, Y.; Ju, T.; Wu, Z.; Du, W.; Yi, P.; Zhang, Z.; Liu, G. TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models. arXiv 2024, arXiv:2405.13401. [Google Scholar] [CrossRef]
  131. Rackauckas, Z. RAG-Fusion: A New Take on Retrieval-Augmented Generation. arXiv 2024, arXiv:2402.03367. [Google Scholar] [CrossRef]
  132. Dong, G.; Zhu, Y.; Zhang, C.; Wang, Z.; Dou, Z.; Wen, J.R. Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation. arXiv 2024, arXiv:2406.18676. [Google Scholar] [CrossRef]
  133. Wang, Z.; Araki, J.; Jiang, Z.; Parvez, M.R.; Neubig, G. Learning to Filter Context for Retrieval-Augmented Generation. arXiv 2023, arXiv:2311.08377. [Google Scholar] [CrossRef]
  134. Soudani, H.; Kanoulas, E.; Hasibi, F. Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, Tokyo, Japan, 9–12 December 2024. [Google Scholar] [CrossRef]
  135. Xu, S.; Pang, L.; Shen, H.; Cheng, X.; Chua, T.S. Search-in-the-Chain: Interactively Enhancing Large Language Models with Search for Knowledge-intensive Tasks. In Proceedings of the WWW 2024—Proceedings of the ACM Web Conference, Singapore, 13–17 May 2024; Association for Computing Machinery, Inc.: Singapore, 2024; pp. 1362–1373. [Google Scholar] [CrossRef]
  136. Ke, Z.; Kong, W.; Li, C.; Zhang, M.; Mei, Q.; Bendersky, M. Bridging the Preference Gap between Retrievers and LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 10438–10451. [Google Scholar] [CrossRef]
  137. Cuconasu, F.; Trappolini, G.; Siciliano, F.; Filice, S.; Campagnano, C.; Maarek, Y.; Tonellotto, N.; Silvestri, F. The Power of Noise: Redefining Retrieval for RAG Systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024. [Google Scholar] [CrossRef]
  138. Baek, J.; Jeong, S.; Kang, M.; Park, J.C.; Hwang, S.J. Knowledge-Augmented Language Model Verification. In Proceedings of the EMNLP 2023—2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics (ACL): Singapore, 2023; pp. 1720–1736. [Google Scholar]
  139. Barnett, S.; Kurniawan, S.; Thudumu, S.; Brannelly, Z.; Abdelrazek, M. Seven Failure Points When Engineering a Retrieval Augmented Generation System. arXiv 2024, arXiv:2401.05856. [Google Scholar] [CrossRef]
  140. Adolphs, L.; Shuster, K.; Urbanek, J.; Szlam, A.; Weston, J. Reason first, then respond: Modular Generation for Knowledge-infused Dialogue. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7141–7161. [Google Scholar]
  141. Guo, Y.; Qiu, W.; Leroy, G.; Wang, S.; Cohen, T. Retrieval augmentation of large language models for lay language generation. J. Biomed. Inform. 2024, 149, 104580. [Google Scholar] [CrossRef]
  142. Shi, Z.; Zhang, S.; Sun, W.; Gao, S.; Ren, P.; Chen, Z.; Ren, Z. Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 7339–7353. [Google Scholar] [CrossRef]
  143. Jiang, Z.; Xu, F.F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active Retrieval Augmented Generation. arXiv 2023, arXiv:2305.06983. [Google Scholar] [CrossRef]
  144. Su, W.; Tang, Y.; Ai, Q.; Wu, Z.; Liu, Y. DRAGIN: Dynamic Retrieval Augmented Generation based on the Information Needs of Large Language Models. arXiv 2024, arXiv:2403.10081. [Google Scholar] [CrossRef]
  145. Béchard, P.; Marquez Ayala, O. Reducing hallucination in structured outputs via Retrieval-Augmented Generation. arXiv 2024, arXiv:2404.08189. [Google Scholar] [CrossRef]
  146. Li, J.; Liu, Y.; Fan, W.; Wei, X.Y.; Liu, H.; Tang, J.; Li, Q. Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective. arXiv 2023, arXiv:2306.06615. [Google Scholar] [CrossRef]
  147. Tsai, Y.; Liu, M.; Ren, H. RTLFixer: Automatically Fixing RTL Syntax Errors with Large Language Model. In Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 23–27 June 2024. [Google Scholar] [CrossRef]
  148. Matsumoto, N.; Moran, J.; Choi, H.; Hernandez, M.E.; Venkatesan, M.; Wang, P.; Moore, J.H. KRAGEN: A knowledge graph-enhanced RAG framework for biomedical problem solving using large language models. Bioinformatics 2024, 40, btae353. [Google Scholar] [CrossRef]
  149. Zeng, S.; Zhang, J.; He, P.; Liu, Y.; Xing, Y.; Xu, H.; Ren, J.; Chang, Y.; Wang, S.; Yin, D.; et al. The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG). In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 4505–4524. [Google Scholar] [CrossRef]
  150. Yu, H.; Guo, P.; Sano, A. Zero-Shot ECG Diagnosis with Large Language Models and Retrieval-Augmented Generation. In Proceedings of the Machine Learning for Health (ML4H), PMLR, New Orleans, LA, USA, 10 December 2023; pp. 650–663. [Google Scholar]
  151. Jin, J.; Zhu, Y.; Dong, G.; Zhang, Y.; Yang, X.; Zhang, C.; Zhao, T.; Yang, Z.; Dou, Z.; Wen, J.R. FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research. arXiv 2024, arXiv:2405.13576. [Google Scholar] [CrossRef]
  152. Wang, B.; Ping, W.; Xu, P.; McAfee, L.; Liu, Z.; Shoeybi, M.; Dong, Y.; Kuchaiev, O.; Li, B.; Xiao, C.; et al. Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study. In Proceedings of the EMNLP 2023—2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics (ACL): Singapore, 2023; pp. 7763–7786. [Google Scholar]
  153. Hu, Y.; Lei, Z.; Zhang, Z.; Pan, B.; Ling, C.; Zhao, L. GRAG: Graph Retrieval-Augmented Generation. arXiv 2024, arXiv:2405.16506. [Google Scholar] [CrossRef]
  154. Levonian, Z.; Li, C.; Zhu, W.; Gade, A.; Henkel, O.; Postle, M.E.; Xing, W. Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference. arXiv 2023, arXiv:2310.03184. [Google Scholar] [CrossRef]
  155. Yu, W. Retrieval-augmented generation across heterogeneous knowledge. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, Seattle, WA, USA and Online, 10–15 July 2022; pp. 52–58. [Google Scholar]
  156. Du, X.; Ji, H. Retrieval-Augmented Generative Question Answering for Event Argument Extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 4649–4666. [Google Scholar] [CrossRef]
  157. Di Palma, D. Retrieval-Augmented Recommender System: Enhancing Recommender Systems with Large Language Models. In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023. [Google Scholar] [CrossRef]
  158. Jeong, M.; Sohn, J.; Sung, M.; Kang, J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 2024, 40, i119–i129. [Google Scholar] [CrossRef]
  159. Yu, W.; Zhang, H.; Pan, X.; Cao, P.; Ma, K.; Li, J.; Wang, H.; Yu, D. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 14672–14685. [Google Scholar] [CrossRef]
  160. Wang, Z.; Liu, A.; Lin, H.; Li, J.; Ma, X.; Liang, Y. RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation. arXiv 2024, arXiv:2403.05313. [Google Scholar] [CrossRef]
  161. Wu, Y.; Zhu, J.; Xu, S.; Shum, K.; Niu, C.; Zhong, R.; Song, J.; Zhang, T. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. arXiv 2023, arXiv:2401.00396. [Google Scholar] [CrossRef]
  162. Wang, X.; Wang, Z.; Gao, X.; Zhang, F.; Wu, Y.; Xu, Z.; Shi, T.; Wang, Z.; Li, S.; Qian, Q.; et al. Searching for Best Practices in Retrieval-Augmented Generation. arXiv 2024, arXiv:2407.01219. [Google Scholar] [CrossRef]
  163. Cheng, X.; Luo, D.; Chen, X.; Liu, L.; Zhao, D.; Yan, R. Lift Yourself Up: Retrieval-augmented Text Generation with Self Memory. arXiv 2023, arXiv:2305.02437. [Google Scholar] [CrossRef]
  164. Wang, J.; Jiang, R.; Yang, C.; Wu, Z.; Onizuka, M.; Shibasaki, R.; Koshizuka, N.; Xiao, C. Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation. arXiv 2024, arXiv:2402.14744. [Google Scholar] [CrossRef]
  165. Yang, Y.; Xu, C.; Guo, J.; Feng, T.; Ruan, C. Improving the RAG-based Personalized Discharge Care System by Introducing the Memory Mechanism. Preprints 2024. [Google Scholar] [CrossRef]
  166. Baek, J.; Chandrasekaran, N.; Cucerzan, S.; Herring, A.; Jauhar, S.K. Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion. In Proceedings of the ACM Web Conference, Singapore, 13–17 May 2024. [Google Scholar] [CrossRef]
  167. Parvez, M.R.; Ahmad, W.U.; Chakraborty, S.; Ray, B.; Chang, K.W. Retrieval Augmented Code Generation and Summarization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2719–2734. [Google Scholar]
  168. Tian, Z.; Bi, W.; Li, X.; Zhang, N.L. Learning to abstract for memory-augmented conversational response generation. In Proceedings of the ACL 2019—57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3816–3825. [Google Scholar]
  169. Cheng, X.; Wang, X.; Zhang, X.; Ge, T.; Chen, S.Q.; Wei, F.; Zhang, H.; Zhao, D. xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token. arXiv 2024, arXiv:2405.13792. [Google Scholar] [CrossRef]
  170. Shi, E.; Wang, Y.; Tao, W.; Du, L.; Zhang, H.; Han, S.; Zhang, D.; Sun, H. RACE: Retrieval-augmented Commit Message Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5520–5530. [Google Scholar] [CrossRef]
  171. Thulke, D.; Daheim, N.; Dugast, C.; Ney, H. Efficient Retrieval Augmented Generation from Unstructured Knowledge for Task-Oriented Dialog. arXiv 2021, arXiv:2102.04643. [Google Scholar] [CrossRef]
  172. Alkhalaf, M.; Yu, P.; Yin, M.; Deng, C. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. J. Biomed. Inform. 2024, 156, 104662. [Google Scholar] [CrossRef] [PubMed]
  173. Ranjit, M.; Ganapathy, G.; Manuel, R.; Ganu, T. Retrieval Augmented Chest X-Ray Report Generation using OpenAI GPT models. arXiv 2023, arXiv:2305.03660. [Google Scholar] [CrossRef]
  174. Dixit, T.; Paranjape, B.; Hajishirzi, H.; Zettlemoyer, L. CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2964–2984. [Google Scholar]
  175. Salemi, A.; Zamani, H. Evaluating Retrieval Quality in Retrieval-Augmented Generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024. [Google Scholar] [CrossRef]
  176. Tang, Y.; Yang, Y. MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. arXiv 2024, arXiv:2401.15391. [Google Scholar] [CrossRef]
  177. Xue, J.; Zheng, M.; Hu, Y.; Liu, F.; Chen, X.; Lou, Q. BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models. arXiv 2024, arXiv:2406.00083. [Google Scholar] [CrossRef]
  178. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M., Dy, J., Natarajan, S., Eds.; Volume 38, pp. 17754–17762. [Google Scholar] [CrossRef]
  179. Deng, G.; Liu, Y.; Wang, K.; Li, Y.; Zhang, T.; Liu, Y. Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. arXiv 2024, arXiv:2402.08416. [Google Scholar] [CrossRef]
  180. Wu, K.; Wu, E.; Zou, J. ClashEval: Quantifying the tug-of-war between an LLM’s internal prior and external evidence. arXiv 2024, arXiv:2404.10198. [Google Scholar] [CrossRef]
  181. Chen, J.; Hu, X.; Li, Z.; Gao, C.; Xia, X.; Lo, D. Code Search is All You Need? Improving Code Suggestions with Code Search. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024. [Google Scholar] [CrossRef]
  182. Liu, Y.; Peng, X.; Zhang, X.; Liu, W.; Yin, J.; Cao, J.; Du, T. RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 4730–4749. [Google Scholar] [CrossRef]
  183. Kresevic, S.; Giuffrè, M.; Ajcevic, M.; Accardo, A.; Crocè, L.S.; Shung, D.L. Optimization of hepatological clinical guidelines interpretation by large language models: A retrieval augmented generation-based framework. NPJ Digit. Med. 2024, 7, 102. [Google Scholar] [CrossRef]
  184. Su, Y.; Wang, Y.; Cai, D.; Baker, S.; Korhonen, A.; Collier, N. PROTOTYPE-TO-STYLE: Dialogue Generation with Style-Aware Editing on Retrieval Memory. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2152–2161. [Google Scholar] [CrossRef]
  185. Shi, W.; Zhuang, Y.; Zhu, Y.; Iwinski, H.; Wattenbarger, M.; Wang, M.D. Retrieval-Augmented Large Language Models for Adolescent Idiopathic Scoliosis Patients in Shared Decision-Making. In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Houston, TX, USA, 3–6 September 2023. [Google Scholar] [CrossRef]
  186. Colverd, G.; Darm, P.; Silverberg, L.; Kasmanoff, N. FloodBrain: Flood Disaster Reporting by Web-based Retrieval Augmented Generation with an LLM. arXiv 2023, arXiv:2311.02597. [Google Scholar] [CrossRef]
  187. Saad-Falcon, J.; Khattab, O.; Potts, C.; Zaharia, M. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. arXiv 2023, arXiv:2311.09476. [Google Scholar] [CrossRef]
  188. Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv 2023, arXiv:2309.15217. [Google Scholar] [CrossRef]
  189. Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang, W.; Wu, H.; Liu, H.; Xu, T.; Chen, E.; et al. CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models. arXiv 2024, arXiv:2401.17043. [Google Scholar] [CrossRef]
  190. Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguist. 2019, 7, 452–466. [Google Scholar] [CrossRef]
  191. Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; Deng, L. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv 2016, arXiv:1611.09268. [Google Scholar]
  192. Butler, U. Open Australian Legal Corpus. 2025. Available online: https://huggingface.co/datasets/isaacus/open-australian-legal-corpus (accessed on 14 May 2025).
  193. Tuggener, D.; von Däniken, P.; Peetz, T.; Cieliebak, M. LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 1235–1241. [Google Scholar]
  194. Wang, L.L.; Lo, K.; Chandrasekhar, Y.; Reas, R.; Yang, J.; Burdick, D.; Eide, D.; Funk, K.; Katsis, Y.; Kinney, R.M.; et al. CORD-19: The COVID-19 Open Research Dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, 5–10 July 2020. [Google Scholar]
  195. Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.; Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2567–2577. [Google Scholar] [CrossRef]
  196. Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2369–2380. [Google Scholar] [CrossRef]
  197. Ho, X.; Duong Nguyen, A.K.; Sugawara, S.; Aizawa, A. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 6609–6625. [Google Scholar] [CrossRef]
  198. Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollar, P.; Zitnick, C.L. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv 2015, arXiv:1504.00325. [Google Scholar] [CrossRef]
  199. Husain, H.; Wu, H.H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv 2019, arXiv:1909.09436. [Google Scholar] [CrossRef]
  200. Wilmot, D.; Keller, F. Memory and Knowledge Augmented Language Models for Inferring Salience in Long-Form Stories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 851–865. [Google Scholar] [CrossRef]
  201. Chan, C.M.; Xu, C.; Yuan, R.; Luo, H.; Xue, W.; Guo, Y.; Fu, J. RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation. arXiv 2024, arXiv:2404.00610. [Google Scholar] [CrossRef]
  202. Asai, A.; Gardner, M.; Hajishirzi, H. Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022), Seattle, WA, USA, 10–15 July 2022; pp. 2226–2243. [Google Scholar]
  203. Alawwad, H.A.; Alhothali, A.; Naseem, U.; Alkhathlan, A.; Jamal, A. Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation. arXiv 2024, arXiv:2402.05128. [Google Scholar] [CrossRef]
  204. Chaudhari, H.; Severi, G.; Abascal, J.; Jagielski, M.; Choquette-Choo, C.A.; Nasr, M.; Nita-Rotaru, C.; Oprea, A. Phantom: General Trigger Attacks on Retrieval Augmented Language Generation. arXiv 2024, arXiv:2405.20485. [Google Scholar] [CrossRef]
  205. Qi, Z.; Zhang, H.; Xing, E.; Kakade, S.; Lakkaraju, H. Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems. arXiv 2024, arXiv:2402.17840. [Google Scholar] [CrossRef]
  206. Ovadia, O.; Brief, M.; Mishaeli, M.; Elisha, O. Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. arXiv 2023, arXiv:2312.05934. [Google Scholar] [CrossRef]
  207. Salemi, A.; Kallumadi, S.; Zamani, H. Optimization Methods for Personalizing Large Language Models through Retrieval Augmentation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024. [Google Scholar] [CrossRef]
  208. Li, Z.; Li, C.; Zhang, M.; Mei, Q.; Bendersky, M. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. arXiv 2024, arXiv:2407.16833. [Google Scholar] [CrossRef]
  209. Lobentanzer, S.; Feng, S.; Bruderer, N.; Maier, A.; Díaz, A.G.; Strange, A.; Ismail, A.; Kulaga, A.; Dugourd, A.; Zdrazil, B.; et al. A platform for the biomedical application of large language models. Nat. Biotechnol. 2025, 43, 166–169. [Google Scholar] [CrossRef] [PubMed]
  210. Joshi, M.; Choi, E.; Weld, D.; Zettlemoyer, L. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1601–1611. [Google Scholar] [CrossRef]
  211. Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. MuSiQue: Multihop Questions via Single-hop Question Composition. Trans. Assoc. Comput. Linguist. 2022, 10, 539–554. [Google Scholar] [CrossRef]
  212. Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. FEVER: A Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 809–819. [Google Scholar] [CrossRef]
  213. Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Trans. Assoc. Comput. Linguist. 2021, 9, 346–361. [Google Scholar] [CrossRef]
  214. Dinan, E.; Roller, S.; Shuster, K.; Fan, A.; Auli, M.; Weston, J. Wizard of Wikipedia: Knowledge-Powered Conversational agents. arXiv 2018, arXiv:1811.01241. [Google Scholar] [CrossRef]
  215. Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1533–1544. [Google Scholar]
  216. Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; Tafjord, O. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv 2018, arXiv:1803.05457. [Google Scholar] [CrossRef]
  217. Fan, A.; Jernite, Y.; Perez, E.; Grangier, D.; Weston, J.; Auli, M. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3558–3567. [Google Scholar] [CrossRef]
  218. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. arXiv 2020, arXiv:2009.03300. [Google Scholar] [CrossRef]
  219. Kočiský, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, K.M.; Melis, G.; Grefenstette, E. The NarrativeQA Reading Comprehension Challenge. arXiv 2017, arXiv:1712.07040. [Google Scholar] [CrossRef]
  220. Mallen, A.; Asai, A.; Zhong, V.; Das, R.; Khashabi, D.; Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. arXiv 2022, arXiv:2212.10511. [Google Scholar] [CrossRef]
  221. Yih, W.t.; Richardson, M.; Meek, C.; Chang, M.W.; Suh, J. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016. [Google Scholar]
  222. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6769–6781. [Google Scholar] [CrossRef]
  223. Stelmakh, I.; Luan, Y.; Dhingra, B.; Chang, M.W. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 8273–8288. [Google Scholar] [CrossRef]
  224. Mihaylov, T.; Clark, P.; Khot, T.; Sabharwal, A. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. arXiv 2018, arXiv:1809.02789. [Google Scholar] [CrossRef]
  225. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 2383–2392. [Google Scholar] [CrossRef]
  226. Elsahar, H.; Vougiouklis, P.; Remaci, A.; Gravier, C.; Hare, J.; Laforest, F.; Simperl, E. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
  227. Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 3214–3252. [Google Scholar] [CrossRef]
  228. Levy, O.; Seo, M.; Choi, E.; Zettlemoyer, L. Zero-Shot Relation Extraction via Reading Comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 333–342. [Google Scholar] [CrossRef]
  229. Reddy, S.; Chen, D.; Manning, C.D. CoQA: A Conversational Question Answering Challenge. arXiv 2018, arXiv:1808.07042. [Google Scholar] [CrossRef]
  230. Bai, Y.; Lv, X.; Zhang, J.; Lyu, H.; Tang, J.; Huang, Z.; Du, Z.; Liu, X.; Zeng, A.; Hou, L.; et al. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv 2023, arXiv:2308.14508. [Google Scholar] [CrossRef]
  231. Bisk, Y.; Zellers, R.; Le bras, R.; Gao, J.; Choi, Y. PIQA: Reasoning about Physical Commonsense in Natural Language. Proc. AAAI Conf. Artif. Intell. 2020, 34, 7432–7439. [Google Scholar] [CrossRef]
  232. Dasigi, P.; Lo, K.; Beltagy, I.; Cohan, A.; Smith, N.A.; Gardner, M. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. arXiv 2021, arXiv:2105.03011. [Google Scholar] [CrossRef]
  233. Guo, Q.; Cao, S.; Yi, Z. A medical question answering system using large language models and knowledge graphs. Int. J. Intell. Syst. 2022, 37, 8548–8564. [Google Scholar] [CrossRef]
  234. Hayashi, H.; Budania, P.; Wang, P.; Ackerson, C.; Neervannan, R.; Neubig, G. WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Trans. Assoc. Comput. Linguist. 2021, 9, 211–225. [Google Scholar] [CrossRef]
  235. Yang, Y.; Yih, W.T.; Meek, C. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015. [Google Scholar]
  236. Press, O.; Zhang, M.; Min, S.; Schmidt, L.; Smith, N.A.; Lewis, M. Measuring and Narrowing the Compositionality Gap in Language Models. arXiv 2022, arXiv:2210.03350. [Google Scholar] [CrossRef]
  237. Krithara, A.; Nentidis, A.; Bougiatiotis, K.; Paliouras, G. BioASQ-QA: A manually curated corpus for Biomedical Question Answering. Sci. Data 2023, 10, 170. [Google Scholar] [CrossRef] [PubMed]
  238. Clark, C.; Lee, K.; Chang, M.W.; Kwiatkowski, T.; Collins, M.; Toutanova, K. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2924–2936. [Google Scholar] [CrossRef]
  239. See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1073–1083. [Google Scholar] [CrossRef]
  240. Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv 2021, arXiv:2102.04664. [Google Scholar] [CrossRef]
  241. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv 2019, arXiv:1910.10683. [Google Scholar] [CrossRef]
  242. Wenzek, G.; Lachaux, M.A.; Conneau, A.; Chaudhary, V.; Guzmán, F.; Joulin, A.; Grave, E. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4003–4012. [Google Scholar]
  243. Talmor, A.; Herzig, J.; Lourie, N.; Berant, J. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4149–4158. [Google Scholar] [CrossRef]
  244. Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar] [CrossRef]
  245. Conover, M.; Hayes, M.; Mathur, A.; Xie, J.; Wan, J.; Shah, S.; Ghodsi, A.; Wendell, P.; Zaharia, M.; Xin, R. Databricks-Dolly-15K. 2023. Available online: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm (accessed on 14 May 2025).
  246. Saha, S.; Yadav, P.; Bauer, L.; Bansal, M. ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7716–7740. [Google Scholar] [CrossRef]
  247. Jia, X.; Gavves, E.; Fernando, B.; Tuytelaars, T. Guiding Long-Short Term Memory for Image Caption Generation. arXiv 2015, arXiv:1509.04942. [Google Scholar] [CrossRef]
  248. Luo, M.; Zeng, Y.; Banerjee, P.; Baral, C. Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6417–6431. [Google Scholar] [CrossRef]
  249. Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4791–4800. [Google Scholar] [CrossRef]
  250. Ferguson, J.; Gardner, M.; Hajishirzi, H.; Khot, T.; Dasigi, P. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1137–1147. [Google Scholar] [CrossRef]
  251. Schuhmann, C.; Vencu, R.; Beaumont, R.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Coombes, T.; Jitsev, J.; Komatsuzaki, A. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv 2021, arXiv:2111.02114. [Google Scholar] [CrossRef]
  252. Talmor, A.; Yoran, O.; Catav, A.; Lahav, D.; Wang, Y.; Asai, A.; Ilharco, G.; Hajishirzi, H.; Berant, J. MultiModalQA: Complex Question Answering over Text, Tables and Images. arXiv 2021, arXiv:2104.06039. [Google Scholar] [CrossRef]
  253. Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. arXiv 2019, arXiv:1906.00067. [Google Scholar] [CrossRef]
  254. Zhang, T.; Luo, H.; Chuang, Y.S.; Fang, W.; Gaitskell, L.; Hartvigsen, T.; Wu, X.; Fox, D.; Meng, H.; Glass, J. Interpretable Unified Language Checking. arXiv 2023, arXiv:2304.03728. [Google Scholar] [CrossRef]
  255. PubMed Database. 1996. Available online: https://pubmed.ncbi.nlm.nih.gov/ (accessed on 14 May 2025).
  256. Zhong, M.; Yin, D.; Yu, T.; Zaidi, A.; Mutuma, M.; Jha, R.; Awadallah, A.H.; Celikyilmaz, A.; Liu, Y.; Qiu, X.; et al. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5905–5921. [Google Scholar] [CrossRef]
  257. Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; Choi, Y. Defending against neural fake news. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; p. 812. [Google Scholar]
  258. WikiData. Available online: https://www.wikidata.org/ (accessed on 14 May 2025).
  259. Izacard, G.; Lewis, P.; Lomeli, M.; Hosseini, L.; Petroni, F.; Schick, T.; Dwivedi-Yu, J.; Joulin, A.; Riedel, S.; Grave, E. Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv 2022, arXiv:2208.03299. [Google Scholar] [CrossRef]
  260. Li, S.; Ji, H.; Han, J. Document-Level Event Argument Extraction by Conditional Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 894–908. [Google Scholar] [CrossRef]
  261. Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. Pointer Sentinel Mixture Models. arXiv 2016, arXiv:1609.07843. [Google Scholar] [CrossRef]
  262. Craswell, N.; Mitra, B.; Yilmaz, E.; Campos, D.; Voorhees, E.M. Overview of the TREC 2019 deep learning track. arXiv 2020, arXiv:2003.07820. [Google Scholar] [CrossRef]
  263. Craswell, N.; Mitra, B.; Yilmaz, E.; Campos, D.F.; Voorhees, E.M. Overview of the TREC 2020 Deep Learning Track. arXiv 2021, arXiv:2102.07662. [Google Scholar] [CrossRef]
  264. Doddington, G.; Mitchell, A.; Przybocki, M.; Ramshaw, L.; Strassel, S.; Weischedel, R. The Automatic Content Extraction (ACE) Program—Tasks, Data, and Evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal, 26–28 May 2004. [Google Scholar]
  265. Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; Niebles, J.C. Dense-Captioning Events in Videos. arXiv 2017, arXiv:1705.00754. [Google Scholar] [CrossRef]
  266. Gurulingappa, H.; Rajput, A.M.; Roberts, A.; Fluck, J.; Hofmann-Apitius, M.; Toldo, L. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J. Biomed. Inform. 2012, 45, 885–892. [Google Scholar] [CrossRef] [PubMed]
  267. Lu, W.; Zeng, Z.; Wang, J.; Lu, Z.; Chen, Z.; Zhuang, H.; Chen, C. Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge. arXiv 2024, arXiv:2404.05880. [Google Scholar] [CrossRef]
  268. Nie, Y.; Williams, A.; Dinan, E.; Bansal, M.; Weston, J.; Kiela, D. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4885–4901. [Google Scholar] [CrossRef]
  269. Mao, J.; Ye, J.; Qian, Y.; Pavone, M.; Wang, Y. A Language Agent for Autonomous Driving. arXiv 2023, arXiv:2311.10813. [Google Scholar] [CrossRef]
  270. Zhang, X.; Zhao, J.; LeCun, Y. Character-level Convolutional Networks for Text Classification. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 7–12 December 2015. [Google Scholar]
  271. Hoffart, J.; Yosef, M.A.; Bordino, I.; Fürstenau, H.; Pinkal, M.; Spaniol, M.; Taneva, B.; Thater, S.; Weikum, G. Robust Disambiguation of Named Entities in Text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, 27–31 July 2011; pp. 782–792. [Google Scholar]
  272. Xiao, Y.; Hou, Y.; Zhou, H.; Diallo, G.; Fiszman, M.; Wolfson, J.; Kilicoglu, H.; Chen, Y.; Su, C.; Xu, H.; et al. Repurposing Non-pharmacological Interventions for Alzheimer’s Diseases through Link Prediction on Biomedical Literature. medRxiv 2023. [Google Scholar] [CrossRef] [PubMed]
  273. Romano, J.D.; Truong, V.; Kumar, R.; Venkatesan, M.; Graham, B.E.; Hao, Y.; Matsumoto, N.; Li, X.; Wang, Z.; Ritchie, M.D.; et al. The Alzheimer’s Knowledge Base: A Knowledge Graph for Alzheimer Disease Research. J. Med. Internet Res. 2024, 26, e46777. [Google Scholar] [CrossRef]
  274. Dong, L.; Huang, S.; Wei, F.; Lapata, M.; Zhou, M.; Xu, K. Learning to Generate Product Reviews from Attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, 3–7 April 2017; pp. 623–632. [Google Scholar]
  275. McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013. [Google Scholar] [CrossRef]
  276. Min, S.; Michael, J.; Hajishirzi, H.; Zettlemoyer, L. AmbigQA: Answering Ambiguous Open-domain Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 5783–5797. [Google Scholar] [CrossRef]
  277. Penzel, T.; Moody, G.B.; Mark, R.G.; Goldberger, A.L.; Peter, J.H. The apnea-ECG database. In Proceedings of the Computers in Cardiology 2000, Vol.27 (Cat. 00CH37163), Cambridge, MA, USA, 24–27 September 2000; pp. 255–258. [Google Scholar]
  278. Oard, D.; Webber, W.; Kirsch, D.; Golitsynskiy, S. Avocado Research Email Collection; Linguistic Data Consortium: Philadelphia, PA, USA, 2015. [Google Scholar]
  279. Parrish, A.; Chen, A.; Nangia, N.; Padmakumar, V.; Phang, J.; Thompson, J.; Htut, P.M.; Bowman, S. BBQ: A hand-built bias benchmark for question answering. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 2086–2105. [Google Scholar] [CrossRef]
  280. Sharma, E.; Li, C.; Wang, L. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2204–2213. [Google Scholar] [CrossRef]
  281. Microsoft. Bing. 2009. Available online: https://www.bing.com/ (accessed on 14 May 2025).
  282. Min, S.; Krishna, K.; Lyu, X.; Lewis, M.; Yih, W.t.; Koh, P.; Iyyer, M.; Zettlemoyer, L.; Hajishirzi, H. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12076–12100. [Google Scholar] [CrossRef]
  283. Mungall, C.J.; McMurry, J.A.; Köhler, S.; Balhoff, J.P.; Borromeo, C.; Brush, M.; Carbon, S.; Conlin, T.; Dunn, N.; Engelstad, M.; et al. The Monarch Initiative: An integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2017, 45, D712–D722. [Google Scholar] [CrossRef] [PubMed]
  284. Chalkidis, I.; Jana, A.; Hartung, D.; Bommarito, M.; Androutsopoulos, I.; Katz, D.; Aletras, N. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 4310–4330. [Google Scholar] [CrossRef]
  285. Bondarenko, M.; Kerr, D.; Sorichetta, A.; Tatem, A. Census/projection-disaggregated gridded population datasets for 189 countries in 2020 using Built-Settlement Growth Model (BSGM) outputs [Dataset]. University of Southampton, Southampton, UK, 2020. Available online: https://www.worldpop.org/doi/10.5258/SOTON/WP00684 (accessed on 14 May 2025).
  286. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar] [CrossRef]
  287. Edwards, C.; Zhai, C.; Ji, H. Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 595–607. [Google Scholar] [CrossRef]
  288. Taboureau, O.; Nielsen, S.; Audouze, K.; Weinhold, N.; Edsgärd, D.; Roque, F.; Kouskoumvekaki, I.; Bora, A.; Curpan, R.; Jensen, T.; et al. ChemProt: A disease chemical biology database. Nucleic Acids Res. 2010, 39, D367–D372. [Google Scholar] [CrossRef]
  289. Chen, Z.; Hernández Cano, A.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; et al. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. arXiv 2023, arXiv:2311.16079. [Google Scholar] [CrossRef]
  290. Tufano, M.; Watson, C.; Bavota, G.; Di Penta, M.; White, M.; Poshyvanyk, D. An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. arXiv 2018, arXiv:1812.08693. [Google Scholar] [CrossRef]
  291. Liu, C.; Xia, X.; Lo, D.; Liu, Z.; Hassan, A.E.; Li, S. CodeMatcher: Searching Code Based on Sequential Semantics of Important Query Words. ACM Trans. Softw. Eng. Methodol. 2021, 31, 12. [Google Scholar] [CrossRef]
  292. CodeParrot. github-jupyter. 2022. Available online: https://huggingface.co/datasets/codeparrot/github-jupyter (accessed on 14 May 2025).
  293. Speer, R.; Chin, J.; Havasi, C. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. Proc. AAAI Conf. Artif. Intell. 2017, 31, 4444–4451. [Google Scholar] [CrossRef]
  294. Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts. arXiv 2021, arXiv:2102.08981. [Google Scholar] [CrossRef]
  295. Iyer, S.; Konstas, I.; Cheung, A.; Zettlemoyer, L. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1643–1652. [Google Scholar] [CrossRef]
  296. Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 142–147. [Google Scholar]
  297. Roth, D.; Yih, W.t. A Linear Programming Formulation for Global Inference in Natural Language Tasks. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004, Boston, MA, USA, 6–7 May 2004; pp. 1–8. [Google Scholar]
  298. Wu, C.S.; Madotto, A.; Liu, W.; Fung, P.; Xiong, C. QAConv: Question Answering on Informative Conversations. arXiv 2021, arXiv:2105.06912. [Google Scholar] [CrossRef]
  299. Chen, Z.; Li, S.; Smiley, C.; Ma, Z.; Shah, S.; Wang, W.Y. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 6279–6292. [Google Scholar] [CrossRef]
  300. Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S. COYO-700M: Image-Text Pair Dataset. arXiv 2022, arXiv:2303.03378. [Google Scholar]
  301. Onoe, Y.; Zhang, M.J.Q.; Choi, E.; Durrett, G. CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge. arXiv 2021, arXiv:2109.01653. [Google Scholar] [CrossRef]
  302. Ding, Y.; Wang, Z.; Ahmad, W.U.; Ding, H.; Tan, M.; Jain, N.; Krishna Ramanathan, M.; Nallapati, R.; Bhatia, P.; Roth, D.; et al. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. arXiv 2023, arXiv:2310.11248. [Google Scholar] [CrossRef]
  303. Talmor, A.; Yoran, O.; Le Bras, R.; Bhagavatula, C.; Goldberg, Y.; Choi, Y.; Berant, J. CommonsenseQA 2.0: Exposing the Limits of AI through Gamification. arXiv 2022, arXiv:2201.05320. [Google Scholar] [CrossRef]
  304. Baudiš, P.; Šedivý, J. Modeling of the Question Answering Task in the YodaQA System. In Experimental IR Meets Multilinguality, Multimodality, and Interaction; Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G., San Juan, E., Cappellato, L., Ferro, N., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 222–228. [Google Scholar]
  305. Ramesh, V.; Chi, N.A.; Rajpurkar, P. CXR-PRO: MIMIC-CXR with Prior References Omitted (version 1.0.0). PhysioNet 2022. [Google Scholar] [CrossRef]
  306. Satyapanich, T.; Ferraro, F.; Finin, T. CASIE: Extracting Cybersecurity Event Information from Text. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8749–8757. [Google Scholar] [CrossRef]
  307. Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; Niu, S. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, 27 November–1 December 2017; pp. 986–995. [Google Scholar]
  308. Just, R.; Jalali, D.; Ernst, M.D. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, San Jose, CA, USA, 21–25 July 2014. [Google Scholar] [CrossRef]
  309. DIG Minecraft. 2025. Available online: https://www.digminecraft.com/ (accessed on 14 May 2025).
  310. Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; Gardner, M. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2368–2378. [Google Scholar] [CrossRef]
  311. Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; Nakamura, S. Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 574–584. [Google Scholar] [CrossRef]
  312. Feng, S.; Wan, H.; Gunasekara, C.; Patel, S.; Joshi, S.; Lastras, L. doc2dial: A Goal-Oriented Document-Grounded Dialogue Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 8118–8128. [Google Scholar] [CrossRef]
  313. Wang, S.; Liu, J.; Song, S.; Cheng, J.; Fu, Y.; Guo, P.; Fang, K.; Zhu, Y.; Dou, Z. DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation. arXiv 2024, arXiv:2406.05654. [Google Scholar] [CrossRef]
  314. Campos, J.A.; Otegi, A.; Soroa, A.; Deriu, J.; Cieliebak, M.; Agirre, E. DoQA—Accessing Domain-Specific FAQs via Conversational QA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 7302–7314. [Google Scholar] [CrossRef]
  315. Segura-Bedmar, I.; Martínez, P.; Herrero-Zazo, M. SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 14–15 June 2013; pp. 341–350. [Google Scholar]
  316. DynaMed. 2025. Available online: https://www.dynamed.com/ (accessed on 14 May 2025).
  317. Shi, W.; Xu, R.; Zhuang, Y.; Yu, Y.; Zhang, J.; Wu, H.; Zhu, Y.; Ho, J.; Yang, C.; Wang, M.D. EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 22315–22339. [Google Scholar] [CrossRef]
  318. Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; Liu, B. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. Proc. AAAI Conf. Artif. Intell. 2018, 32, 730–738. [Google Scholar] [CrossRef]
  319. Zhang, X.; Chen, Y.; Hu, S.; Xu, Z.; Chen, J.; Khai Hao, M.; Han, X.; Leng Thai, Z.; Wang, S.; Liu, Z.; et al. ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. arXiv 2024, arXiv:2402.13718. [Google Scholar] [CrossRef]
  320. Mensink, T.; Uijlings, J.; Castrejon, L.; Goel, A.; Cadar, F.; Zhou, H.; Sha, F.; Araujo, A.; Ferrari, V. Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories. arXiv 2023, arXiv:2306.09224. [Google Scholar] [CrossRef]
  321. Sciavolino, C.; Zhong, Z.; Lee, J.; Chen, D. Simple Entity-Centric Questions Challenge Dense Retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6138–6148. [Google Scholar] [CrossRef]
  322. European Association for the Study of the Liver. EASL recommendations on treatment of hepatitis C: Final update of the series. J. Hepatol. 2020, 73, 1170–1218. [Google Scholar] [CrossRef]
  323. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1797–1807. [Google Scholar] [CrossRef]
  324. Facebook Books Dataset. 2018. Available online: https://github.com/sisinflab/LinkedDatasets/tree/master/facebook_book (accessed on 14 May 2025).
  325. Aly, R.; Guo, Z.; Schlichtkrull, M.; Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Cocarascu, O.; Mittal, A. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. arXiv 2021, arXiv:2106.05707. [Google Scholar] [CrossRef]
  326. Park, J.; Min, S.; Kang, J.; Zettlemoyer, L.; Hajishirzi, H. FaVIQ: FAct Verification from Information-seeking Questions. arXiv 2021, arXiv:2107.02153. [Google Scholar] [CrossRef]
  327. Kim, J.; Park, S.; Kwon, Y.; Jo, Y.; Thorne, J.; Choi, E. FactKG: Fact Verification via Reasoning on Knowledge Graphs. arXiv 2023, arXiv:2305.06590. [Google Scholar] [CrossRef]
  328. Lee, N.; Ping, W.; Xu, P.; Patwary, M.; Fung, P.N.; Shoeybi, M.; Catanzaro, B. Factuality Enhanced Language Models for Open-Ended Text Generation. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  329. Kalyan, A.; Kumar, A.; Chandrasekaran, A.; Sabharwal, A.; Clark, P. How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7318–7328. [Google Scholar] [CrossRef]
  330. Islam, P.; Kannappan, A.; Kiela, D.; Qian, R.; Scherrer, N.; Vidgen, B. FinanceBench: A New Benchmark for Financial Question Answering. arXiv 2023, arXiv:2311.11944. [Google Scholar] [CrossRef]
  331. Jiang, K.; Wu, D.; Jiang, H. FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 318–323. [Google Scholar] [CrossRef]
  332. Vu, T.; Iyyer, M.; Wang, X.; Constant, N.; Wei, J.; Wei, J.; Tar, C.; Sung, Y.H.; Zhou, D.; Le, Q.; et al. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 13697–13720. [Google Scholar] [CrossRef]
  333. Zong, Y.; Qiu, X. GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 8817–8825. [Google Scholar] [CrossRef]
  334. Su, Y.; Cai, D.; Wang, Y.; Baker, S.; Korhonen, A.; Collier, N.; Liu, X. Stylistic Dialogue Generation via Information-Guided Reinforcement Learning Strategy. arXiv 2020, arXiv:2004.02202. [Google Scholar] [CrossRef]
  335. Li, M.; Zhou, H.; Zhang, R. Benchmarking Large Language Models in Biomedical Triple Extraction. arXiv 2023, arXiv:2310.18463. [Google Scholar] [CrossRef]
  336. Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-aided Language Models. arXiv 2022, arXiv:2211.10435. [Google Scholar] [CrossRef]
  337. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training Verifiers to Solve Math Word Problems. arXiv 2021, arXiv:2110.14168. [Google Scholar] [CrossRef]
  338. Zhou, Y.; Tan, C. Investigating the Effect of Natural Language Explanations on Out-of-Distribution Generalization in Few-shot NLI. In Proceedings of the Second Workshop on Insights from Negative Results in NLP, Punta Cana, Dominican Republic, 10 November 2021; pp. 117–124. [Google Scholar] [CrossRef]
  339. Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv 2020, arXiv:2101.00027. [Google Scholar] [CrossRef]
  340. Harvard Law Case Corpus. 2024. Available online: https://case.law/ (accessed on 14 May 2025).
  341. Luo, Y.; Shi, M.; Osama Khan, M.; Muneeb Afzal, M.; Huang, H.; Yuan, S.; Tian, Y.; Song, L.; Kouhana, A.; Elze, T.; et al. FairCLIP: Harnessing Fairness in Vision-Language Learning. arXiv 2024, arXiv:2403.19949. [Google Scholar] [CrossRef]
  342. Li, Y.; Li, Z.; Zhang, K.; Dan, R.; Jiang, S.; Zhang, Y. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. arXiv 2023, arXiv:2303.14070. [Google Scholar] [CrossRef] [PubMed]
  343. Ling, W.; Blunsom, P.; Grefenstette, E.; Hermann, K.M.; Kočiský, T.; Wang, F.; Senior, A. Latent Predictor Networks for Code Generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 599–609. [Google Scholar] [CrossRef]
  344. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Ponde de Oliveira Pinto, H.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
  345. Liu, J.; Xia, C.S.; Wang, Y.; Zhang, L. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  346. Nakamura, K.; Levy, S.; Tuan, Y.L.; Chen, W.; Wang, W.Y. HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 481–492. [Google Scholar] [CrossRef]
  347. IMDb. IMDb Non-Commercial Datasets. 2024. Available online: https://developer.imdb.com/non-commercial-datasets/ (accessed on 14 May 2025).
  348. Infineon Developer Community. Developer Community Forum Questions. 1999. Available online: https://community.infineon.com/ (accessed on 14 May 2025).
  349. Infineon Technologies. XENSIV™—Sensing the World: Sensor Solutions for Automotive, Industrial, Consumer and IoT Applications. Available online: https://www.infineon.com/cms/en/product/sensor/mems-microphones/ (accessed on 14 May 2025).
  350. Chen, Y.; Hu, H.; Luan, Y.; Sun, H.; Changpinyo, S.; Ritter, A.; Chang, M.W. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 14948–14968. [Google Scholar] [CrossRef]
  351. Wu, Z.; Parish, R.; Cheng, H.; Min, S.; Ammanabrolu, P.; Ostendorf, M.; Hajishirzi, H. InSCIt: Information-Seeking Conversations with Mixed-Initiative Interactions. Trans. Assoc. Comput. Linguist. 2023, 11, 453–468. [Google Scholar] [CrossRef]
  352. Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2016, 23, 304–310. [Google Scholar] [CrossRef]
  353. Steinberger, R.; Pouliquen, B.; Widiger, A.; Ignat, C.; Erjavec, T.; Tufiş, D.; Varga, D. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 22–28 May 2006. [Google Scholar]
  354. Petroni, F.; Piktus, A.; Fan, A.; Lewis, P.; Yazdani, M.; De Cao, N.; Thorne, J.; Jernite, Y.; Karpukhin, V.; Maillard, J.; et al. KILT: A Benchmark for Knowledge Intensive Language Tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 2523–2544. [Google Scholar] [CrossRef]
  355. Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, N.Q.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1525–1534. [Google Scholar] [CrossRef]
  356. Salemi, A.; Mysore, S.; Bendersky, M.; Zamani, H. LaMP: When Large Language Models Meet Personalization. arXiv 2023, arXiv:2304.11406. [Google Scholar] [CrossRef]
  357. Guha, N.; Nyarko, J.; Ho, D.E.; Ré, C.; Chilton, A.; Narayana, A.; Chohlas-Wood, A.; Peters, A.; Waldon, B.; Rockmore, D.N.; et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. arXiv 2023, arXiv:2308.11462. [Google Scholar] [CrossRef]
  358. Shuster, K.; Urbanek, J.; Dinan, E.; Szlam, A.; Weston, J. Deploying Lifelong Open-Domain Dialogue Learning. arXiv 2020, arXiv:2008.08076. [Google Scholar] [CrossRef]
  359. Ben Abacha, A.; Agichtein, E.; Pinter, Y.; Demner-Fushman, D. Overview of the Medical Question Answering Task at TREC 2017 LiveQA. In Proceedings of the Text REtrieval Conference, Gaithersburg, MD, USA, 15–17 November 2017. [Google Scholar]
  360. Lyft_2021. 2021. Available online: https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf (accessed on 14 May 2025).
  361. Yue, X.; Ni, Y.; Zhang, K.; Zheng, T.; Liu, R.; Zhang, G.; Stevens, S.; Jiang, D.; Ren, W.; Sun, Y.; et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. arXiv 2023, arXiv:2311.16502. [Google Scholar] [CrossRef]
  362. Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.W.; Galley, M.; Gao, J. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv 2023, arXiv:2310.02255. [Google Scholar] [CrossRef]
  363. MTSamples. Available online: https://mtsamples.com/ (accessed on 14 May 2025).
  364. Ben Abacha, A.; Mrabet, Y.; Sharp, M.; Goodwin, T.R.; Shooshan, S.E.; Demner-Fushman, D. Bridging the Gap Between Consumers’ Medication Questions and Trusted Answers. Stud. Health Technol. Inform. 2019, 264, 25–29. [Google Scholar] [CrossRef] [PubMed]
  365. Zhang, X.; Tian, C.; Yang, X.; Chen, L.; Li, Z.; Petzold, L.R. AlpaCare: Instruction-tuned Large Language Models for Medical Application. arXiv 2023, arXiv:2310.14558. [Google Scholar] [CrossRef]
  366. Pal, A.; Umapathi, L.K.; Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. In Proceedings of the Conference on Health, Inference, and Learning, Virtual, 7–8 April 2022. [Google Scholar]
  367. Jin, D.; Pan, E.; Oufattole, N.; Weng, W.H.; Fang, H.; Szolovits, P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Appl. Sci. 2021, 11, 6421. [Google Scholar] [CrossRef]
  368. Zhang, Y.; Dai, H.; Kozareva, Z.; Smola, A.; Song, L. Variational Reasoning for Question Answering with Knowledge Graph. Proc. AAAI Conf. Artif. Intell. 2018, 32, 6069–6076. [Google Scholar] [CrossRef]
  369. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar] [CrossRef]
  370. Dolan, B.; Quirk, C.; Brockett, C. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. In Proceedings of the COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; pp. 350–356. [Google Scholar]
  371. Chen, D.; Dolan, W. Collecting Highly Parallel Data for Paraphrase Evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 190–200. [Google Scholar]
  372. Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5288–5296. [Google Scholar]
  373. Johnson, A.E.W.; Pollard, T.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.Y.; Peng, Y.; Lu, Z.; Mark, R.G.; Berkowitz, S.J.; Horng, S. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv 2019, arXiv:1901.07042. [Google Scholar] [CrossRef]
  374. Minecraft Wiki. Available online: https://minecraft.wiki/ (accessed on 14 May 2025).
  375. Sen, P.; Aji, A.F.; Saffari, A. Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 1604–1619. [Google Scholar]
  376. Liu, Y.; Duan, H.; Zhang, Y.; Li, B.; Zhang, S.; Zhao, W.; Yuan, Y.; Wang, J.; He, C.; Liu, Z.; et al. MMBench: Is Your Multi-modal Model an All-around Player? arXiv 2023, arXiv:2307.06281. [Google Scholar] [CrossRef]
  377. Fang, Y.; Liang, X.; Zhang, N.; Liu, K.; Huang, R.; Chen, Z.; Fan, X.; Chen, H. Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models. arXiv 2023, arXiv:2306.08018. [Google Scholar] [CrossRef]
  378. Austin, J.; Odena, A.; Nye, M.; Bosma, M.; Michalewski, H.; Dohan, D.; Jiang, E.; Cai, C.; Terry, M.; Le, Q.; et al. Program Synthesis with Large Language Models. arXiv 2021, arXiv:2108.07732. [Google Scholar] [CrossRef]
  379. MovieLens. 1998. Available online: https://grouplens.org/datasets/movielens/ (accessed on 14 May 2025).
  380. Boecking, B.; Usuyama, N.; Bannur, S.; Castro, D.C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al. Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. arXiv 2022, arXiv:2204.09817. [Google Scholar] [CrossRef]
  381. Eric, M.; Goel, R.; Paul, S.; Sethi, A.; Agarwal, S.; Gao, S.; Kumar, A.; Goyal, A.; Ku, P.; Hakkani-Tur, D. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 422–428. [Google Scholar]
  382. Williams, A.; Nangia, N.; Bowman, S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 1112–1122. [Google Scholar] [CrossRef]
  383. Tao, W.; Wang, Y.; Shi, E.; Du, L.; Han, S.; Zhang, H.; Zhang, D.; Zhang, W. On the Evaluation of Commit Message Generation Models: An Experimental Study. arXiv 2021, arXiv:2107.05373. [Google Scholar] [CrossRef]
  384. Khashabi, D.; Chaturvedi, S.; Roth, M.; Upadhyay, S.; Roth, D. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 252–262. [Google Scholar] [CrossRef]
  385. Fu, C.; Chen, P.; Shen, Y.; Qin, Y.; Zhang, M.; Lin, X.; Yang, J.; Zheng, X.; Li, K.; Sun, X.; et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv 2023, arXiv:2306.13394. [Google Scholar] [CrossRef]
  386. Lin, X.V.; Wang, C.; Zettlemoyer, L.; Ernst, M.D. NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
  387. Agarwal, M.; Chakraborti, T.; Fu, Q.; Gros, D.; Lin, X.V.; Maene, J.; Talamadupula, K.; Teng, Z.; White, J. NeurIPS 2020 NLC2CMD Competition: Translating Natural Language to Bash Commands. arXiv 2021, arXiv:2103.02523. [Google Scholar] [CrossRef]
  388. Riedel, S.; Yao, L.; McCallum, A. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  389. Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni, A.; Bachman, P.; Suleman, K. NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, BC, Canada, 3 August 2017; pp. 191–200. [Google Scholar] [CrossRef]
  390. Agrawal, H.; Desai, K.; Wang, Y.; Chen, X.; Jain, R.; Johnson, M.; Batra, D.; Parikh, D.; Lee, S.; Anderson, P. nocaps: Novel object captioning at scale. arXiv 2018, arXiv:1812.08658. [Google Scholar] [CrossRef]
  391. Bhattacharya, D.; Aronsohn, A.; Price, J.; Lo Re, V. Hepatitis C Guidance 2023 Update: AASLD-IDSA Recommendations for Testing, Managing, and Treating Hepatitis C Virus Infection. Clin. Infect. Dis. 2023, ciad319. [Google Scholar] [CrossRef]
  392. Lee, K.; Chang, M.W.; Toutanova, K. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6086–6096. [Google Scholar] [CrossRef]
  393. Marecek, L.; Anthony-Smith, M.; Mathis, A.H. Prealgebra 2e; OpenStax: Houston, TX, USA, 2020. [Google Scholar]
  394. OpenStreetMap Contributors. Planet Dump. 2017. Available online: https://planet.osm.org (accessed on 14 May 2025).
  395. Dong, Q.; Wan, X.; Cao, Y. ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase Generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 424–434. [Google Scholar] [CrossRef]
  396. PubMed Central (PMC) Full-Text Articles. Available online: https://www.ncbi.nlm.nih.gov/pmc/ (accessed on 14 May 2025).
  397. Li, Y.; Du, Y.; Zhou, K.; Wang, J.; Zhao, X.; Wen, J.R. Evaluating Object Hallucination in Large Vision-Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 292–305. [Google Scholar] [CrossRef]
  398. Smith, S.; Patwary, M.; Norick, B.; LeGresley, P.; Rajbhandari, S.; Casper, J.; Liu, Z.; Prabhumoye, S.; Zerveas, G.; Korthikanti, V.; et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. arXiv 2022, arXiv:2201.11990. [Google Scholar] [CrossRef]
  399. Lewis, P.; Wu, Y.; Liu, L.; Minervini, P.; Küttler, H.; Piktus, A.; Stenetorp, P.; Riedel, S. PAQ: 65 Million Probably-Asked Questions and What You Can Do with Them. Trans. Assoc. Comput. Linguist. 2021, 9, 1098–1115. [Google Scholar] [CrossRef]
  400. Wagner, P.; Strodthoff, N.; Bousseljot, R.D.; Kreiseler, D.; Lunze, F.I.; Samek, W.; Schaeffter, T. PTB-XL, a large publicly available electrocardiography dataset. Sci. Data 2020, 7, 154. [Google Scholar] [CrossRef] [PubMed]
  401. Strodthoff, N.; Mehari, T.; Nagel, C.; Aston, P.J.; Sundar, A.; Graff, C.; Kanters, J.K.; Haverkamp, W.; Dössel, O.; Loewe, A.; et al. PTB-XL+, a comprehensive electrocardiographic feature dataset. Sci. Data 2023, 10, 279. [Google Scholar] [CrossRef]
  402. Ge, T.; Hu, J.; Wang, L.; Wang, X.; Chen, S.Q.; Wei, F. In-context Autoencoder for Context Compression in a Large Language Model. arXiv 2023, arXiv:2307.06945. [Google Scholar] [CrossRef]
  403. Miceli Barone, A.V.; Sennrich, R. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv 2017, arXiv:1707.02275. [Google Scholar] [CrossRef]
  404. Bahrami, M.; Shrikanth, N.C.; Ruangwan, S.; Liu, L.; Mizobuchi, Y.; Fukuyori, M.; Chen, W.P.; Munakata, K.; Menzies, T. PyTorrent: A Python Library Corpus for Large-scale Language Models. arXiv 2021, arXiv:2110.01710. [Google Scholar] [CrossRef]
  405. Anantha, R.; Vakulenko, S.; Tu, Z.; Longpre, S.; Pulman, S.; Chappidi, S. Open-Domain Question Answering Goes Conversational via Question Rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 520–534. [Google Scholar] [CrossRef]
  406. Rogers, A.; Kovaleva, O.; Downey, M.; Rumshisky, A. Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8722–8731. [Google Scholar] [CrossRef]
  407. Pang, R.Y.; Parrish, A.; Joshi, N.; Nangia, N.; Phang, J.; Chen, A.; Padmakumar, V.; Ma, J.; Thompson, J.; He, H.; et al. QuALITY: Question Answering with Long Input Texts, Yes! arXiv 2021, arXiv:2112.08608. [Google Scholar] [CrossRef]
  408. Tafjord, O.; Gardner, M.; Lin, K.; Clark, P. QuaRTz: An Open-Domain Dataset of Qualitative Relationship Questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5941–5946. [Google Scholar] [CrossRef]
  409. Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.t.; Choi, Y.; Liang, P.; Zettlemoyer, L. QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2174–2184. [Google Scholar] [CrossRef]
  410. Hosking, T.; Lapata, M. Factorising Meaning and Form for Intent-Preserving Paraphrasing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 1405–1418. [Google Scholar] [CrossRef]
  411. Gupta, A.; Agarwal, A.; Singh, P.; Rai, P. A Deep Generative Framework for Paraphrase Generation. Proc. AAAI Conf. Artif. Intell. 2018, 32, 5149–5156. [Google Scholar] [CrossRef]
  412. Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 785–794. [Google Scholar] [CrossRef]
  413. ParticleMedia. RAGTruth. Available online: https://github.com/ParticleMedia/RAGTruth (accessed on 14 May 2025).
  414. Zhang, S.; Liu, X.; Liu, J.; Gao, J.; Duh, K.; Van Durme, B. ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension. arXiv 2018, arXiv:1810.12885. [Google Scholar] [CrossRef]
  415. Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y.; Smith, N.A. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 3356–3369. [Google Scholar] [CrossRef]
  416. Völske, M.; Potthast, M.; Syed, S.; Stein, B. TL;DR: Mining Reddit to Learn Automatic Summarization. In Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark, 7 September 2017; pp. 59–63. [Google Scholar] [CrossRef]
  417. Lin, B.Y.; Wu, Z.; Yang, Y.; Lee, D.H.; Ren, X. RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 1504–1515. [Google Scholar] [CrossRef]
  418. Ebner, S.; Xia, P.; Culkin, R.; Rawlins, K.; Van Durme, B. Multi-Sentence Argument Linking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8057–8077. [Google Scholar] [CrossRef]
  419. Lu, Y.; Liu, S.; Zhang, Q.; Xie, Z. RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model. arXiv 2023, arXiv:2308.05345. [Google Scholar] [CrossRef]
  420. Gliwa, B.; Mochol, I.; Biesek, M.; Wawer, A. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, Hong Kong, China, 4 November 2019; pp. 70–79. [Google Scholar] [CrossRef]
  421. Ordonez, V.; Kulkarni, G.; Berg, T. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Granada Congress and Exhibition Centre, Granada, Spain, 12–17 December 2011. [Google Scholar]
  422. Hudson, D.A.; Manning, C.D. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv 2019, arXiv:1902.09506. [Google Scholar] [CrossRef]
  423. Scoliosis Research Society. 1966. Available online: https://www.srs.org/ (accessed on 14 May 2025).
  424. Dunn, M.; Sagun, L.; Higgins, M.; Guney, V.U.; Cirik, V.; Cho, K. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv 2017, arXiv:1704.05179. [Google Scholar] [CrossRef]
  425. Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv 2022, arXiv:2212.10560. [Google Scholar] [CrossRef]
  426. Sap, M.; Rashkin, H.; Chen, D.; Le Bras, R.; Choi, Y. Social IQa: Commonsense Reasoning about Social Interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 4463–4473. [Google Scholar] [CrossRef]
  427. Kim, H.; Hessel, J.; Jiang, L.; West, P.; Lu, X.; Yu, Y.; Zhou, P.; Le Bras, R.; Alikhani, M.; Kim, G.; et al. SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12930–12949. [Google Scholar] [CrossRef]
  428. Pasupat, P.; Liang, P. Compositional Semantic Parsing on Semi-Structured Tables. arXiv 2015, arXiv:1508.00305. [Google Scholar] [CrossRef]
  429. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
  430. Alt, C.; Gabryszak, A.; Hennig, L. TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1558–1569. [Google Scholar] [CrossRef]
  431. Berabi, B.; He, J.; Raychev, V.; Vechev, M.T. TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
  432. Centre for Research on the Epidemiology of Disasters (CRED); United Nations Office for Disaster Risk Reduction (UNDRR). The Human Cost of Disasters (2000–2019); UNDRR: Geneva, Switzerland, 2020. [Google Scholar]
  433. Kocetkov, D.; Li, R.; Ben Allal, L.; Li, J.; Mou, C.; Muñoz Ferrandis, C.; Jernite, Y.; Mitchell, M.; Hughes, S.; Wolf, T.; et al. The Stack: 3 TB of permissively licensed source code. arXiv 2022, arXiv:2211.15533. [Google Scholar] [CrossRef]
  434. Zhuang, Y.; Yu, Y.; Wang, K.; Sun, H.; Zhang, C. ToolQA: A Dataset for LLM Question Answering with External Tools. arXiv 2023, arXiv:2306.13304. [Google Scholar] [CrossRef]
  435. Adlakha, V.; Dhuliawala, S.; Suleman, K.; de Vries, H.; Reddy, S. TopiOCQA: Open-domain Conversational Question Answering with Topic Switching. Trans. Assoc. Comput. Linguist. 2022, 10, 468–483. [Google Scholar] [CrossRef]
  436. Voorhees, E.; Alam, T.; Bedrick, S.; Demner-Fushman, D.; Hersh, W.R.; Lo, K.; Roberts, K.; Soboroff, I.; Wang, L.L. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. arXiv 2020, arXiv:2005.04474. [Google Scholar] [CrossRef]
  437. Qian, H.; Liu, Z.; Zhang, P.; Mao, K.; Lian, D.; Dou, Z.; Huang, T. MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation. arXiv 2024, arXiv:2409.05591. [Google Scholar] [CrossRef]
  438. Honovich, O.; Scialom, T.; Levy, O.; Schick, T. Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 14409–14428. [Google Scholar] [CrossRef]
  439. Wang, X.; Wu, J.; Chen, J.; Li, L.; Wang, Y.F.; Wang, W.Y. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. arXiv 2019, arXiv:1904.03493. [Google Scholar] [CrossRef]
  440. Liu, M.; Pinckney, N.; Khailany, B.; Ren, H. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. arXiv 2023, arXiv:2309.07544. [Google Scholar] [CrossRef]
  441. Agrawal, A.; Lu, J.; Antol, S.; Mitchell, M.; Zitnick, C.L.; Batra, D.; Parikh, D. VQA: Visual Question Answering. arXiv 2015, arXiv:1505.00468. [Google Scholar] [CrossRef]
  442. Chang, Y.; Narang, M.; Suzuki, H.; Cao, G.; Gao, J.; Bisk, Y. WebQA: Multihop and Multimodal QA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Ernest N. Morial Convention Center, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  443. Shang, L.; Lu, Z.; Li, H. Neural Responding Machine for Short-Text Conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 1577–1586. [Google Scholar] [CrossRef]
  444. Cohen, D.; Yang, L.; Croft, W.B. WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval. arXiv 2018, arXiv:1805.03797. [Google Scholar] [CrossRef]
  445. WikiEval. 2023. Available online: https://huggingface.co/datasets/explodinggradients/WikiEval (accessed on 14 May 2025).
  446. Asai, A.; Yu, X.; Kasai, J.; Hajishirzi, H. One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021. [Google Scholar]
  447. Sakaguchi, K.; Le Bras, R.; Bhagavatula, C.; Choi, Y. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8732–8740. [Google Scholar] [CrossRef]
  448. Maekawa, S.; Iso, H.; Gurajada, S.; Bhutani, N. Retrieval Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 5506–5521. [Google Scholar] [CrossRef]
  449. Tedeschi, S.; Conia, S.; Cecconi, F.; Navigli, R. Named Entity Recognition for Entity Linking: What Works and What‘s Next. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 2584–2596. [Google Scholar] [CrossRef]
  450. Pilehvar, M.T.; Camacho-Collados, J. WiC: The Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 1267–1273. [Google Scholar] [CrossRef]
  451. Liu, A.; Swayamdipta, S.; Smith, N.A.; Choi, Y. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 6826–6847. [Google Scholar] [CrossRef]
  452. Asghar, N. Yelp Dataset Challenge: Review Rating Prediction. arXiv 2016, arXiv:1605.05362. [Google Scholar] [CrossRef]
  453. Yelp. Yelp Open Dataset. Available online: https://business.yelp.com/data/resources/open-dataset/ (accessed on 14 May 2025).
  454. Irwin, J.J.; Sterling, T.; Mysinger, M.M.; Bolstad, E.S.; Coleman, R.G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757–1768. [Google Scholar] [CrossRef] [PubMed]
Figure 1. PRISMA 2020 flow diagram showing the stages of article selection in this systematic review.
Figure 2. Yearly distribution of identified articles from 2020 to 2025.
Figure 3. Distribution of studies by domain: this bar chart shows the percentage of included studies in each application area.
Figure 4. High-level schematic of the RAG stack synthesised in this review, showing knowledge & memory, data stores, retrieval & generation pipeline, control layer, and evaluation metrics.
Table 1. Research questions guiding this systematic review.
Index | Research Question | Goal
RQ1 | What thematic topics have been addressed by highly cited RAG studies? | Summarises the main topics in the field, outlining the state of knowledge and identifying gaps in the literature.
RQ2 | What innovative methods and approaches extend the standard RAG framework? | Provides an overview of current research, assisting researchers and engineers in identifying common methodologies, existing studies, and novel approaches.
RQ3 | What metrics are most frequently used to evaluate the effectiveness of RAG systems? | Identifies relevant metrics to support meaningful comparative analyses, essential for benchmarking and advancing the field.
RQ4 | What challenges and limitations are associated with RAG techniques? | Highlights research gaps and opportunities for proposing solutions or suggesting areas for further exploration.
Table 2. Search queries used with each database.
Database | Query
ACM Digital Library | Title: (retrieval AND augmented AND generation) OR Abstract: (retrieval AND augmented AND generation)
IEEE Xplore | (“Document Title”: retrieval augmented generation) OR (“Publication Title”: retrieval augmented generation) OR (“Abstract”: retrieval augmented generation)
Scopus | TITLE-ABS-KEY (retrieval AND augmented AND generation)
ScienceDirect | Title, abstract, keywords: retrieval AND augmented AND generation
DBLP | retrieval augmented generation
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Brown, A.; Roman, M.; Devereux, B. A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges. Big Data Cogn. Comput. 2025, 9, 320. https://doi.org/10.3390/bdcc9120320

