A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges
Abstract
1. Introduction
2. Related Work
3. Methodology
3.1. Systematic Review Framework Selection
3.2. Database Selection
- ACM Digital Library: https://dl.acm.org/.
- IEEE Xplore: https://ieeexplore.ieee.org/.
- Scopus: https://www.scopus.com/.
- ScienceDirect: https://www.sciencedirect.com/.
- Digital Bibliography and Library Project (DBLP): https://dblp.org/.
3.3. Inclusion and Exclusion Criteria
- Inclusion Criteria:
- Focus: Studies must address RAG or similar systems that rely on retrieval to support text output.
- Publication Date and Citations: Only works published from January 2020 to May 2025 are accepted. For 2025 publications, a minimum of 15 citations is required; for those from 2024 or earlier, at least 30 citations are needed (see the filter sketch after this list).
- Original Contributions: Only works that present new experimental data or novel conceptual contributions are considered.
- Input and Output: Studies may use various input types (e.g., text, images, audio) if retrieval is central, but the final output must be text.
- Exclusion Criteria:
- Relevance: Works that do not pertain to the topic are removed.
- Language: Studies not published in English are excluded.
- Duplicates and Access: Duplicate works or those with unavailable full text are omitted.
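To make the date-and-citation rule above concrete, the following is a minimal sketch of the screening filter; the function name and record fields are illustrative assumptions, not part of the review protocol itself.

```python
from datetime import date

def meets_inclusion_criteria(pub_date: date, citations: int) -> bool:
    """Apply the date and citation thresholds from Section 3.3.

    Works from January 2020 to May 2025 are eligible; 2025 papers
    need at least 15 citations, earlier papers at least 30.
    """
    if not (date(2020, 1, 1) <= pub_date <= date(2025, 5, 31)):
        return False
    required = 15 if pub_date.year == 2025 else 30
    return citations >= required

# Example: a 2024 paper with 28 citations fails; one with 31 passes.
assert not meets_inclusion_criteria(date(2024, 6, 1), 28)
assert meets_inclusion_criteria(date(2024, 6, 1), 31)
```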
3.4. Search Strategy and Terms
3.5. Search Process
3.6. Screening and Study Selection
3.6.1. Initial Screening
3.6.2. Full-Text Screening
3.7. Data Extraction
3.7.1. Data Extraction Methodology: Domains, Specific Tasks, Techniques and Results
3.7.2. Dataset Identification Methodology
3.8. Use of Generative AI
3.9. Potential Biases and Mitigations
4. Results
4.1. Excluded Studies
- Irrelevance of Primary Focus (n = 7): Papers whose primary contributions lay outside retrieval-augmented generation, e.g., robustness of dense retrieval, long-context benchmarks, general GenIR evaluation, or system-level optimisations, where RAG appeared only as a peripheral baseline or illustrative example [27,28,29,30,31,32,33].
- Insufficient Emphasis or Ancillary Treatment (n = 7): Studies that incorporated RAG merely as an auxiliary component within broader investigations (LLM-human hybrids for marketing research, domain-specific LLM development, knowledge-graph construction workflows, multimodal agent toolkits, healthcare task automation, cost-effective classification, or materials-modelling pipelines) without substantive, dedicated analysis of RAG itself [34,35,36,37,38,39,40].
- Methodological Distinction (n = 2): Works focused on paradigms conceptually distinct from RAG, specifically generative retrieval and generation-augmented retrieval, which invert the standard RAG pipeline by predicting document identifiers rather than conditioning generation on retrieved content [41,42].
4.2. Yearly Distribution of Identified Articles
4.3. Domain Characteristics of Included Studies
5. Discussion
5.1. What Are the Key Topics That Are Already Addressed in RAG?
5.1.1. Retrieval Mechanism
5.1.2. Vector Database
5.1.3. Document Chunking
- Knowledge graphs: aggregating graph triples into textual statements for embedding [68].
- Legal documents: breaking cases into (question, snippet, entity, answer) tuples [69].
- Biomedical texts: micro-chunking into fixed five-token units to capture fine-grained concepts [70] (see the chunking sketch after this list).
- Multimodal inputs: splitting image–text pairs into aligned patches or entries for vision–language RAG [58].
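As referenced in the biomedical item above, a minimal sketch of fixed-size token chunking with overlap; the whitespace tokenizer and the parameter defaults are illustrative assumptions standing in for the cited studies' actual pipelines (which typically use subword tokenizers).

```python
def chunk_tokens(text: str, chunk_size: int = 5, overlap: int = 0) -> list[str]:
    """Split text into fixed-size token windows.

    Whitespace tokenization stands in for a real subword tokenizer.
    """
    tokens = text.split()
    step = max(chunk_size - overlap, 1)
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), step)
        if tokens[i:i + chunk_size]
    ]

# Five-token micro-chunks, as in the biomedical example above.
print(chunk_tokens("BRCA1 mutations increase hereditary breast cancer risk in carriers"))
```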
5.1.4. Vector Encoders
- Foundation & specialised models. API-driven encoders (e.g., text-embedding-ada-002, text-embedding-3-small/large) and openly released models (Dragon, E5, BGE) deliver broad coverage with minimal tuning [74,78,79]. Domain-adapted variants (MedLLaMA-13B for biomedicine [70], PubMedBERT for clinical language [57], CodeBERT/CodeT5 for source code) demonstrate versatility in specialised vocabularies [67,80].
- Sparse–dense hybrids. The Elastic Learned Sparse EncodeR (ELSER) integrates learnt sparse representations with dense sentence embeddings, balancing latency and recall [81] (a score-fusion sketch follows this list).
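A minimal sketch of sparse–dense score fusion in the spirit of the hybrids above, assuming precomputed sparse (e.g., BM25) and dense (cosine-similarity) scores per document; the min-max normalisation and convex combination are a common fusion recipe, not ELSER's actual mechanism.

```python
import numpy as np

def hybrid_scores(sparse: np.ndarray, dense: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Convex combination of min-max-normalised sparse and dense scores."""
    def norm(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(sparse) + (1 - alpha) * norm(dense)

bm25 = np.array([12.3, 4.1, 8.7])      # sparse (e.g., BM25) scores per document
cosine = np.array([0.82, 0.91, 0.40])  # dense embedding similarities
print(hybrid_scores(bm25, cosine).argsort()[::-1])  # document indices, best first
```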
5.1.5. Training
5.1.6. Generation Model
5.1.7. Generative Model Families
5.2. What Are the Innovative Methods and Approaches Compared to the Standard Retrieval Augmented Generation?
5.2.1. Pre-Retrieval & Post-Retrieval Stages: The Plumbing That Keeps RAG Watertight
- Pre-Retrieval: How We Feed the Index
- Post-Retrieval: What We Pass to the Model
5.2.2. Prompting & Query Strategies: Making the Front-End Intelligent
5.2.3. Hybrid and Specialised Retrievers: No Single Needle-Finder
5.2.4. Structure-Aware & Graph-Based RAG: “Talk to Me in Triples, Not Tokens”
5.2.5. Iterative & Active Retrieval Loops: From Static Context to Conversational Search
5.2.6. Memory-Augmented RAG: Personalisation and Long-Horizon Context
5.2.7. Agentic & Multi-Tool Pipelines: Orchestrating Reasoning, Tools and Memory
5.2.8. Efficiency & Compression: Token Budgets Still Matter
5.2.9. Modality Expansion: RAG Beyond Plain Text
5.2.10. Synthesis & Outlook
5.3. What Are the Most Frequently Used Metrics for Evaluating the Effectiveness of Retrieval-Augmented Generation Systems?
5.3.1. Automatic Generation Metrics
- Specialised Diversity & Grounding Metrics
5.3.2. Automatic Retrieval Metrics
5.3.3. Other Automated Metrics
- Computational Efficiency
- Robustness & Error Handling
- Contextual Bias
- Image- and Code-Specific Metrics
- Performance Comparison
- Discussion & Recommendations
- Modular Reporting: Package each specialised metric within containerised pipelines to facilitate deployment.
- Benchmark Extensions: Propose extensions to popular RAG benchmarks (e.g., adding hallucination annotations to QA datasets).
- Open-Source Toolkits: Contribute wrappers for less common metrics, such as ES and contextual bias, to public evaluation libraries.
5.3.4. Human Evaluation Metrics
- Correctness & Accuracy
- Relevance
- Hallucination & Groundedness
- Factual Correctness & Consistency
- Comprehensiveness & Quality
- User-Centric Metrics
- Annotation Protocols & Reliability
- Strengths, Limitations & Recommendations
5.3.5. LLM-As-Judge Metrics
- Accuracy via Advanced LLM Verification
- GPT-Based Correctness and Quality Ratings
- Benchmarking Against GPT-4 Judgements
- Harmfulness and Safety Classification
- LLM-Fact-Checker Chains
- G-EVAL: Comprehensive LLM-Judged Evaluation
- Semantic Accuracy via LLM Instruction Models
- Discussion & Recommendations
5.3.6. Automated Frameworks
5.3.7. Holistic Evaluation of RAG Benchmarks
- Connecting the Four Pillars of RGB to Broader RAG Metrics
- Quantitative Meets Qualitative: Trade-Offs in Evaluation
- Domain-Specific Demands and Broader Trends
- Methodological Reflections: Why These Metrics?
- Practical Implications and Future Directions
5.3.8. Datasets
5.4. What Are the Key Challenges and Limitations Associated with Retrieval-Augmented Generation Techniques?
5.4.1. Noise, Heterogeneity, and Multimodal Alignment
5.4.2. Domain Shift, Dataset Alignment, and Generalisation
5.4.3. Modular Pipelines and Error Cascades
5.4.4. Large-Language-Model Constraints and Safety Risks
5.4.5. Security Threats in Retrieval-Augmented Generation
5.4.6. Synthesis and Outlook
- Concluding Remarks
6. Future Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| RAG | Retrieval-Augmented Generation |
| LLMs | Large Language Models |
| NLP | Natural Language Processing |
| PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses |
| DPR | Dense Passage Retriever |
Appendix A. Inter-Rater Agreement for Screening Decisions
|  | Reviewer 2: Include | Reviewer 2: Exclude | Row Total |
|---|---|---|---|
| Reviewer 1: Include |  |  | 141 |
| Reviewer 1: Exclude |  |  | 61 |
| Column total | 138 | 64 | 202 |
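Since this appendix reports inter-rater agreement, the following sketch computes Cohen's kappa from a completed 2×2 decision table. The diagonal cell counts below are placeholder assumptions (the marginals above fix only row and column totals), so the resulting value is illustrative, not the review's reported statistic.

```python
def cohens_kappa(table: list[list[int]]) -> float:
    """Cohen's kappa for a 2x2 include/exclude decision table.

    table[i][j] = count where reviewer 1 chose i and reviewer 2 chose j
    (0 = include, 1 = exclude).
    """
    n = sum(sum(row) for row in table)
    p_observed = sum(table[i][i] for i in range(2)) / n
    p_expected = sum(
        (sum(table[i]) / n) * (sum(row[i] for row in table) / n)
        for i in range(2)
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical cell counts consistent with the marginals above
# (row totals 141/61, column totals 138/64): 134 joint includes,
# 7 include-only, 4 exclude/include, 57 joint excludes.
print(round(cohens_kappa([[134, 7], [4, 57]]), 3))
```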
Appendix B. Study Characteristics Extracted from the Systematic Review
| Datasets | Chunking Mechanism | Retrieval Mechanism | Vector Space Encoder | Generation Model |
|---|---|---|---|---|
| Domain: Knowledge-Intensive Tasks | | | | |
| Domain: Open-Domain Question Answering | | | | |
| Domain: Software Engineering | | | | |
| Domain: Medical | | | | |
| Domain: Other | | | | |
| Domain: Evaluation | | | | |
| Domain: Multimodal | | | | |
| Domain: Conversational AI | | | | |
| Domain: Security/Vulnerabilities | | | | |
| Domain: Biomedical | | | | |
| Domain: Education | | | | |
| Domain: Information Extraction | | | | |
| Domain: Financial | | | | |
Appendix C. Datasets Table (Extracted from SLR)
| Dataset Name | Content Description | Intended Use | Citation Frequency |
|---|---|---|---|
| Natural Questions (NQ) [190] | 323,045 QA examples across train/dev/test splits. | Train and evaluate open-domain QA systems. | 27 |
| HotPotQA [196] | 113,000 multi-hop QA pairs. | Train/test QA with multi-hop reasoning and explanations. | 26 |
| Wikipedia [1] | 6 million articles of text and metadata. | General corpus of Wikipedia text for NLP tasks. | 19 |
| TriviaQA (TQA) [210] | 96,000 QA pairs with six supporting documents each. | Develop comprehension models requiring complex inference. | 18 |
| 2WikiMultihopQA (2WikiMQA) [197] | 192,606 multi-hop QA pairs from Wiki data. | Multi-hop QA using structured and unstructured sources. | 11 |
| Multihop Questions via Single-hop Question Composition (MuSiQue) [211] | 25,000 2–4-hop questions (50,000 with contrast). | Multi-hop QA by composing single-hop questions. | 9 |
| Fact Extraction and VERification (FEVER) [212] | 185,445 claims annotated with evidence. | Verify claims using Wikipedia as the textual source. | 8 |
| Microsoft MAchine Reading COmprehension (MS MARCO) [191] | 100,000 questions and 1 M passages from web docs. | Reading comprehension and QA from real web data. | 8 |
| StrategyQA [213] | 2780 yes/no questions with step-by-step reasoning. | Benchmark Boolean QA needing implicit multi-hop reasoning. | 8 |
| Wizard of Wikipedia (WoW) [214] | 22,311 dialogues (202K utterances) using Wiki info. | Dialogue with a “wizard” answering via Wikipedia. | 8 |
| WebQuestions (WebQ) [215] | 6642 QA pairs from real user web queries. | Semantic parsers using Freebase KG. | 7 |
| Arc-Challenge [216] | 2590 science multiple-choice questions. | Benchmark deep-reasoning QA systems. | 5 |
| Explain Like I’m Five (ELI5) [217] | 72k QA pairs with supporting web documents. | Long-form QA understandable by five-year-olds. | 5 |
| Massive Multitask Language Understanding (MMLU) [218] | Multiple-choice questions spanning 57 subjects. | Benchmark broad knowledge and reasoning coverage. | 5 |
| NarrativeQA [219] | 1572 narratives, 46,765 QA pairs. | QA over long narratives and summaries. | 5 |
| PopQA [220] | 14,000 Wikipedia QA pairs across 16 relations. | QA focusing on Wikidata relationship types. | 5 |
| WebQuestions Semantic Parses (WebQSP) [221] | SPARQL queries for 4737 questions, 1073 partial. | KB-QA research using Freebase semantic parses. | 5 |
| Wikipedia English (December 2018) [222] | 21 million passages from December 2018 English Wikipedia. | Passage corpus for retrieval and QA tasks. | 5 |
| Answer Summaries for Questions which are Ambiguous (ASQA) [223] | 12,632 ambiguous QA annotations. | Long-form QA for ambiguous factoid questions. | 4 |
| OpenbookQA (OBQA) [224] | 6k science MCQs with 1326 core facts. | Multi-hop science QA using core facts. | 4 |
| Stanford Question Answering Dataset (SQuAD) [225] | 23k passages, 108k questions (span answers). | Reading comprehension with span answers. | 4 |
| Triple-based Relation Extraction (TREx) [226] | 3.09 million abstracts with 11 million triples. | Relation extraction and KB population tasks. | 4 |
| TruthfulQA [227] | 817 questions across 38 categories. | Evaluate factual consistency in QA. | 4 |
| Zero Shot RE (zsRE) [228] | Over 30 M QA examples for relation extraction. | Zero-shot relation extraction without examples. | 4 |
| Conversational Question Answering (CoQA) [229] | 127k questions from 8k multi-turn dialogues. | Build conversational QA systems. | 3 |
| MultifieldQA-en (MFQA) [230] | 150 docs, 150 cases, 4.6k words each. | Single-document long-context QA. | 3 |
| Physical Interaction: Question Answering (PIQA) [231] | 16,000 physical commonsense MCQs. | Reason about everyday physical tasks. | 3 |
| PubMedQA [195] | PubMed abstracts QA (yes/no/maybe) | Biomedical QA benchmarking. | 3 |
| Qasper (QASP) [232] | 416 papers, 371 cases, 4.7k tokens per doc | Academic QA over research papers. | 3 |
| Unified Medical Language System (UMLS) [233] | Integrated biomedical vocabularies. | Standardize medical terminologies. | 3 |
| Wikipedia Aspect-based summarization (WikiAsp) [234] | 320,272 docs with section-title aspects. | Aspect-based summarization of Wikipedia articles. | 3 |
| WikiQA [235] | 3047 questions with Wikipedia candidate sentences. | Evaluate answer-sentence selection in QA. | 3 |
| Bamboogle [236] | 125 handcrafted 2-hop reasoning questions. | Evaluate compositional reasoning capabilities. | 2 |
| BioASQ [237] | 4k+ PDFs and 1k domain-specific questions | Biomedical retrieval & QA tasks. | 2 |
| BoolQ [238] | 16,000 yes/no questions with passages. | Boolean question answering | 2 |
| C Code Summarization Dataset (CCSD) [51] | 95k function–summary pairs. | Source code summarization. | 2 |
| CNN/Daily Mail [239] | News articles paired with human-written summaries. | Summarization and hallucination benchmarking. | 2 |
| General Language Understanding Evaluation benchmark for code (CodeXGLUE) [240] | Millions of code–NL pairs across tasks. | Code understanding and generation. | 2 |
| CodeSearchNet (CSNet) [199] | 6 M functions, 2 M docstring pairs in six langs. | Semantic code search evaluation. | 2 |
| Colossal Clean Crawled Corpus (C4) [241] | Billions of English tokens from web. | Unsupervised pre-training for NLP models. | 2 |
| Common Crawl dump of the internet (CCNet) [242] | 1.5 B documents, 532 B tokens across 174 langs. | Pre-training large-scale language models. | 2 |
| Common Objects in Context (COCO) [198] | 330k images, 1.5 M captions. | Object recognition and image captioning. | 2 |
| CommonsenseQA [243] | 12,247 MCQs from ConceptNet subgraphs. | Evaluate commonsense question answering. | 2 |
| Conceptual Caption (CC) [244] | 3.3 M image–text pairs. | Pretrain vision-language models. | 2 |
| Dolly [245] | 15k human-crafted instruction–response pairs | Instruction-following model training. | 2 |
| Enron Email [167] | 500k corporate emails for PII extraction tasks | Evaluate PII detection and removal | 2 |
| ExplaGraphs [246] | 3166 belief-argument-explanation graphs. | Commonsense reasoning via explanation graphs. | 2 |
| Flickr30k [247] | 30k images with five captions each. | Image captioning research. | 2 |
| Google Search corpus (GSfull) [248] | 280k sentences from Google Search snippets. | Visual QA (OK-VQA) supporting data. | 2 |
| HellaSwag [249] | 70k multiple-choice questions from ActivityNet/WikiHow. | Commonsense reasoning evaluation. | 2 |
| Incomplete Information Reading Comprehension Questions (IIRC) [250] | 13,441 questions, 5698 paragraphs. | Challenging reading comprehension. | 2 |
| LAION [251] | Billions of image–text pairs. | Train multi-modal language-vision models. | 2 |
| MultimodalQA [252] | 30k questions, 58k images, text, tables. | Multi-modal QA requiring joint reasoning. | 2 |
| Outside-Knowledge Visual Question Answering (OKVQA) [253] | 14k visual questions needing external knowledge. | Visual QA with outside knowledge. | 2 |
| PubHealth [254] | True/false health-claim questions. | Health-claim verification. | 2 |
| PubMed Clinical Papers [255] | Millions of biomedical abstracts. | Biomedical literature retrieval. | 2 |
| QMSum [256] | Meeting transcripts with query-based summaries. | Query-focused dialogue summarization. | 2 |
| RealNews [257] | 120 GB news articles from Common Crawl. | News summarization benchmark. | 2 |
| RealTimeQA [77] | Weekly news quizzes on politics, business, entertainment. | Evaluate QA on current events requiring retrieval. | 2 |
| RepoEval [50] | Curated GitHub repos for code completion benchmarks. | Evaluate repository-level code completion. | 2 |
| WikiData [258] | Structured knowledge graph for Wikipedia. | Knowledge-base for various QA tasks. | 2 |
| Wikipedia (December 2021) [259] | 37 M passages, 78-word average. | Updated Wikipedia text corpus. | 2 |
| Wikipedia Event (WikiEvent) [260] | 246 docs, 6132 sentences, 3951 events. | Event extraction and coreference analysis. | 2 |
| WikiText [261] | 103 M words (WikiText-103); 2 M words (WikiText-2). | Evaluate long-context language modeling. | 2 |
| 1000-User Benchmark Subset [166] | 1000 user-session sample with 493 queries avg. | Train and evaluate personalized query prediction. | 1 |
| 14 De-identified Clinical Scenarios [56] | 14 anonymized patient scenarios with structured data. | Evaluate clinical query handling. | 1 |
| 2019 TREC Deep Learning track (TREC DL19) [262] | 2019 deep-learning track for passage ranking. | Benchmark passage ranking in IR. | 1 |
| 2020 TREC Deep Learning track (TREC DL20) [263] | 2020 deep-learning track for passage ranking. | Benchmark passage ranking in IR. | 1 |
| 35 Preoperative Guidelines [56] | 35 guidelines on preoperative assessment and care. | RAG knowledge for pre-op instructions. | 1 |
| ACE04 [264] | 300k words train, 50k words evaluation | Entity/relation extraction. | 1 |
| ActivityNet Captions [265] | Consists of 20,000 YouTube videos with 100k localized sentences. | Dense video event description modeling. | 1 |
| ade-corpus-v2 [266] | Sentences labeled for adverse drug reactions. | Text classification focused on ADE detection in biomedical texts. | 1 |
| Adversarial Benchmark (AdvBench) [267] | 520 harmful queries simulating jailbreak attacks. | Support defense against adversarial prompts. | 1 |
| Adversarial NLI (ANLI) [268] | Adversarial inference examples. | Evaluating the inference and reasoning robustness of language models. | 1 |
| Adverse Drug Effect (ADE) [266] | 2972 documents on adverse drug effects | Train ADE extraction models. | 1 |
| Agent-Driver [269] | 23,000 driving episodes with states, objects, reasoning chains, actions. | Retrieval-based memory for safe driving planning. | 1 |
| Aggregated flood event listings from EMSR, GDACS, and ReliefWeb [186] | Curated list of major global flood disasters. | Provide event codes for UI. | 1 |
| AGNews [270] | 496k news articles in four topics. | Topic classification in news. | 1 |
| AI Tutor [139] | Course PDFs, HTML, and video transcripts. | Retrieve source-based answers for students. | 1 |
| AIDA CoNLL-YAGO [271] | CoNLL03 news articles linked to YAGO entities. | Named entity disambiguation tasks. | 1 |
| Alzheimer’s Disease Interventions (ADInt) [272] | Pharmaceutical interventions entries. | Advance AD intervention knowledge extraction. | 1 |
| Alzheimer’s knowledge graph (AlzKB) [273] | Neo4j dump of genes, diseases, drugs with NL statements and embeddings. | Drive precise biomedical RAG for Alzheimer’s queries. | 1 |
| Amazon Book Reviews [274] | Reviews with user, product IDs, ratings. | Analyze book recommendation and sentiment. | 1 |
| Amazon Movie Reviews [275] | 42 M reviews, 10 M users, 3 M items. | Recommender-system and sentiment analysis | 1 |
| AmbigQA [276] | 14,042 ambiguous open-domain questions with rewrites. | Benchmark QA systems’ disambiguation ability. | 1 |
| American Association for the Study of Liver Diseases (AASLD) [126] | 30 liver disease clinical practice guidelines | Reference for hepatology QA tasks | 1 |
| Apnea-ECG Dataset (Sleep Apnoea Detection) [277] | 70 long ECG recordings with minute-wise apnea labels. | Detect sleep apnoea via ECG variability. | 1 |
| Arc-Easy [216] | 5197 easy science multiple-choice questions | Benchmark simple science QA | 1 |
| Australian Open Legal QA (ALQA) [192] | 232K legal docs, 69.5 M lines, 1.47 B tokens. | Legal AI research on Australian law. | 1 |
| Automatic Content Extraction 2005 (ACE 2005) [264] | 625k annotated words in English, Arabic, Chinese | Train entity, relation, event extraction. | 1 |
| Avocado Research Email Collection [278] | Corporate email archive with threads and metadata. | Retrieval-augmented personalized email drafting. | 1 |
| Bias Benchmark for Question Answering (BBQ) [279] | Multiple-choice QA testing nine social bias categories. | Diagnose representational harms in QA. | 1 |
| BigPatent [280] | 1.34 M patent documents | Abstractive text summarization. | 1 |
| Bing Search Logs [281] | Three months of anonymized Bing queries and clicks. | Build search-history memory for query suggestion. | 1 |
| BioChatter Continuous-Monitoring Benchmark Suite [209] | Growing suite of biomedical LLM workflow tasks. | Track performance over evolving system features. | 1 |
| BioChatter Knowledge-Graph Query-Generation Benchmark [209] | QA pairs with correct BioCypher graph queries. | Evaluate LLM-to-KG query translation accuracy. | 1 |
| Biography [282] | Long-form biographical narratives of various entities. | Test biographical text generation. | 1 |
| Biomedical Instructions [158] | 18k generated biomedical and clinical instruction sets. | Fine-tune models on diverse biomedical tasks. | 1 |
| Biomedical Multiple Choice Questions (MCQ) [283] | Biomedical MCQs with five answer options. | Evaluate biomedical multiple-choice QA. | 1 |
| CaseHOLD [284] | 53k multiple-choice questions on legal case holdings. | Benchmark legal question-answering systems. | 1 |
| Census/projection-disaggregated gridded population datasets [285] | 2020 global population grid disaggregated by census. | Quantify populations in flood zones. | 1 |
| Chain-of-thought [286] | Explicit multi-step reasoning demonstrations | Foster coherent stepwise reasoning | 1 |
| ChEBI-20 [287] | 33,010 molecule-caption pairs | Chemical image captioning models | 1 |
| Chemical Protein Interaction Corpus (ChemProt) [288] | 2432 PubMed abstracts annotated with interactions. | Extract chemical–protein relationships and advance biomedical relation extraction. | 1 |
| ClashEval Drug Dosage [180] | 249 QA pairs on drug dosages with perturbed contexts. | Benchmark precise dosage retrieval from text. | 1 |
| ClashEval Locations [180] | 200 QA pairs asking for place names from entries. | Test place-name retrieval under context errors. | 1 |
| ClashEval Names [180] | 200 QA pairs querying two-word proper names. | Benchmark proper-noun retrieval against noise. | 1 |
| ClashEval News [180] | 238 numeric QA pairs from AP headline excerpts. | Assess numerical answer extraction under noise. | 1 |
| ClashEval Sports Records [180] | 191 QA pairs on Olympic-record tables with perturbations. | Evaluate correct sports record retrieval. | 1 |
| ClashEval Wikipedia Dates [180] | 200 QA pairs asking for four-digit years from text. | Test year retrieval robustness under corruption. | 1 |
| Clinical Practice Guidelines [289] | Curated guideline articles from MEDITRON. | Support clinical decision-making tasks. | 1 |
| Code Refinement Dataset (CRD) [290] | 2.3 M bug-fix function pairs. | Code repair and refinement. | 1 |
| CodeMatcher [291] | 10.5 M Java methods paired with first doc sentence. | Retrieve exemplar code snippets for generation. | 1 |
| codeparrot/github-jupyter [292] | 165k Jupyter notebooks with metadata | Train code exemplar retrieval | 1 |
| Cognitive Reviewer [139] | Research PDFs analyzed and ranked for reviews. | Facilitate literature reviews via RAG. | 1 |
| ConceptNet [293] | Multilingual commonsense KG with everyday concept triples. | Augment LLM QA with retrieved commonsense subgraphs. | 1 |
| Conceptual 12M (CC12M) [294] | 12 M image–text pairs from the web. | Pretrain vision-and-language models. | 1 |
| Concode [295] | 100k train, 2k val/test of NL-to-Java examples. | Generate code from natural language. | 1 |
| Conference on Natural Language Learning 2003 (CoNLL03) [296] | 301k English/German tokens for NER. | Named-entity recognition benchmark. | 1 |
| Conference on Natural Language Learning 2004 (CoNLL04) [297] | 2k sentences for NER and SRL. | Joint NER and semantic-role labeling. | 1 |
| Conversation QA (QAConv) [298] | 10,259 conversations; 34,608 QA pairs. | QA from informative multi-turn conversations. | 1 |
| ConvFinQA (CFQA) [299] | Financial QA grounded in tables and text, requiring math. | Table comprehension and arithmetic in dialogues. | 1 |
| Corpus for Enhancement of Lay Language Synthesis (CELLS) [141] | 62,886 abstract–lay summary pairs from biomedical journals. | Simplify scientific text. | 1 |
| COVID-19 Open Research Dataset (CORD19) [194] | >140k articles on COVID-19, SARS, MERS (72k full-text). | COVID-19 literature retrieval & QA. | 1 |
| COYO-700M (COYO) [300] | 747 M image–text pairs with metadata. | Support robust vision-language models. | 1 |
| CREAK [301] | Human-authored true/false entity claims. | Fact-checking and commonsense reasoning. | 1 |
| CrossCodeEval [302] | Multilingual code completion benchmarks in four langs. | Assess cross-language code completion generalization. | 1 |
| CrossCodeLongEval [73] | 5k chunk + 5k function completions from 1500 repos. | Evaluate large-span code completion. | 1 |
| CSQA2.0 [303] | Multiple-choice commonsense QA questions. | Evaluate advanced commonsense reasoning. | 1 |
| Curated Golden Evaluation [60] | Standard queries with tickets and authoritative solutions. | Benchmark retrieval and answer accuracy. | 1 |
| CuratedTrec (CT) [304] | 867 open-domain factoid questions. | Benchmark factoid QA systems. | 1 |
| Current Events [206] | 910 multiple-choice questions from Aug–Nov 2023 U.S. news articles. | Test LLM’s ability to learn new facts via fine-tuning/RAG. | 1 |
| CXR-PRO [305] | 248,236 chest X-ray images with de-identified metadata. | Support thoracic disease detection models. | 1 |
| CyberAttack Sensing and Information Extraction (CASIE) [306] | 1000 English news articles on cybersecurity events. | Extract cybersecurity event information. | 1 |
| DailyDialog [307] | 13,118 daily-life multi-turn dialogues. | Develop human-like conversational agents. | 1 |
| Data Mining and Text Analytics Course Materials Corpus [66] | 500 pages of course textbooks, transcripts, figures. | RAG-enabled Q&A and knowledge retrieval for course. | 1 |
| De-identified electronic health records [172] | 2278 malnutrition-related clinical notes | Validate summarization and extraction | 1 |
| Defects for Java version 1.2 (Defects4J (v1.2)) [308] | 20,109 KLOC of Java code & tests with real bugs. | Evaluate automated bug repair models. | 1 |
| DialogSum [169] | 13k multi-speaker dialogues with human summaries. | Evaluate conversational summarization. | 1 |
| DigMinecraft [309] | Images and step-by-step task instructions | Minecraft planning retrieval | 1 |
| Discrete Reasoning Over Paragraphs (DROP) [310] | 96k questions requiring numeric and logical reasoning. | Benchmark discrete reasoning in QA. | 1 |
| Django [311] | NL descriptions and Django implementation code. | Evaluate NL-to-code generation on Django framework. | 1 |
| Doc2Dial (D2D) [312] | Document-grounded QA across four domains with long texts. | Benchmark passage retrieval in conversational QA. | 1 |
| DomainRAG [313] | Multiple RAG sub-datasets (extractive, noisy, etc.). | Benchmark domain-specific retrieval-augmented generation. | 1 |
| DoQA [314] | Conversational QA over cooking, travel, movie forums. | Domain-specific dialogue QA with unanswerables. | 1 |
| Drug-Drug Interactions (DDI) [315] | 1025 texts from Medline and DrugBank. | Identify and classify drug interactions. | 1 |
| Dynamed [316] | Clinically organized summaries on 3200+ topics. | Point-of-care clinical reference tool. | 1 |
| EHRAgent [317] | Four exemplar EHR cases + 700 patient “experience” records. | Complex reasoning over EHR-based patient scenarios. | 1 |
| Emotion-Specific Dialogue [318] | Chinese dialogues annotated for five emotions. | Train emotion-conditioned dialogue agents. | 1 |
| En.MC [319] | 229 multiple-choice QAs on novel contexts. | Benchmark novel-based MCQA. | 1 |
| En.QA [319] | 351 QAs on long novels (150k words context). | Test QA over very long texts. | 1 |
| Encyclopedic-VQA [320] | 221k image QA pairs linked to 16.7k entities. | Knowledge-based visual question answering. | 1 |
| EntityQuestion (EQ) [321] | 17,300 QA pairs on 24 relation types | Assess entity-centric knowledge retrieval | 1 |
| European Association for the Study of the Liver Guidelines (EASL) [322] | HCV screening, diagnosis, and treatment guidelines. | Hepatology clinical decision support. | 1 |
| Extreme Summarization (XSum) [323] | 226,711 news articles for single-sentence summaries. | Support abstractive summarization models. | 1 |
| Facebook Books [324] | User–book interactions data. | Research book recommendation systems. | 1 |
| Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) [325] | 87,026 claims with text and table evidence. | Automate claim verification using text/tables. | 1 |
| FAct Verification from Information-seeking Questions (FaVIQAmbig) [326] | 188,000 true/false claims from info-seeking queries. | Generate and assess factual QA claims. | 1 |
| FactKG [327] | Claims aligned with knowledge-graph triples. | Assess verification over structured KG. | 1 |
| Factual Recall Questions [98] | 30 metadata-style queries (author, decision year, citation, etc.). | Assess factual recall accuracy in legal RAG. | 1 |
| FACTUALITYPROMPTS [328] | Prompts targeting factual accuracy and entity hallucinations | Evaluate factual consistency in generation | 1 |
| False Premise Questions [98] | 22 queries embedding legally incorrect assumptions. | Probe AI’s handling of contrafactual legal prompts. | 1 |
| Fermi [329] | Estimation “Fermi problems.” | Reason about numeric magnitudes and estimates. | 1 |
| Fifty-Four Question-Answer Pairs for Few-Shot Learning [183] | 54 hepatologist-crafted QA examples. | Evaluate few-shot learning in clinical scenarios. | 1 |
| FinanceBench [330] | 80 docs, 141 finance QA questions | Open-book financial QA | 1 |
| Financial News [90] | 79k Chinese news articles with ChatGPT summaries. | Improve summarization and market context knowledge. | 1 |
| Financial Reports [90] | 120k equity research reports with same-day price data. | Teach LLMs technical analysis and trend prediction. | 1 |
| Financial Reports CoT [90] | 200 CoT annotations on financial report predictions. | Teach rationale-rich stock movement predictions. | 1 |
| FLAN [108] | Natural language instructions for zero-shot learning | Boost zero-shot performance and generalization | 1 |
| FloodBrain ablation study dataset [186] | 26 paired human and FloodBrain flood reports | Evaluate pipeline component impact. | 1 |
| FloodBrain evaluation dataset [186] | 10 human vs. 10 FloodBrain-generated flood reports. | Compare generated vs. human summaries. | 1 |
| FreebaseQA [331] | 28k trivia-style QAs mapped to Freebase entities. | KB-grounded question answering. | 1 |
| FreshQA [332] | 600 questions with rapidly changing answers. | Test QA on dynamic answers needing external search. | 1 |
| Gaokao-MM [333] | 646 MCQs across 8 subjects with 897 images. | Test multimodal perception and reasoning. | 1 |
| Gender-Specific Dialogue [334] | Chinese dialogues labeled by speaker gender. | Model gendered linguistic features. | 1 |
| General Legal Research [98] | 80 open-ended legal research questions (common-law, bar exams, doctrine). | Benchmark legal-AI retrieval for practicing attorneys. | 1 |
| GIT [335] | Biomedical triple-extraction dataset for non-drug therapies. | Support biomedical relation extraction models. | 1 |
| GIT Relation Extraction (GITRE) [335] | Sentences with head/tail entities and relations. | Predict relationships between biomedical entities. | 1 |
| GPT-Generated Answer Evaluation Corpus [59] | 100 answers with TA and automated correctness labels. | Quantify model factual accuracy metrics. | 1 |
| GraphQA [48] | Integrates ExplaGraphs, SceneGraphs, WebQSP into QA. | Graph-based QA benchmark. | 1 |
| GSM-HARD [336] | GSM8K variant with larger numeric values | Test arithmetic robustness | 1 |
| GSM8K [337] | 8.5k grade-school math word problems | Benchmark multi-step math reasoning | 1 |
| HANS [338] | Heuristic-bias evaluation for NLI. | Test NLI heuristic vulnerability. | 1 |
| Harry Potter Series (Books3 subset) [339] | Full text of seven books (1 M words). | Study model memorization and extraction from training. | 1 |
| Harvard Law Case Corpus [340] | Extensive collection of Harvard Law case texts. | Pretrain/fine-tune legal language models. | 1 |
| Harvard-FairVLMed [341] | Multimodal fundus images with associated textual data. | Fairness evaluation in ophthalmic vision-language. | 1 |
| HealthcareMagic-101 [342] | 200k doctor-patient medical dialogues | Model sensitive medical conversational contexts | 1 |
| Hearthstone [343] | Game-card logic code paired with card names. | Benchmark NL-to-code on game logic generation. | 1 |
| Historical Issue Tickets [60] | Customer service tickets parsed into hierarchical trees. | Improve retrieval/QA over support tickets. | 1 |
| Hospital Neurology Discharge Summaries [165] | 100 anonymized neurology discharge summaries. | Personalize advice and track recovery via memory. | 1 |
| Human-Edited Counterfactuals Subset of IMDb [174] | 1.7K movie reviews manually sentiment-inverted. | Augment data via sentiment counterfactuals. | 1 |
| Human-Generated Responses [56] | Free-text pre-op instructions by junior doctors | Baseline pre-op instruction generation | 1 |
| HumanEval [344] | 164 Python programming tasks with unit tests. | Evaluate code generation correctness. | 1 |
| HumanEval+ [345] | 164 tasks with 80× more test cases | Robustness evaluation for code generation | 1 |
| HybriDialogue (HDial) [346] | QA on hybrid pages (text + tables) in conversation. | Mixed-modal conversational reasoning. | 1 |
| IMDB (Internet Movie Database) [347] | Subsets of movie reviews and associated metadata. | Sentiment analysis and recommendation tasks. | 1 |
| InferredBugs [129] | 6200 repos; 8280 bug-fix patches. | Support models on static-analysis bug fixes. | 1 |
| Infineon Developer Community Forum Questions [348] | Technical Q&A with expert answers. | Benchmark chatbot against forum solutions. | 1 |
| Infineon Product Documents [349] | Datasheets and product guides. | Retrieval for technical RAG systems. | 1 |
| InfoSeek [350] | 1.3 M image-QA triplets for 11k entities. | Assess external knowledge integration in VQA. | 1 |
| INSCIT [351] | Under-specified Wikipedia QA requiring clarification. | Test clarification question generation. | 1 |
| IU-Xray [352] | Chest X-rays paired with detailed diagnostic reports. | Support medical image-reporting systems. | 1 |
| Joint Research Centre Acquis (JRCAcquis) [353] | 8000 legal docs per language, 20+ EU languages. | Multilingual legal parallel corpus. | 1 |
| Jurisdiction or Time-Specific Research [98] | 70 questions on jurisdictional splits or overturned precedents. | Test RAG on time-sensitive legal rule retrieval. | 1 |
| Knowledge Intensive Language Tasks (KILT) [354] | 11 datasets for fact checking, QA, entity linking. | Unified evaluation of knowledge-intensive tasks. | 1 |
| Labeled EDGAR (LEDGAR) [193] | 846K contract provisions with 12.6K refined labels. | Contract clause classification. | 1 |
| Lambada [355] | Cloze tasks requiring broad discourse context. | Test long-range dependency in LMs. | 1 |
| Language Model Personalization (LaMP) [356] | Seven classification and generation tasks. | Benchmark personalized model outputs. | 1 |
| Lecture-Material [59] | Lecture notes, slides, exercise sheets corpus. | RAG retrieval for course-related queries. | 1 |
| LegalBench Collection [357] | 50 manual legal QA pairs. | Small-scale legal QA benchmarking. | 1 |
| LightQA [140] | QA from role-playing dialogues with final utterance. | Evaluate factual QA in game dialogue contexts. | 1 |
| LightWild [358] | 462K utterances across 41K RPG episodes. | Support dialogue agents in fantasy settings. | 1 |
| LiveQA [359] | Real medical questions with long-form answers. | Evaluate clinical long-answer generation. | 1 |
| LLaVA-Instruct [102] | 158k image–instruction training pairs. | Visual instruction tuning for MLLMs. | 1 |
| Lumos-QG-Generated QA Dataset (9000 Pairs) [66] | 9000 auto-generated QA pairs from course materials. | Expand knowledge base for Alexa skill and evaluation. | 1 |
| lyft_2021 [360] | Lyft 2021 document used for chunking benchmark queries. | Benchmark document-chunking techniques. | 1 |
| Massive Multi-discipline Multimodal Understanding (MMMU) [361] | 11.5k college-level multimodal exam questions. | Expert-level multimodal reasoning evaluation. | 1 |
| Math Nation Queries [154] | 51 factual/conceptual math questions from forum. | Benchmark math QA from student discussions. | 1 |
| MathVista [362] | 6141 math problems with diagrams, charts, plots. | Evaluate multimodal math reasoning. | 1 |
| Medical Transcription Samples (MTsample) [363] | Transcriptions across 40+ clinical specialties. | Research clinical text classification patterns. | 1 |
| MedicationQA [364] | Long-form QA focused on medication queries. | Test medication-related answer accuracy. | 1 |
| MedInstruct [365] | Biomedical instructions: QA, summarization, MCQs. | Fine-tune models on diverse clinical tasks. | 1 |
| MedMCQA [366] | Multiple-choice biomedical questions | Benchmark biomedical QA systems. | 1 |
| MedQA [367] | Multiple-choice medical exam questions | Evaluate medical QA models. | 1 |
| MetaQA [368] | 400k questions covering single- and multi-hop reasoning. | Test end-to-end KG QA systems. | 1 |
| Microsoft COCO (MSCOCO) [369] | 328K images, 2.5 M labeled object instances. | Scene understanding and object detection. | 1 |
| Microsoft Research Paraphrase Corpus (MSRPC) [370] | 2.2k train, 550 val, 1.1k test paraphrase pairs. | Evaluate paraphrase detection. | 1 |
| Microsoft Research Video Description Corpus (MSVD) [371] | 1970 YouTube clips with 80k English descriptions. | Benchmark video captioning models. | 1 |
| Microsoft Research Video to Text (MSRVTT) [372] | 10,000 videos with 200k captions. | Video captioning evaluation across domains. | 1 |
| MIMIC-CXR [373] | Large public CXR images with radiology reports. | Develop chest X-ray interpretation models. | 1 |
| Minecraft Wiki [374] | Thousands of community-curated Minecraft articles | Retrieval for planning tasks | 1 |
| Mintaka [375] | Knowledge graph QA with complex, diverse questions. | Knowledge graph QA benchmark. | 1 |
| MMBench (MMB) [376] | 3k multiple-choice questions covering 20 abilities. | Benchmark fine-grained multimodal capabilities. | 1 |
| Mol-Instructions [377] | Off-the-shelf biomedical instruction tasks. | Instruction-tuning biomedical models. | 1 |
| MongoDB-Logs (Chat & Cost) [59] | Conversation logs and token-cost data. | Post hoc accuracy checks, cost calculation, and future optimisation of the chatbot service. | 1 |
| MongoDB-QA (Question Answer Pairs) [59] | 170 validated course QA pairs. | Sampled by the QAGeneration-Chain to generate quick practice exercises for students. | 1 |
| Mostly Basic Programming Problems (MBPP) [378] | 974 beginner Python problems with tests | Evaluate beginner-level code models | 1 |
| Mostly Basic Programming Problems+ (MBPP+) [345] | MBPP tasks with added test cases | Enhanced MBPP evaluation coverage | 1 |
| MovieLens100K [379] | 100k movie ratings by various users. | Benchmark recommendation algorithms. | 1 |
| MS-CXR [380] | 1153 chest X-rays with paired radiology reports. | Evaluate CXR interpretation and report models. | 1 |
| Multi-Domain Wizard-of-Oz version 2.1 (MultiWOZ 2.1) [381] | 10,438 dialogs across seven domains with slots. | Develop and benchmark multi-domain dialogue. | 1 |
| Multi-Genre Natural Language Inference (MNLI) [382] | 433k sentence pairs labeled entailment/contradiction/neutrality. | Evaluate natural language inference models. | 1 |
| Multi-programming Language Commit Message (MCMD) [383] | 2.25 M commit messages across five programming languages. | Evaluate semantic code search capabilities. | 1 |
| Multi-Sentence Reading Comprehension (MultiRC) [384] | 800 paragraphs with 6000 multi-sentence questions. | Evaluate comprehension over multi-sentence contexts. | 1 |
| Multimodal Evaluation (MME) [385] | 14 tasks in cognition and perception categories. | Standardized benchmark for multimodal LLMs. | 1 |
| Natural Language to Bash (NL2Bash) [386] | 9000+ English descriptions paired with Bash commands. | Translate natural language to shell commands. | 1 |
| Natural Language to Command Line (NLC2CMD) [387] | 100 NL-to-command evaluation examples. | Build NL-to-command translation systems. | 1 |
| New York Times (NYT) [388] | 1.8 M articles published between 1987–2007. | News summarization. | 1 |
| NewsQA [389] | 119k QA pairs from 12.7k CNN news articles. | Human-generated QA pairs for machine comprehension of news. | 1 |
| NoCaps [390] | 15k images of novel objects without MSCOCO overlap. | Evaluate novel-object captioning. | 1 |
| North American HCV Guidelines [391] | AASLD-IDSA supplemental HCV practice guidelines. | Supplementary HCV clinical reference. | 1 |
| Online Sources Nursing Knowledge JSON [165] | Scraped nursing instructions and academic papers JSON. | Supply RAG pipeline with clinical knowledge. | 1 |
| OpenQA-NQ (subset of Natural Questions) [392] | 13 M evidence blocks from Wikipedia for QA retrieval. | Open-retrieval question answering. | 1 |
| OpenStax Prealgebra Textbook [393] | Textbook sections on prealgebra. | Generate responses to real student questions. | 1 |
| OpenStreetMap Planet dump [394] | Global vector map data: roads, buildings, POIs. | Enrich flood maps with geographic data. | 1 |
| Osaka Personal Activity Trajectory [164] | 2102 daily check-in trajectories, 537 synthetic samples. | Evaluate mobility framework’s city generalization. | 1 |
| ParaSCI-ACL [395] | 28,883 scientific paraphrase training examples. | Scientific-domain paraphrase generation. | 1 |
| Patient Inquiry Dataset [165] | Timestamped patient questions during system testing. | Evaluate conversational performance and short-term memory. | 1 |
| Patient Symptom Record Dataset [165] | Daily self-reported vital signs and symptom notes. | Monitor condition changes and trigger alerts. | 1 |
| PDFTriage (PDFT) | Questions on PDF document structures. | Benchmark document-structure QA tasks. | 1 |
| PMC Full-text [396] | Full-text articles from PubMed Central. | Enable retrieval for biomedical question answering. | 1 |
| Polling-based Object Probing Evaluation (POPE) [397] | Binary yes/no questions from ground truth objects/negatives. | Assess object hallucination in V-L models. | 1 |
| Pre-training Corpus [398] | 330 B tokens from 15 high-quality sources. | Pretrain RETRO and GPT language models. | 1 |
| Probably-Asked Questions (PAQ) [399] | 65 M auto-generated QA pairs | Semi-structured KB QA knowledge base. | 1 |
| PTB-XL [400] | 21,837 12-lead ECG records with cardiologist annotations. | Arrhythmia diagnosis and zero-shot eval. | 1 |
| PTB-XL+ [401] | Adds algorithm-extracted ECG features for each record. | Detailed ECG feature analysis for diagnosis. | 1 |
| PubMed Abstract [255] | Corpus of PubMed abstracts. | Provide domain evidence for QA retrieval. | 1 |
| PwC Reading-Comprehension Corpus [402] | 241k passage-question-answer triples. | Research on large-context compression. | 1 |
| Python Code Summarization Dataset (PCSD) [403] | 150k function–docstring pairs | Code summarization. | 1 |
| PyTorrent [404] | 2 M Python methods from PyPI/Anaconda packages. | Code exemplar retrieval for Python generation. | 1 |
| QReCC [405] | Open-domain conversational QA over web docs (avg 5K words). | Zero-shot conversational retrieval and QA. | 1 |
| QuAIL [406] | 15k multiple-choice questions across varied texts. | Evaluate adaptive QA across question types. | 1 |
| QuALITY [407] | MCQs from stories/articles (multiple-choice). | Narrative comprehension evaluation. | 1 |
| QuaRTz [408] | 3864 MCQs on qualitative relationships. | Semantic and linguistic reasoning in QA. | 1 |
| Question Answering in Context (QuAC) [409] | Multi-turn dialogues over Wikipedia with answerable turns. | Conversational QA with linked long contexts. | 1 |
| Quora Question Pairs 140K (QQP) [410] | 134k train, 5k val, 5k test paraphrase pairs. | Paraphrase detection and generation. | 1 |
| Quora Question Pairs 50K (QQP) [411] | 50k paraphrase question pairs. | Paraphrase detection and generation. | 1 |
| RACE [412] | Exams-derived reading comprehension dataset. | Benchmark multi-paragraph comprehension. | 1 |
| RAG Comparison (Derived from the SPOKE KG) [57] | Biomedical questions from SPOKE KG entity associations. | Compare RAG: KG, Cypher, full-text methods. | 1 |
| RAG-Fusion Query Set [131] | Dynamically generated multi-query sets. | Enhance retrieval via rank fusion. | 1 |
| RAGTruth [413] | 18,000 LLM-generated responses with quality labels. | Benchmark hallucination detection in RAG. | 1 |
| Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) [414] | 70k passages, 120k queries | Commonsense reading comprehension | 1 |
| REALTOXICITYPROMPTS [415] | Prompts engineered to elicit toxic language. | Evaluate worst-case toxicity in outputs. | 1 |
| Reddit Webis-TLDR-17 [416] | Reddit posts paired with short summaries | Test summarization with varied tones | 1 |
| ReliefWeb flood reports [186] | Human-authored situational flood event reports. | Benchmark report factual accuracy. | 1 |
| Research Dataset [90] | 42k finance texts merging sentiment, numeric, headline tasks. | Pretrain/fine-tune LLMs on financial language. | 1 |
| Retrieval-Augmented Generation Benchmark (RGB) [178] | 1000 English & Chinese QA | Evaluate retrieval-augmented generation | 1 |
| RiddleSense [417] | 5000 riddles with answer options requiring creative reasoning. | Challenge models on linguistic creativity and commonsense. | 1 |
| Roles Across Multiple Sentences (RAMS) [418] | 3993 docs, 9124 event annotations | Multi-sentence semantic role labeling | 1 |
| RTLLM [419] | RTL generation benchmark tasks. | Evaluate LLM-based RTL design generation. | 1 |
| SamSum [420] | 16k messenger-style dialogues with abstractive summaries. | Train dialogue summarization systems. | 1 |
| SBU Captions (SBU) [421] | 1 M Flickr-based image–caption pairs. | Large-scale image captioning research. | 1 |
| SceneGraphs (from GQA) [422] | 100k scene graphs of images for visual reasoning. | Support spatial and visual inference tasks. | 1 |
| Scoliosis Research Society (SRS) [423] | Educational, research, patient resources | Support spinal deformity care | 1 |
| SearchQA [424] | 140k QA pairs, 6.9 M snippets | QA simulating real web search | 1 |
| Self-Instruct [425] | LM-generated instruction examples | Support models on diverse self-generated directives | 1 |
| Sentiment-Specific Dialogue [184] | English dialogues labeled by sentiment. | Generate sentiment-controlled responses. | 1 |
| ServiceNow Internal Data [145] | Annotated queries with structured workflow JSON. | Translate NL requests into workflows. | 1 |
| SocialIQA (SIQA) [426] | 38,000 social-context multiple-choice QA pairs. | Test commonsense reasoning in social contexts. | 1 |
| SODA [427] | High-quality social dialogue examples | Enhance conversational fine-tuning | 1 |
| SQA [428] | Conversational QA over single Wikipedia tables | Compositional multi-column table QA. | 1 |
| SQuAD v2 [225] | 150k QAs plus 50k unanswerable questions on Wikipedia. | QA with answer/no-answer classification. | 1 |
| Stanford Sentiment Treebank (SST2) [429] | 215k phrases labeled for fine-grained sentiment. | Benchmark sentiment classification | 1 |
| StockQA [90] | 21k Chinese QA pairs from real stock-price sequences. | Train time-series reasoning for investor queries. | 1 |
| TACRED [430] | Adapted TACRED for zero/few-shot slot filling (41 types). | Benchmark relation extraction and slot filling. | 1 |
| TAM Questionnaire Response Set [59] | 30 students’ Likert-scale survey responses. | Evaluate user acceptance via factor/regression. | 1 |
| TFix [431] | 100k code error-fix pairs | Evaluate code repair models | 1 |
| The human cost of disasters (2000–2019) [432] | Global disaster human-impact records 2000–2019. | Analyze flood impacts for planning. | 1 |
| The Pile [339] | 825 GiB text from 22 sources | Pretrain diverse language models | 1 |
| The Stack [433] | 3 TB public source code from GitHub. | Pretrain and fine-tune code language models. | 1 |
| Tokyo Personal Activity Trajectory [164] | 100 users’ time-ordered GPS check-ins (2019–2022). | Model realistic human mobility patterns. | 1 |
| ToolQA [434] | Personal-agenda questions assessing external tool use. | Measure LLM integration of external tools in QA. | 1 |
| TopiOCQA (TCQA) [435] | QA over full Wikipedia with topic shifts. | Evaluate topic-transition conversational QA. | 1 |
| TREC-COVID [436] | Dynamic COVID-19 docs with topics and relevance labels. | Pandemic literature retrieval evaluation. | 1 |
| True/False dataset [57] | True/false statements on gene-disease and drug-disease. | Benchmark biomedical assertion verification. | 1 |
| UltraDomain—Agriculture [437] | 2,017,886 tokens from 12 college-agriculture texts | Evaluate RAG’s sense-making in agriculture domain | 1 |
| UltraDomain—CS [437] | 2,306,535 tokens from 10 computer-science texts | Test RAG on technical computer-science content | 1 |
| UltraDomain—Legal [437] | 5,081,069 tokens from 94 legal textbook documents | Benchmark RAG on complex legal language and reasoning | 1 |
| UltraDomain—Mixed [437] | 619,009 tokens across 61 humanities texts | Challenge RAG with heterogeneous humanities content | 1 |
| Unnatural Instructions [438] | Minimally human-curated challenging instructions | Augment instruction tuning diversity | 1 |
| UpToDate | Clinical decision support content by Wolters Kluwer. | Point-of-care medical reference. | 1 |
| VATEX [439] | 25,991 train, 9k val/test English video captions. | Multilingual and multi-modal captioning. | 1 |
| VerilogEval [440] | Verilog code generation tasks. | Assess LLM Verilog functional correctness. | 1 |
| VerilogEval-syntax [147] | 200+ clustered Verilog syntax error examples. | Test syntax-error correction in Verilog. | 1 |
| Visual Question Answering (VQA) [441] | 254,721 images with 760k questions and 10 M answers. | Visual QA tasks combining vision and language. | 1 |
| W3C-Email | Emails similar to GPT-Neo’s training distribution | Study retrieval-augmented memorization effects | 1 |
| Web Search | - | - | 1 |
| WebQA [442] | 34,200 train, 5000 val, 7500 test QA pairs; 390k images. | Multimodal web-based QA benchmarking. | 1 |
| Weibo [443] | 4.4 M post-response pairs from Sina Weibo. | Support short-text conversation models. | 1 |
| WikiPassageQA [444] | 4165 QA with long answer passages. | Reading comprehension with long answers. | 1 |
| Wikipedia (October 2017) [196] | Snapshot of English Wikipedia articles. | Historic Wikipedia text for NLP. | 1 |
| Wikipedia Evaluation (WikiEval) [445] | 50 Wikipedia pages covering diverse topics. | Evaluate retrieval-augmented systems. | 1 |
| Wikipedia Passages [446] | 6 M+ articles, 3.8 B words across languages (as of 2021). | Large-scale text corpus for NLP. | 1 |
| WinoGrande [447] | Pronoun-resolution tasks in complex contexts. | Assess coreference resolution capability. | 1 |
| WitQA [448] | 14k factual QA pairs on 32 relation types | Evaluate factual QA across relations | 1 |
| Wizard of the Internet (WizInt) [45] | 9633 dialogues, 93,665 utterances, 29,500 URLs. | Dialogue with live internet search. | 1 |
| WNED [449] | 320 documents with 6821 linkable mentions. | Evaluate entity linking systems. | 1 |
| Word-in-Context (WiC) [450] | Word-in-context disambiguation pairs. | Evaluate word sense disambiguation. | 1 |
| Worker and AI Collaboration for Natural Language Inference (WaNLI) [451] | 107,885 NLI examples combining human and GPT-3 data. | Natural language inference with AI mix. | 1 |
| Yelp Reviews [452] | 1.1 M+ reviews, 42k businesses, 400k tips, check-ins. | Recommendation and sentiment analysis. | 1 |
| Yelp 2021 [453] | Business attributes and reviews with detailed schema. | Data-to-text generation and hallucination tests. | 1 |
| ZINC-15 [454] | 1.54 B filtered SMILES strings | Virtual screening compound datasets | 1 |
References
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401.
- Li, H.; Su, Y.; Cai, D.; Wang, Y.; Liu, L. A Survey on Retrieval-Augmented Text Generation. arXiv 2022, arXiv:2202.01110.
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997.
- Gupta, S.; Ranjan, R.; Narayan Singh, S. A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. arXiv 2024, arXiv:2410.12837.
- Wu, S.; Xiong, Y.; Cui, Y.; Wu, H.; Chen, C.; Yuan, Y.; Huang, L.; Liu, X.; Kuo, T.W.; Guan, N.; et al. Retrieval-Augmented Generation for Natural Language Processing: A Survey. arXiv 2024, arXiv:2407.13193.
- Arslan, M.; Ghanem, H.; Munawar, S.; Cruz, C. A Survey on RAG with LLMs. Procedia Comput. Sci. 2024, 246, 3781–3790.
- Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.S.; Li, Q. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024.
- Cheng, M.; Luo, Y.; Ouyang, J.; Liu, Q.; Liu, H.; Li, L.; Yu, S.; Zhang, B.; Cao, J.; Ma, J.; et al. A Survey on Knowledge-Oriented Retrieval-Augmented Generation. arXiv 2025, arXiv:2503.10677.
- Arslan, M.; Munawar, S.; Cruz, C. Business insights using RAG–LLMs: A review and case study. J. Decis. Syst. 2024, 1–30.
- Hindi, M.; Mohammed, L.; Maaz, O.; Alwarafy, A. Enhancing the Precision and Interpretability of Retrieval-Augmented Generation (RAG) in Legal Technology: A Survey. IEEE Access 2025, 13, 46171–46189.
- Huang, Y.; Huang, J. A Survey on Retrieval-Augmented Text Generation for Large Language Models. arXiv 2024, arXiv:2404.10981.
- Zhao, S.; Yang, Y.; Wang, Z.; He, Z.; Qiu, L.K.; Qiu, L. Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely. arXiv 2024, arXiv:2409.14924.
- Verma, S. Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2409.13385.
- Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Geng, Y.; Fu, F.; Yang, L.; Zhang, W.; Jiang, J.; Cui, B. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv 2024, arXiv:2402.19473.
- Singh, A.; Ehtesham, A.; Kumar, S.; Talaei Khoei, T. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv 2025, arXiv:2501.09136.
- Peng, B.; Zhu, Y.; Liu, Y.; Bo, X.; Shi, H.; Hong, C.; Zhang, Y.; Tang, S. Graph Retrieval-Augmented Generation: A Survey. arXiv 2024, arXiv:2408.08921.
- Procko, T.T.; Ochoa, O. Graph Retrieval-Augmented Generation for Large Language Models: A Survey. In Proceedings of the 2024 Conference on AI, Science, Engineering, and Technology (AIxSET), Laguna Hills, CA, USA, 30 September–2 October 2024; pp. 166–169.
- Zhang, Q.; Chen, S.; Bei, Y.; Yuan, Z.; Zhou, H.; Hong, Z.; Dong, J.; Chen, H.; Chang, Y.; Huang, X. A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. arXiv 2025, arXiv:2501.13958.
- Mahdi Abootorabi, M.; Zobeiri, A.; Dehghani, M.; Mohammadkhani, M.; Mohammadi, B.; Ghahroodi, O.; Soleymani Baghshah, M.; Asgari, E. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. arXiv 2025, arXiv:2502.08826.
- Zheng, X.; Weng, Z.; Lyu, Y.; Jiang, L.; Xue, H.; Ren, B.; Paudel, D.; Sebe, N.; Van Gool, L.; Hu, X. Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook. arXiv 2025, arXiv:2503.18016.
- Simon, K.; Oğuz, C.; Leonid, K.; Muhammad, A.; Saara, A.; Selvine, M.; Daniel, G. Benchmarking of Retrieval Augmented Generation: A Comprehensive Systematic Literature Review on Evaluation Dimensions, Evaluation Metrics and Datasets. In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Porto, Portugal, 17–19 November 2024.
- Yu, H.; Gan, A.; Zhang, K.; Tong, S.; Liu, Q.; Liu, Z. Evaluation of Retrieval-Augmented Generation: A Survey. arXiv 2024, arXiv:2405.07437.
- Zhou, Y.; Liu, Y.; Li, X.; Jin, J.; Qian, H.; Liu, Z.; Li, C.; Dou, Z.; Ho, T.Y.; Yu, P.S. Trustworthiness in Retrieval-Augmented Generation Systems: A Survey. arXiv 2024, arXiv:2409.10102.
- Ni, B.; Liu, Z.; Wang, L.; Lei, Y.; Zhao, Y.; Cheng, X.; Zeng, Q.; Dong, L.; Xia, Y.; Kenthapadi, K.; et al. Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey. arXiv 2025, arXiv:2502.06872.
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Syst. Rev. 2021, 10, 89.
- Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; Keele University: Keele, UK, 2007; Volume 2.
- Sidiropoulos, G.; Kanoulas, E. Analysing the Robustness of Dual Encoders for Dense Retrieval Against Misspellings. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022.
- Kuratov, Y.; Bulatov, A.; Anokhin, P.; Rodkin, I.; Sorokin, D.; Sorokin, A.; Burtsev, M. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. arXiv 2024, arXiv:2406.10149.
- Alaofi, M.; Arabzadeh, N.; Clarke, C.L.A.; Sanderson, M. Generative Information Retrieval Evaluation. arXiv 2024, arXiv:2404.08137.
- Kumar, Y.; Marttinen, P. Improving Medical Multi-modal Contrastive Learning with Expert Annotations. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Milan, Italy, 2024; pp. 468–486.
- Wang, M.; Chen, L.; Cheng, F.; Liao, S.; Zhang, X.; Wu, B.; Yu, H.; Xu, N.; Zhang, L.; Luo, R.; et al. Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 5627–5646.
- Wu, J.; Zhu, J.; Qi, Y.; Chen, J.; Xu, M.; Menolascina, F.; Grau, V. Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. arXiv 2024, arXiv:2408.04187.
- Zheng, L.; Yin, L.; Xie, Z.; Sun, C.; Huang, J.; Yu, C.H.; Cao, S.; Kozyrakis, C.; Stoica, I.; Gonzalez, J.E.; et al. SGLang: Efficient Execution of Structured Language Model Programs. arXiv 2023, arXiv:2312.07104. [Google Scholar] [CrossRef]
- Arora, N.; Chakraborty, I.; Nishimura, Y. AI–Human Hybrids for Marketing Research: Leveraging Large Language Models (LLMs) as Collaborators. J. Mark. 2025, 89, 43–70. [Google Scholar] [CrossRef]
- Luu, R.K.; Buehler, M.J. BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials. Adv. Sci. 2024, 11, 2306724. [Google Scholar] [CrossRef]
- Zhang, B.; Soh, H. Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 9820–9836. [Google Scholar] [CrossRef]
- Liu, S.; Cheng, H.; Liu, H.; Zhang, H.; Li, F.; Ren, T.; Zou, X.; Yang, J.; Su, H.; Zhu, J.; et al. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. arXiv 2023, arXiv:2311.05437. [Google Scholar] [CrossRef]
- Gebreab, S.A.; Salah, K.; Jayaraman, R.; Rehman, M.H.u.; Ellaham, S. LLM-Based Framework for Administrative Task Automation in Healthcare. In Proceedings of the 2024 12th International Symposium on Digital Forensics and Security (ISDFS), San Antonio, TX, USA, 29–30 April 2024; pp. 1–7. [Google Scholar] [CrossRef]
- Loukas, L.; Stogiannidis, I.; Diamantopoulos, O.; Malakasiotis, P.; Vassos, S. Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking. In Proceedings of the Fourth ACM International Conference on AI in Finance, Brooklyn, NY, USA, 27–29 November 2023. [Google Scholar] [CrossRef]
- Buehler, M.J. MechGPT, a Language-Based Strategy for Mechanics and Materials Modeling That Connects Knowledge Across Scales, Disciplines, and Modalities. Appl. Mech. Rev. 2024, 76, 021001. [Google Scholar] [CrossRef]
- Chen, J.; Zhang, R.; Guo, J.; de Rijke, M.; Chen, W.; Fan, Y.; Cheng, X. Continual Learning for Generative Retrieval over Dynamic Corpora. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023. [Google Scholar] [CrossRef]
- Mao, Y.; He, P.; Liu, X.; Shen, Y.; Gao, J.; Han, J.; Chen, W. Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Bangkok, Thailand, 1–6 August 2021; pp. 4089–4100. [Google Scholar]
- Ram, O.; Levine, Y.; Dalmedigos, I.; Muhlgay, D.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. In-Context Retrieval-Augmented Language Models. Trans. Assoc. Comput. Linguist. 2023, 11, 1316–1331. [Google Scholar] [CrossRef]
- Xu, P.; Ping, W.; Wu, X.; McAfee, L.; Zhu, C.; Liu, Z.; Subramanian, S.; Bakhturina, E.; Shoeybi, M.; Catanzaro, B. Retrieval meets Long Context Large Language Models. arXiv 2023, arXiv:2310.03025. [Google Scholar] [CrossRef]
- Komeili, M.; Shuster, K.; Weston, J. Internet-Augmented Dialogue Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 8460–8478. [Google Scholar]
- Yan, S.Q.; Gu, J.C.; Zhu, Y.; Ling, Z.H. Corrective Retrieval Augmented Generation. arXiv 2024, arXiv:2401.15884. [Google Scholar] [CrossRef]
- Wang, Y.; Lipka, N.; Rossi, R.A.; Siu, A.; Zhang, R.; Derr, T. Knowledge Graph Prompting for Multi-Document Question Answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M., Dy, J., Natarajan, S., Eds.; Volume 38, pp. 19206–19214. [Google Scholar] [CrossRef]
- He, X.; Tian, Y.; Sun, Y.; Chawla, N.V.; Laurent, T.; LeCun, Y.; Bresson, X.; Hooi, B. G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering. arXiv 2024, arXiv:2402.07630. [Google Scholar] [CrossRef]
- Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; Chen, W. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. arXiv 2023, arXiv:2305.15294. [Google Scholar] [CrossRef]
- Zhang, F.; Chen, B.; Zhang, Y.; Keung, J.; Liu, J.; Zan, D.; Mao, Y.; Lou, J.G.; Chen, W. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 2471–2484. [Google Scholar] [CrossRef]
- Liu, S.; Chen, Y.; Xie, X.; Siow, J.; Liu, Y. Retrieval-augmented generation for code summarization via hybrid GNN. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Vienna, Austria, 3–7 May 2021. [Google Scholar]
- Yasunaga, M.; Aghajanyan, A.; Shi, W.; James, R.; Leskovec, J.; Liang, P.; Lewis, M.; Zettlemoyer, L.; Yih, W.T. Retrieval-Augmented Multimodal Language Modeling. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 39755–39769. [Google Scholar]
- Gui, L.; Wang, B.; Huang, Q.; Hauptmann, A.; Bisk, Y.; Gao, J. KAT: A Knowledge Augmented Transformer for Vision-and-Language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022), Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 956–968. [Google Scholar]
- Glass, M.; Rossiello, G.; Chowdhury, M.F.M.; Gliozzo, A. Robust Retrieval Augmented Generation for Zero-shot Slot Filling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Punta Cana, Dominican Republic, 7–11 November 2021; pp. 1939–1949. [Google Scholar]
- Sachan, D.S.; Reddy, S.; Hamilton, W.; Dyer, C.; Yogatama, D. End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 25968–25981. [Google Scholar]
- Ke, Y.; Jin, L.; Elangovan, K.; Rizal Abdullah, H.; Liu, N.; Sia, A.T.H.; Soh, C.R.; Tung, J.Y.M.; Ong, J.C.L.; Ting, D.S.W. Development and Testing of Retrieval Augmented Generation in Large Language Models—A Case Study Report. arXiv 2024, arXiv:2402.01733. [Google Scholar] [CrossRef]
- Soman, K.; Rose, P.W.; Morris, J.H.; Akbas, R.E.; Smith, B.; Peetoom, B.; Villouta-Reyes, C.; Cerono, G.; Shi, Y.; Rizk-Jackson, A.; et al. Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics 2024, 40, btae560. [Google Scholar] [CrossRef]
- Chen, W.; Hu, H.; Chen, X.; Verga, P.; Cohen, W.W. MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. arXiv 2022, arXiv:2210.02928. [Google Scholar] [CrossRef]
- Neumann, A.T.; Yin, Y.; Sowe, S.; Decker, S.; Jarke, M. An LLM-Driven Chatbot in Higher Education for Databases and Information Systems. IEEE Trans. Educ. 2025, 68, 103–116. [Google Scholar] [CrossRef]
- Xu, Z.; Jerome Cruz, M.; Guevara, M.; Wang, T.; Deshpande, M.; Wang, X.; Li, Z. Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering. arXiv 2024, arXiv:2404.17723. [Google Scholar] [CrossRef]
- Hoshi, Y.; Miyashita, D.; Ng, Y.; Tatsuno, K.; Morioka, Y.; Torii, O.; Deguchi, J. RALLE: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP 2023), Singapore, 6–10 December 2023; Feng, Y., Lefever, E., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 52–69. [Google Scholar]
- Jiang, W.; Zhang, S.; Han, B.; Wang, J.; Wang, B.; Kraska, T. PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design. arXiv 2024, arXiv:2403.05676. [Google Scholar] [CrossRef]
- Caffagni, D.; Cocchi, F.; Moratelli, N.; Sarto, S.; Cornia, M.; Baraldi, L.; Cucchiara, R. Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; pp. 1818–1826. [Google Scholar] [CrossRef]
- Guo, Y.; Li, Z.; Jin, X.; Liu, Y.; Zeng, Y.; Liu, W.; Li, X.; Yang, P.; Bai, L.; Guo, J.; et al. Retrieval-Augmented Code Generation for Universal Information Extraction. arXiv 2023, arXiv:2311.02962. [Google Scholar] [CrossRef]
- Xiong, G.; Jin, Q.; Lu, Z.; Zhang, A. Benchmarking Retrieval-Augmented Generation for Medicine. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 6233–6251. [Google Scholar] [CrossRef]
- Alsafari, B.; Atwell, E.; Walker, A.; Callaghan, M. Towards effective teaching assistants: From intent-based chatbots to LLM-powered teaching assistants. Nat. Lang. Process. J. 2024, 8, 100101. [Google Scholar] [CrossRef]
- Yu, C.; Yang, G.; Chen, X.; Liu, K.; Zhou, Y. BashExplainer: Retrieval-augmented bash code comment generation based on fine-tuned CodeBERT. In Proceedings of the 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), Limassol, Cyprus, 3–7 October 2022; pp. 82–93. [Google Scholar]
- Guo, T.; Yang, Q.; Wang, C.; Liu, Y.; Li, P.; Tang, J.; Li, D.; Wen, Y. KnowledgeNavigator: Leveraging large language models for enhanced reasoning over knowledge graph. Complex Intell. Syst. 2024, 10, 7063–7076. [Google Scholar] [CrossRef]
- Wiratunga, N.; Abeyratne, R.; Jayawardena, L.; Martin, K.; Massie, S.; Nkisi-Orji, I.; Weerasinghe, R.; Liret, A.; Fleisch, B. CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question Answering. In Case-Based Reasoning Research and Development; Springer Nature: Cham, Switzerland, 2024; pp. 445–460. [Google Scholar]
- Li, M.; Kilicoglu, H.; Xu, H.; Zhang, R. BiomedRAG: A retrieval augmented large language model for biomedicine. J. Biomed. Inform. 2025, 162, 104769. [Google Scholar] [CrossRef]
- Zhang, R.; Du, H.; Liu, Y.; Niyato, D.; Kang, J.; Sun, S.; Shen, X.; Poor, H.V. Interactive AI with Retrieval-Augmented Generation for Next Generation Networking. IEEE Netw. 2024, 38, 414–424. [Google Scholar] [CrossRef]
- Guo, Z.; Xia, L.; Yu, Y.; Ao, T.; Huang, C. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv 2024, arXiv:2410.05779. [Google Scholar] [CrossRef]
- Wu, D.; Ahmad, W.U.; Zhang, D.; Krishna Ramanathan, M.; Ma, X. Repoformer: Selective Retrieval for Repository-Level Code Completion. arXiv 2024, arXiv:2403.10059. [Google Scholar] [CrossRef]
- Chen, Z.; Xiang, Z.; Xiao, C.; Song, D.; Li, B. AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. arXiv 2024, arXiv:2407.12784. [Google Scholar] [CrossRef]
- Ren, Y.; Cao, Y.; Guo, P.; Fang, F.; Ma, W.; Lin, Z. Retrieve-and-Sample: Document-level Event Argument Extraction via Hybrid Retrieval Augmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 293–306. [Google Scholar] [CrossRef]
- Chowdhury, J.R.; Zhuang, Y.; Wang, S. Novelty Controlled Paraphrase Generation with Retrieval Augmented Conditional Prompt Tuning. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022, Virtual, 22 February–1 March 2022; Volume 36, pp. 10535–10544. [Google Scholar]
- Zhang, Z.; Fang, M.; Chen, L. RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering. arXiv 2024, arXiv:2402.16457. [Google Scholar] [CrossRef]
- Soong, D.; Sridhar, S.; Si, H.; Wagner, J.S.; Sá, A.C.C.; Yu, C.Y.; Karagoz, K.; Guan, M.; Kumar, S.; Hamadeh, H.; et al. Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model. PLoS Digit. Health 2024, 3, e0000568. [Google Scholar] [CrossRef] [PubMed]
- Jin, C.; Zhang, Z.; Jiang, X.; Liu, F.; Liu, X.; Liu, X.; Jin, X. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. arXiv 2024, arXiv:2404.12457. [Google Scholar] [CrossRef]
- Wang, W.; Wang, Y.; Joty, S.; Hoi, S.C. RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 3–9 December 2023; pp. 146–158. [Google Scholar]
- Sawarkar, K.; Mangal, A.; Solanki, S.R. Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers. In Proceedings of the 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 7–9 August 2024; pp. 155–161. [Google Scholar] [CrossRef]
- Ramos, R.; Elliott, D.; Martins, B. Retrieval-augmented Image Captioning. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; pp. 3666–3681. [Google Scholar] [CrossRef]
- Yang, Z.; Ping, W.; Liu, Z.; Korthikanti, V.; Nie, W.; Huang, D.A.; Fan, L.; Yu, Z.; Lan, S.; Li, B.; et al. Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 11844–11857. [Google Scholar]
- Chen, J.; Pan, Y.; Li, Y.; Yao, T.; Chao, H.; Mei, T. Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–24. [Google Scholar] [CrossRef]
- Tian, Y.; Song, H.; Wang, Z.; Wang, H.; Hu, Z.; Wang, F.; Chawla, N.V.; Xu, P. Graph Neural Prompting with Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M., Dy, J., Natarajan, S., Eds.; Association for the Advancement of Artificial Intelligence: Vancouver, BC, Canada, 2024; Volume 38, pp. 19080–19088. [Google Scholar] [CrossRef]
- Lin, W.; Byrne, B. Retrieval Augmented Visual Question Answering with Outside Knowledge. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 11238–11254. [Google Scholar]
- Hofstätter, S.; Chen, J.; Raman, K.; Zamani, H. FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023. [Google Scholar] [CrossRef]
- Feng, Z.; Feng, X.; Zhao, D.; Yang, M.; Qin, B. Retrieval-generation synergy augmented large language models. arXiv 2023, arXiv:2310.05149. [Google Scholar] [CrossRef]
- Jeong, C. A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture. arXiv 2023, arXiv:2309.01105. [Google Scholar] [CrossRef]
- Li, X.; Li, Z.; Shi, C.; Xu, Y.; Du, Q.; Tan, M.; Huang, J. AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 773–783. [Google Scholar]
- Xia, P.; Zhu, K.; Li, H.; Zhu, H.; Li, Y.; Li, G.; Zhang, L.; Yao, H. RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 1081–1093. [Google Scholar] [CrossRef]
- Sarto, S.; Cornia, M.; Baraldi, L.; Cucchiara, R. Retrieval-Augmented Transformer for Image Captioning. In Proceedings of the 19th International Conference on Content-Based Multimedia Indexing, Graz, Austria, 14–16 September 2022. [Google Scholar] [CrossRef]
- Siriwardhana, S.; Weerasekera, R.; Wen, E.; Kaluarachchi, T.; Rana, R.; Nanayakkara, S. Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering. arXiv 2022, arXiv:2210.02627. [Google Scholar] [CrossRef]
- Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Virtual, 19–23 April 2021; pp. 874–880. [Google Scholar] [CrossRef]
- Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving language models by retrieving from trillions of tokens. arXiv 2021, arXiv:2112.04426. [Google Scholar] [CrossRef]
- Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2023, arXiv:2310.11511. [Google Scholar] [CrossRef]
- Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.Y. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 2022, 23, bbac409. [Google Scholar] [CrossRef]
- Magesh, V.; Surani, F.; Dahl, M.; Suzgun, M.; Manning, C.D.; Ho, D.E. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. arXiv 2024, arXiv:2405.20362. [Google Scholar] [CrossRef]
- Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8696–8708. [Google Scholar] [CrossRef]
- Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; Karri, R. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. arXiv 2021, arXiv:2108.09293. [Google Scholar] [CrossRef]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv 2023, arXiv:2304.10592. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar] [CrossRef] [PubMed]
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191. [Google Scholar] [CrossRef]
- Anthropic. Chat with Claude. 2024. Available online: https://claude.ai/chats (accessed on 14 May 2025).
- BigScience Workshop; Le Scao, T.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Sasha Luccioni, A.; Yvon, F.; et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv 2022, arXiv:2211.05100. [Google Scholar] [CrossRef]
- DeepSeek-AI; Liu, A.; Feng, B.; Wang, B.; Wang, B.; Liu, B.; Zhao, C.; Dengr, C.; Ruan, C.; Dai, D.; et al. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv 2024, arXiv:2405.04434. [Google Scholar] [CrossRef]
- Wang, B.; Komatsuzaki, A. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. 2021. Available online: https://github.com/kingoflolz/mesh-transformer-jax (accessed on 14 May 2025).
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416. [Google Scholar] [CrossRef]
- Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403. [Google Scholar] [CrossRef]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- TheBloke. Llama 2 70B Chat—AWQ. 2023. Available online: https://huggingface.co/TheBloke/Llama-2-70B-Chat-AWQ (accessed on 14 May 2025).
- AI@Meta. Llama 3 Model Card. 2024. Available online: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md (accessed on 14 May 2025).
- Meta AI. Introducing Llama 3.1: Our Most Capable Models to Date. 2024. Available online: https://ai.meta.com/blog/meta-llama-3-1/ (accessed on 14 May 2025).
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Singh Chaplot, D.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
- Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Singh Chaplot, D.; de las Casas, D.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088. [Google Scholar] [CrossRef]
- Nomic AI. GPT4All: Private, Local AI Chatbot Platform by Nomic. 2025. Available online: https://www.nomic.ai/gpt4all (accessed on 14 May 2025).
- Liu, Z.; Ping, W.; Roy, R.; Xu, P.; Lee, C.; Shoeybi, M.; Catanzaro, B. ChatQA: Surpassing GPT-4 on Conversational QA and RAG. arXiv 2024, arXiv:2401.10225. [Google Scholar] [CrossRef]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- OpenAI. OpenAI Product. Available online: https://openai.com/product (accessed on 14 May 2025).
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Leoni Aleman, F.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- OpenAI; Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; et al. GPT-4o System Card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
- Jimeno Yepes, A.; You, Y.; Milczek, J.; Laverde, S.; Li, R. Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv 2024, arXiv:2402.05131. [Google Scholar] [CrossRef]
- Ge, J.; Sun, S.; Owens, J.; Galvez, V.; Gologorskaya, O.; Lai, J.C.; Pletcher, M.J.; Lai, K. Development of a Liver Disease-Specific Large Language Model Chat Interface using Retrieval Augmented Generation. medRxiv 2023. [Google Scholar] [CrossRef]
- Miao, J.; Thongprayoon, C.; Suppadungsuk, S.; Garcia Valencia, O.A.; Cheungpasitporn, W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina 2024, 60, 445. [Google Scholar] [CrossRef]
- Jiang, Z.; Ma, X.; Chen, W. LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. arXiv 2024, arXiv:2406.15319. [Google Scholar] [CrossRef]
- Jin, M.; Shahriar, S.; Tufano, M.; Shi, X.; Lu, S.; Sundaresan, N.; Svyatkovskiy, A. InferFix: End-to-End Program Repair with LLMs. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023), San Francisco, CA, USA, 3–9 December 2023; pp. 1646–1656. [Google Scholar] [CrossRef]
- Cheng, P.; Ding, Y.; Ju, T.; Wu, Z.; Du, W.; Yi, P.; Zhang, Z.; Liu, G. TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models. arXiv 2024, arXiv:2405.13401. [Google Scholar] [CrossRef]
- Rackauckas, Z. RAG-Fusion: A New Take on Retrieval-Augmented Generation. arXiv 2024, arXiv:2402.03367. [Google Scholar] [CrossRef]
- Dong, G.; Zhu, Y.; Zhang, C.; Wang, Z.; Dou, Z.; Wen, J.R. Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation. arXiv 2024, arXiv:2406.18676. [Google Scholar] [CrossRef]
- Wang, Z.; Araki, J.; Jiang, Z.; Parvez, M.R.; Neubig, G. Learning to Filter Context for Retrieval-Augmented Generation. arXiv 2023, arXiv:2311.08377. [Google Scholar] [CrossRef]
- Soudani, H.; Kanoulas, E.; Hasibi, F. Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, Tokyo, Japan, 9–12 December 2024. [Google Scholar] [CrossRef]
- Xu, S.; Pang, L.; Shen, H.; Cheng, X.; Chua, T.S. Search-in-the-Chain: Interactively Enhancing Large Language Models with Search for Knowledge-intensive Tasks. In Proceedings of the ACM Web Conference (WWW 2024), Singapore, 13–17 May 2024; Association for Computing Machinery: Singapore, 2024; pp. 1362–1373. [Google Scholar] [CrossRef]
- Ke, Z.; Kong, W.; Li, C.; Zhang, M.; Mei, Q.; Bendersky, M. Bridging the Preference Gap between Retrievers and LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 10438–10451. [Google Scholar] [CrossRef]
- Cuconasu, F.; Trappolini, G.; Siciliano, F.; Filice, S.; Campagnano, C.; Maarek, Y.; Tonellotto, N.; Silvestri, F. The Power of Noise: Redefining Retrieval for RAG Systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024. [Google Scholar] [CrossRef]
- Baek, J.; Jeong, S.; Kang, M.; Park, J.C.; Hwang, S.J. Knowledge-Augmented Language Model Verification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 1720–1736. [Google Scholar]
- Barnett, S.; Kurniawan, S.; Thudumu, S.; Brannelly, Z.; Abdelrazek, M. Seven Failure Points When Engineering a Retrieval Augmented Generation System. arXiv 2024, arXiv:2401.05856. [Google Scholar] [CrossRef]
- Adolphs, L.; Shuster, K.; Urbanek, J.; Szlam, A.; Weston, J. Reason first, then respond: Modular Generation for Knowledge-infused Dialogue. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7141–7161. [Google Scholar]
- Guo, Y.; Qiu, W.; Leroy, G.; Wang, S.; Cohen, T. Retrieval augmentation of large language models for lay language generation. J. Biomed. Inform. 2024, 149, 104580. [Google Scholar] [CrossRef]
- Shi, Z.; Zhang, S.; Sun, W.; Gao, S.; Ren, P.; Chen, Z.; Ren, Z. Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 7339–7353. [Google Scholar] [CrossRef]
- Jiang, Z.; Xu, F.F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active Retrieval Augmented Generation. arXiv 2023, arXiv:2305.06983. [Google Scholar] [CrossRef]
- Su, W.; Tang, Y.; Ai, Q.; Wu, Z.; Liu, Y. DRAGIN: Dynamic Retrieval Augmented Generation based on the Information Needs of Large Language Models. arXiv 2024, arXiv:2403.10081. [Google Scholar] [CrossRef]
- Béchard, P.; Marquez Ayala, O. Reducing hallucination in structured outputs via Retrieval-Augmented Generation. arXiv 2024, arXiv:2404.08189. [Google Scholar] [CrossRef]
- Li, J.; Liu, Y.; Fan, W.; Wei, X.Y.; Liu, H.; Tang, J.; Li, Q. Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective. arXiv 2023, arXiv:2306.06615. [Google Scholar] [CrossRef]
- Tsai, Y.; Liu, M.; Ren, H. RTLFixer: Automatically Fixing RTL Syntax Errors with Large Language Model. In Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 23–27 June 2024. [Google Scholar] [CrossRef]
- Matsumoto, N.; Moran, J.; Choi, H.; Hernandez, M.E.; Venkatesan, M.; Wang, P.; Moore, J.H. KRAGEN: A knowledge graph-enhanced RAG framework for biomedical problem solving using large language models. Bioinformatics 2024, 40, btae353. [Google Scholar] [CrossRef]
- Zeng, S.; Zhang, J.; He, P.; Liu, Y.; Xing, Y.; Xu, H.; Ren, J.; Chang, Y.; Wang, S.; Yin, D.; et al. The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG). In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 4505–4524. [Google Scholar] [CrossRef]
- Yu, H.; Guo, P.; Sano, A. Zero-Shot ECG Diagnosis with Large Language Models and Retrieval-Augmented Generation. In Proceedings of the Machine Learning for Health (ML4H), PMLR, New Orleans, LA, USA, 10 December 2023; pp. 650–663. [Google Scholar]
- Jin, J.; Zhu, Y.; Dong, G.; Zhang, Y.; Yang, X.; Zhang, C.; Zhao, T.; Yang, Z.; Dou, Z.; Wen, J.R. FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research. arXiv 2024, arXiv:2405.13576. [Google Scholar] [CrossRef]
- Wang, B.; Ping, W.; Xu, P.; McAfee, L.; Liu, Z.; Shoeybi, M.; Dong, Y.; Kuchaiev, O.; Li, B.; Xiao, C.; et al. Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 7763–7786. [Google Scholar]
- Hu, Y.; Lei, Z.; Zhang, Z.; Pan, B.; Ling, C.; Zhao, L. GRAG: Graph Retrieval-Augmented Generation. arXiv 2024, arXiv:2405.16506. [Google Scholar] [CrossRef]
- Levonian, Z.; Li, C.; Zhu, W.; Gade, A.; Henkel, O.; Postle, M.E.; Xing, W. Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference. arXiv 2023, arXiv:2310.03184. [Google Scholar] [CrossRef]
- Yu, W. Retrieval-augmented generation across heterogeneous knowledge. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, Seattle, WA, USA and Online, 10–15 July 2022; pp. 52–58. [Google Scholar]
- Du, X.; Ji, H. Retrieval-Augmented Generative Question Answering for Event Argument Extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 4649–4666. [Google Scholar] [CrossRef]
- Di Palma, D. Retrieval-Augmented Recommender System: Enhancing Recommender Systems with Large Language Models. In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023. [Google Scholar] [CrossRef]
- Jeong, M.; Sohn, J.; Sung, M.; Kang, J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 2024, 40, i119–i129. [Google Scholar] [CrossRef]
- Yu, W.; Zhang, H.; Pan, X.; Cao, P.; Ma, K.; Li, J.; Wang, H.; Yu, D. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 14672–14685. [Google Scholar] [CrossRef]
- Wang, Z.; Liu, A.; Lin, H.; Li, J.; Ma, X.; Liang, Y. RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation. arXiv 2024, arXiv:2403.05313. [Google Scholar] [CrossRef]
- Wu, Y.; Zhu, J.; Xu, S.; Shum, K.; Niu, C.; Zhong, R.; Song, J.; Zhang, T. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. arXiv 2023, arXiv:2401.00396. [Google Scholar] [CrossRef]
- Wang, X.; Wang, Z.; Gao, X.; Zhang, F.; Wu, Y.; Xu, Z.; Shi, T.; Wang, Z.; Li, S.; Qian, Q.; et al. Searching for Best Practices in Retrieval-Augmented Generation. arXiv 2024, arXiv:2407.01219. [Google Scholar] [CrossRef]
- Cheng, X.; Luo, D.; Chen, X.; Liu, L.; Zhao, D.; Yan, R. Lift Yourself Up: Retrieval-augmented Text Generation with Self Memory. arXiv 2023, arXiv:2305.02437. [Google Scholar] [CrossRef]
- Wang, J.; Jiang, R.; Yang, C.; Wu, Z.; Onizuka, M.; Shibasaki, R.; Koshizuka, N.; Xiao, C. Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation. arXiv 2024, arXiv:2402.14744. [Google Scholar] [CrossRef]
- Yang, Y.; Xu, C.; Guo, J.; Feng, T.; Ruan, C. Improving the RAG-based Personalized Discharge Care System by Introducing the Memory Mechanism. Preprints 2024. [Google Scholar] [CrossRef]
- Baek, J.; Chandrasekaran, N.; Cucerzan, S.; Herring, A.; Jauhar, S.K. Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion. In Proceedings of the ACM Web Conference, Singapore, 13–17 May 2024. [Google Scholar] [CrossRef]
- Parvez, M.R.; Ahmad, W.U.; Chakraborty, S.; Ray, B.; Chang, K.W. Retrieval Augmented Code Generation and Summarization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2719–2734. [Google Scholar]
- Tian, Z.; Bi, W.; Li, X.; Zhang, N.L. Learning to abstract for memory-augmented conversational response generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28 July–2 August 2019; pp. 3816–3825. [Google Scholar]
- Cheng, X.; Wang, X.; Zhang, X.; Ge, T.; Chen, S.Q.; Wei, F.; Zhang, H.; Zhao, D. xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token. arXiv 2024, arXiv:2405.13792. [Google Scholar] [CrossRef]
- Shi, E.; Wang, Y.; Tao, W.; Du, L.; Zhang, H.; Han, S.; Zhang, D.; Sun, H. RACE: Retrieval-augmented Commit Message Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5520–5530. [Google Scholar] [CrossRef]
- Thulke, D.; Daheim, N.; Dugast, C.; Ney, H. Efficient Retrieval Augmented Generation from Unstructured Knowledge for Task-Oriented Dialog. arXiv 2021, arXiv:2102.04643. [Google Scholar] [CrossRef]
- Alkhalaf, M.; Yu, P.; Yin, M.; Deng, C. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. J. Biomed. Inform. 2024, 156, 104662. [Google Scholar] [CrossRef] [PubMed]
- Ranjit, M.; Ganapathy, G.; Manuel, R.; Ganu, T. Retrieval Augmented Chest X-Ray Report Generation using OpenAI GPT models. arXiv 2023, arXiv:2305.03660. [Google Scholar] [CrossRef]
- Dixit, T.; Paranjape, B.; Hajishirzi, H.; Zettlemoyer, L. CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2964–2984. [Google Scholar]
- Salemi, A.; Zamani, H. Evaluating Retrieval Quality in Retrieval-Augmented Generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024. [Google Scholar] [CrossRef]
- Tang, Y.; Yang, Y. MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. arXiv 2024, arXiv:2401.15391. [Google Scholar] [CrossRef]
- Xue, J.; Zheng, M.; Hu, Y.; Liu, F.; Chen, X.; Lou, Q. BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models. arXiv 2024, arXiv:2406.00083. [Google Scholar] [CrossRef]
- Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Wooldridge, M., Dy, J., Natarajan, S., Eds.; Volume 38, pp. 17754–17762. [Google Scholar] [CrossRef]
- Deng, G.; Liu, Y.; Wang, K.; Li, Y.; Zhang, T.; Liu, Y. Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. arXiv 2024, arXiv:2402.08416. [Google Scholar] [CrossRef]
- Wu, K.; Wu, E.; Zou, J. ClashEval: Quantifying the tug-of-war between an LLM’s internal prior and external evidence. arXiv 2024, arXiv:2404.10198. [Google Scholar] [CrossRef]
- Chen, J.; Hu, X.; Li, Z.; Gao, C.; Xia, X.; Lo, D. Code Search is All You Need? Improving Code Suggestions with Code Search. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024. [Google Scholar] [CrossRef]
- Liu, Y.; Peng, X.; Zhang, X.; Liu, W.; Yin, J.; Cao, J.; Du, T. RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 4730–4749. [Google Scholar] [CrossRef]
- Kresevic, S.; Giuffrè, M.; Ajcevic, M.; Accardo, A.; Crocè, L.S.; Shung, D.L. Optimization of hepatological clinical guidelines interpretation by large language models: A retrieval augmented generation-based framework. NPJ Digit. Med. 2024, 7, 102. [Google Scholar] [CrossRef]
- Su, Y.; Wang, Y.; Cai, D.; Baker, S.; Korhonen, A.; Collier, N. PROTOTYPE-TO-STYLE: Dialogue Generation with Style-Aware Editing on Retrieval Memory. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2152–2161. [Google Scholar] [CrossRef]
- Shi, W.; Zhuang, Y.; Zhu, Y.; Iwinski, H.; Wattenbarger, M.; Wang, M.D. Retrieval-Augmented Large Language Models for Adolescent Idiopathic Scoliosis Patients in Shared Decision-Making. In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Houston, TX, USA, 3–6 September 2023. [Google Scholar] [CrossRef]
- Colverd, G.; Darm, P.; Silverberg, L.; Kasmanoff, N. FloodBrain: Flood Disaster Reporting by Web-based Retrieval Augmented Generation with an LLM. arXiv 2023, arXiv:2311.02597. [Google Scholar] [CrossRef]
- Saad-Falcon, J.; Khattab, O.; Potts, C.; Zaharia, M. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. arXiv 2023, arXiv:2311.09476. [Google Scholar] [CrossRef]
- Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv 2023, arXiv:2309.15217. [Google Scholar] [CrossRef]
- Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang, W.; Wu, H.; Liu, H.; Xu, T.; Chen, E.; et al. CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models. arXiv 2024, arXiv:2401.17043. [Google Scholar] [CrossRef]
- Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguist. 2019, 7, 452–466. [Google Scholar] [CrossRef]
- Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; Deng, L. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv 2016, arXiv:1611.09268. [Google Scholar]
- Butler, U. Open Australian Legal Corpus. 2025. Available online: https://huggingface.co/datasets/isaacus/open-australian-legal-corpus (accessed on 14 May 2025).
- Tuggener, D.; von Däniken, P.; Peetz, T.; Cieliebak, M. LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 1235–1241. [Google Scholar]
- Wang, L.L.; Lo, K.; Chandrasekhar, Y.; Reas, R.; Yang, J.; Burdick, D.; Eide, D.; Funk, K.; Katsis, Y.; Kinney, R.M.; et al. CORD-19: The COVID-19 Open Research Dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, 5–10 July 2020. [Google Scholar]
- Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.; Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2567–2577. [Google Scholar] [CrossRef]
- Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2369–2380. [Google Scholar] [CrossRef]
- Ho, X.; Duong Nguyen, A.K.; Sugawara, S.; Aizawa, A. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 6609–6625. [Google Scholar] [CrossRef]
- Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollar, P.; Zitnick, C.L. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv 2015, arXiv:1504.00325. [Google Scholar] [CrossRef]
- Husain, H.; Wu, H.H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv 2019, arXiv:1909.09436. [Google Scholar] [CrossRef]
- Wilmot, D.; Keller, F. Memory and Knowledge Augmented Language Models for Inferring Salience in Long-Form Stories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 851–865. [Google Scholar] [CrossRef]
- Chan, C.M.; Xu, C.; Yuan, R.; Luo, H.; Xue, W.; Guo, Y.; Fu, J. RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation. arXiv 2024, arXiv:2404.00610. [Google Scholar] [CrossRef]
- Asai, A.; Gardner, M.; Hajishirzi, H. Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022), Seattle, WA, USA, 10–15 July 2022; pp. 2226–2243. [Google Scholar]
- Abdulrahman Alawwad, H.; Alhothali, A.; Naseem, U.; Alkhathlan, A.; Jamal, A. Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation. arXiv 2024, arXiv:2402.05128. [Google Scholar] [CrossRef]
- Chaudhari, H.; Severi, G.; Abascal, J.; Jagielski, M.; Choquette-Choo, C.A.; Nasr, M.; Nita-Rotaru, C.; Oprea, A. Phantom: General Trigger Attacks on Retrieval Augmented Language Generation. arXiv 2024, arXiv:2405.20485. [Google Scholar] [CrossRef]
- Qi, Z.; Zhang, H.; Xing, E.; Kakade, S.; Lakkaraju, H. Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems. arXiv 2024, arXiv:2402.17840. [Google Scholar] [CrossRef]
- Ovadia, O.; Brief, M.; Mishaeli, M.; Elisha, O. Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. arXiv 2023, arXiv:2312.05934. [Google Scholar] [CrossRef]
- Salemi, A.; Kallumadi, S.; Zamani, H. Optimization Methods for Personalizing Large Language Models through Retrieval Augmentation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024. [Google Scholar] [CrossRef]
- Li, Z.; Li, C.; Zhang, M.; Mei, Q.; Bendersky, M. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. arXiv 2024, arXiv:2407.16833. [Google Scholar] [CrossRef]
- Lobentanzer, S.; Feng, S.; Bruderer, N.; Maier, A.; Díaz, A.G.; Strange, A.; Ismail, A.; Kulaga, A.; Dugourd, A.; Zdrazil, B.; et al. A platform for the biomedical application of large language models. Nat. Biotechnol. 2025, 43, 166–169. [Google Scholar] [CrossRef] [PubMed]
- Joshi, M.; Choi, E.; Weld, D.; Zettlemoyer, L. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1601–1611. [Google Scholar] [CrossRef]
- Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. MuSiQue: Multihop Questions via Single-hop Question Composition. Trans. Assoc. Comput. Linguist. 2022, 10, 539–554. [Google Scholar] [CrossRef]
- Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. FEVER: A Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 809–819. [Google Scholar] [CrossRef]
- Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Trans. Assoc. Comput. Linguist. 2021, 9, 346–361. [Google Scholar] [CrossRef]
- Dinan, E.; Roller, S.; Shuster, K.; Fan, A.; Auli, M.; Weston, J. Wizard of Wikipedia: Knowledge-Powered Conversational agents. arXiv 2018, arXiv:1811.01241. [Google Scholar] [CrossRef]
- Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1533–1544. [Google Scholar]
- Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; Tafjord, O. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv 2018, arXiv:1803.05457. [Google Scholar] [CrossRef]
- Fan, A.; Jernite, Y.; Perez, E.; Grangier, D.; Weston, J.; Auli, M. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3558–3567. [Google Scholar] [CrossRef]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. arXiv 2020, arXiv:2009.03300. [Google Scholar] [CrossRef]
- Kočiský, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, K.M.; Melis, G.; Grefenstette, E. The NarrativeQA Reading Comprehension Challenge. arXiv 2017, arXiv:1712.07040. [Google Scholar] [CrossRef]
- Mallen, A.; Asai, A.; Zhong, V.; Das, R.; Khashabi, D.; Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. arXiv 2022, arXiv:2212.10511. [Google Scholar] [CrossRef]
- Yih, W.T.; Richardson, M.; Meek, C.; Chang, M.W.; Suh, J. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016. [Google Scholar]
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6769–6781. [Google Scholar] [CrossRef]
- Stelmakh, I.; Luan, Y.; Dhingra, B.; Chang, M.W. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 8273–8288. [Google Scholar] [CrossRef]
- Mihaylov, T.; Clark, P.; Khot, T.; Sabharwal, A. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. arXiv 2018, arXiv:1809.02789. [Google Scholar] [CrossRef]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 2383–2392. [Google Scholar] [CrossRef]
- Elsahar, H.; Vougiouklis, P.; Remaci, A.; Gravier, C.; Hare, J.; Laforest, F.; Simperl, E. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 3214–3252. [Google Scholar] [CrossRef]
- Levy, O.; Seo, M.; Choi, E.; Zettlemoyer, L. Zero-Shot Relation Extraction via Reading Comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 333–342. [Google Scholar] [CrossRef]
- Reddy, S.; Chen, D.; Manning, C.D. CoQA: A Conversational Question Answering Challenge. arXiv 2018, arXiv:1808.07042. [Google Scholar] [CrossRef]
- Bai, Y.; Lv, X.; Zhang, J.; Lyu, H.; Tang, J.; Huang, Z.; Du, Z.; Liu, X.; Zeng, A.; Hou, L.; et al. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv 2023, arXiv:2308.14508. [Google Scholar] [CrossRef]
- Bisk, Y.; Zellers, R.; Le Bras, R.; Gao, J.; Choi, Y. PIQA: Reasoning about Physical Commonsense in Natural Language. Proc. AAAI Conf. Artif. Intell. 2020, 34, 7432–7439. [Google Scholar] [CrossRef]
- Dasigi, P.; Lo, K.; Beltagy, I.; Cohan, A.; Smith, N.A.; Gardner, M. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. arXiv 2021, arXiv:2105.03011. [Google Scholar] [CrossRef]
- Guo, Q.; Cao, S.; Yi, Z. A medical question answering system using large language models and knowledge graphs. Int. J. Intell. Syst. 2022, 37, 8548–8564. [Google Scholar] [CrossRef]
- Hayashi, H.; Budania, P.; Wang, P.; Ackerson, C.; Neervannan, R.; Neubig, G. WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Trans. Assoc. Comput. Linguist. 2021, 9, 211–225. [Google Scholar] [CrossRef]
- Yang, Y.; Yih, W.T.; Meek, C. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015. [Google Scholar]
- Press, O.; Zhang, M.; Min, S.; Schmidt, L.; Smith, N.A.; Lewis, M. Measuring and Narrowing the Compositionality Gap in Language Models. arXiv 2022, arXiv:2210.03350. [Google Scholar] [CrossRef]
- Krithara, A.; Nentidis, A.; Bougiatiotis, K.; Paliouras, G. BioASQ-QA: A manually curated corpus for Biomedical Question Answering. Sci. Data 2023, 10, 170. [Google Scholar] [CrossRef] [PubMed]
- Clark, C.; Lee, K.; Chang, M.W.; Kwiatkowski, T.; Collins, M.; Toutanova, K. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2924–2936. [Google Scholar] [CrossRef]
- See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1073–1083. [Google Scholar] [CrossRef]
- Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv 2021, arXiv:2102.04664. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv 2019, arXiv:1910.10683. [Google Scholar] [CrossRef]
- Wenzek, G.; Lachaux, M.A.; Conneau, A.; Chaudhary, V.; Guzmán, F.; Joulin, A.; Grave, E. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4003–4012. [Google Scholar]
- Talmor, A.; Herzig, J.; Lourie, N.; Berant, J. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4149–4158. [Google Scholar] [CrossRef]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar] [CrossRef]
- Conover, M.; Hayes, M.; Mathur, A.; Xie, J.; Wan, J.; Shah, S.; Ghodsi, A.; Wendell, P.; Zaharia, M.; Xin, R. Databricks-Dolly-15K. 2023. Available online: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm (accessed on 14 May 2025).
- Saha, S.; Yadav, P.; Bauer, L.; Bansal, M. ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7716–7740. [Google Scholar] [CrossRef]
- Jia, X.; Gavves, E.; Fernando, B.; Tuytelaars, T. Guiding Long-Short Term Memory for Image Caption Generation. arXiv 2015, arXiv:1509.04942. [Google Scholar] [CrossRef]
- Luo, M.; Zeng, Y.; Banerjee, P.; Baral, C. Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6417–6431. [Google Scholar] [CrossRef]
- Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4791–4800. [Google Scholar] [CrossRef]
- Ferguson, J.; Gardner, M.; Hajishirzi, H.; Khot, T.; Dasigi, P. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1137–1147. [Google Scholar] [CrossRef]
- Schuhmann, C.; Vencu, R.; Beaumont, R.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Coombes, T.; Jitsev, J.; Komatsuzaki, A. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv 2021, arXiv:2111.02114. [Google Scholar] [CrossRef]
- Talmor, A.; Yoran, O.; Catav, A.; Lahav, D.; Wang, Y.; Asai, A.; Ilharco, G.; Hajishirzi, H.; Berant, J. MultiModalQA: Complex Question Answering over Text, Tables and Images. arXiv 2021, arXiv:2104.06039. [Google Scholar] [CrossRef]
- Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. arXiv 2019, arXiv:1906.00067. [Google Scholar] [CrossRef]
- Zhang, T.; Luo, H.; Chuang, Y.S.; Fang, W.; Gaitskell, L.; Hartvigsen, T.; Wu, X.; Fox, D.; Meng, H.; Glass, J. Interpretable Unified Language Checking. arXiv 2023, arXiv:2304.03728. [Google Scholar] [CrossRef]
- PubMed Database. 1996. Available online: https://pubmed.ncbi.nlm.nih.gov/ (accessed on 14 May 2025).
- Zhong, M.; Yin, D.; Yu, T.; Zaidi, A.; Mutuma, M.; Jha, R.; Awadallah, A.H.; Celikyilmaz, A.; Liu, Y.; Qiu, X.; et al. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5905–5921. [Google Scholar] [CrossRef]
- Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; Choi, Y. Defending against neural fake news. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; p. 812. [Google Scholar]
- WikiData. Available online: https://www.wikidata.org/ (accessed on 14 May 2025).
- Izacard, G.; Lewis, P.; Lomeli, M.; Hosseini, L.; Petroni, F.; Schick, T.; Dwivedi-Yu, J.; Joulin, A.; Riedel, S.; Grave, E. Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv 2022, arXiv:2208.03299. [Google Scholar] [CrossRef]
- Li, S.; Ji, H.; Han, J. Document-Level Event Argument Extraction by Conditional Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 894–908. [Google Scholar] [CrossRef]
- Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. Pointer Sentinel Mixture Models. arXiv 2016, arXiv:1609.07843. [Google Scholar] [CrossRef]
- Craswell, N.; Mitra, B.; Yilmaz, E.; Campos, D.; Voorhees, E.M. Overview of the TREC 2019 deep learning track. arXiv 2020, arXiv:2003.07820. [Google Scholar] [CrossRef]
- Craswell, N.; Mitra, B.; Yilmaz, E.; Campos, D.F.; Voorhees, E.M. Overview of the TREC 2020 Deep Learning Track. arXiv 2021, arXiv:2102.07662. [Google Scholar] [CrossRef]
- Doddington, G.; Mitchell, A.; Przybocki, M.; Ramshaw, L.; Strassel, S.; Weischedel, R. The Automatic Content Extraction (ACE) Program—Tasks, Data, and Evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal, 26–28 May 2004. [Google Scholar]
- Krishna, R.; Hata, K.; Ren, F.; Fei-Fei, L.; Niebles, J.C. Dense-Captioning Events in Videos. arXiv 2017, arXiv:1705.00754. [Google Scholar] [CrossRef]
- Gurulingappa, H.; Rajput, A.M.; Roberts, A.; Fluck, J.; Hofmann-Apitius, M.; Toldo, L. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J. Biomed. Inform. 2012, 45, 885–892. [Google Scholar] [CrossRef] [PubMed]
- Lu, W.; Zeng, Z.; Wang, J.; Lu, Z.; Chen, Z.; Zhuang, H.; Chen, C. Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge. arXiv 2024, arXiv:2404.05880. [Google Scholar] [CrossRef]
- Nie, Y.; Williams, A.; Dinan, E.; Bansal, M.; Weston, J.; Kiela, D. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4885–4901. [Google Scholar] [CrossRef]
- Mao, J.; Ye, J.; Qian, Y.; Pavone, M.; Wang, Y. A Language Agent for Autonomous Driving. arXiv 2023, arXiv:2311.10813. [Google Scholar] [CrossRef]
- Zhang, X.; Zhao, J.; LeCun, Y. Character-level Convolutional Networks for Text Classification. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Hoffart, J.; Yosef, M.A.; Bordino, I.; Fürstenau, H.; Pinkal, M.; Spaniol, M.; Taneva, B.; Thater, S.; Weikum, G. Robust Disambiguation of Named Entities in Text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, 27–31 July 2011; pp. 782–792. [Google Scholar]
- Xiao, Y.; Hou, Y.; Zhou, H.; Diallo, G.; Fiszman, M.; Wolfson, J.; Kilicoglu, H.; Chen, Y.; Su, C.; Xu, H.; et al. Repurposing Non-pharmacological Interventions for Alzheimer’s Diseases through Link Prediction on Biomedical Literature. medRxiv 2023. [Google Scholar] [CrossRef] [PubMed]
- Romano, J.D.; Truong, V.; Kumar, R.; Venkatesan, M.; Graham, B.E.; Hao, Y.; Matsumoto, N.; Li, X.; Wang, Z.; Ritchie, M.D.; et al. The Alzheimer’s Knowledge Base: A Knowledge Graph for Alzheimer Disease Research. J. Med. Internet Res. 2024, 26, e46777. [Google Scholar] [CrossRef]
- Dong, L.; Huang, S.; Wei, F.; Lapata, M.; Zhou, M.; Xu, K. Learning to Generate Product Reviews from Attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, 3–7 April 2017; Volume 1, pp. 623–632. [Google Scholar]
- McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013. [Google Scholar] [CrossRef]
- Min, S.; Michael, J.; Hajishirzi, H.; Zettlemoyer, L. AmbigQA: Answering Ambiguous Open-domain Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 5783–5797. [Google Scholar] [CrossRef]
- Penzel, T.; Moody, G.B.; Mark, R.G.; Goldberger, A.L.; Peter, J.H. The apnea-ECG database. In Proceedings of the Computers in Cardiology 2000, Vol.27 (Cat. 00CH37163), Cambridge, MA, USA, 24–27 September 2000; pp. 255–258. [Google Scholar]
- Oard, D.; Webber, W.; Kirsch, D.; Golitsynskiy, S. Avocado Research Email Collection; Linguistic Data Consortium: Philadelphia, PA, USA, 2015. [Google Scholar]
- Parrish, A.; Chen, A.; Nangia, N.; Padmakumar, V.; Phang, J.; Thompson, J.; Htut, P.M.; Bowman, S. BBQ: A hand-built bias benchmark for question answering. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 2086–2105. [Google Scholar] [CrossRef]
- Sharma, E.; Li, C.; Wang, L. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2204–2213. [Google Scholar] [CrossRef]
- Microsoft. Bing. 2009. Available online: https://www.bing.com/ (accessed on 14 May 2025).
- Min, S.; Krishna, K.; Lyu, X.; Lewis, M.; Yih, W.t.; Koh, P.; Iyyer, M.; Zettlemoyer, L.; Hajishirzi, H. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12076–12100. [Google Scholar] [CrossRef]
- Mungall, C.J.; McMurry, J.A.; Köhler, S.; Balhoff, J.P.; Borromeo, C.; Brush, M.; Carbon, S.; Conlin, T.; Dunn, N.; Engelstad, M.; et al. The Monarch Initiative: An integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2017, 45, D712–D722. [Google Scholar] [CrossRef] [PubMed]
- Chalkidis, I.; Jana, A.; Hartung, D.; Bommarito, M.; Androutsopoulos, I.; Katz, D.; Aletras, N. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 4310–4330. [Google Scholar] [CrossRef]
- Bondarenko, M.; Kerr, D.; Sorichetta, A.; Tatem, A. Census/projection-disaggregated gridded population datasets for 189 countries in 2020 using Built-Settlement Growth Model (BSGM) outputs [Dataset]. University of Southampton, Southampton, UK, 2020. Available online: https://www.worldpop.org/doi/10.5258/SOTON/WP00684 (accessed on 14 May 2025).
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar] [CrossRef]
- Edwards, C.; Zhai, C.; Ji, H. Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 595–607. [Google Scholar] [CrossRef]
- Taboureau, O.; Nielsen, S.; Audouze, K.; Weinhold, N.; Edsgärd, D.; Roque, F.; Kouskoumvekaki, I.; Bora, A.; Curpan, R.; Jensen, T.; et al. ChemProt: A disease chemical biology database. Nucleic Acids Res. 2010, 39, D367–D372. [Google Scholar] [CrossRef]
- Chen, Z.; Hernández Cano, A.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; et al. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. arXiv 2023, arXiv:2311.16079. [Google Scholar] [CrossRef]
- Tufano, M.; Watson, C.; Bavota, G.; Di Penta, M.; White, M.; Poshyvanyk, D. An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. arXiv 2018, arXiv:1812.08693. [Google Scholar] [CrossRef]
- Liu, C.; Xia, X.; Lo, D.; Liu, Z.; Hassan, A.E.; Li, S. CodeMatcher: Searching Code Based on Sequential Semantics of Important Query Words. ACM Trans. Softw. Eng. Methodol. 2021, 31, 12. [Google Scholar] [CrossRef]
- CodeParrot. github-jupyter. 2022. Available online: https://huggingface.co/datasets/codeparrot/github-jupyter (accessed on 14 May 2025).
- Speer, R.; Chin, J.; Havasi, C. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. Proc. AAAI Conf. Artif. Intell. 2017, 31, 4444–4451. [Google Scholar] [CrossRef]
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts. arXiv 2021, arXiv:2102.08981. [Google Scholar] [CrossRef]
- Iyer, S.; Konstas, I.; Cheung, A.; Zettlemoyer, L. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1643–1652. [Google Scholar] [CrossRef]
- Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 142–147. [Google Scholar]
- Roth, D.; Yih, W.t. A Linear Programming Formulation for Global Inference in Natural Language Tasks. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004, Boston, MA, USA, 6–7 May 2004; pp. 1–8. [Google Scholar]
- Wu, C.S.; Madotto, A.; Liu, W.; Fung, P.; Xiong, C. QAConv: Question Answering on Informative Conversations. arXiv 2021, arXiv:2105.06912. [Google Scholar] [CrossRef]
- Chen, Z.; Li, S.; Smiley, C.; Ma, Z.; Shah, S.; Wang, W.Y. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 6279–6292. [Google Scholar] [CrossRef]
- Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S. COYO-700M: Image-Text Pair Dataset. arXiv 2022, arXiv:2303.03378. [Google Scholar]
- Onoe, Y.; Zhang, M.J.Q.; Choi, E.; Durrett, G. CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge. arXiv 2021, arXiv:2109.01653. [Google Scholar] [CrossRef]
- Ding, Y.; Wang, Z.; Ahmad, W.U.; Ding, H.; Tan, M.; Jain, N.; Krishna Ramanathan, M.; Nallapati, R.; Bhatia, P.; Roth, D.; et al. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. arXiv 2023, arXiv:2310.11248. [Google Scholar] [CrossRef]
- Talmor, A.; Yoran, O.; Le Bras, R.; Bhagavatula, C.; Goldberg, Y.; Choi, Y.; Berant, J. CommonsenseQA 2.0: Exposing the Limits of AI through Gamification. arXiv 2022, arXiv:2201.05320. [Google Scholar] [CrossRef]
- Baudiš, P.; Šedivý, J. Modeling of the Question Answering Task in the YodaQA System. In Experimental IR Meets Multilinguality, Multimodality, and Interaction; Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G., San Juan, E., Capellato, L., Ferro, N., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 222–228. [Google Scholar]
- Ramesh, V.; Chi, N.A.; Rajpurkar, P. CXR-PRO: MIMIC-CXR with Prior References Omitted (version 1.0.0). PhysioNet 2022. [Google Scholar] [CrossRef]
- Satyapanich, T.; Ferraro, F.; Finin, T. CASIE: Extracting Cybersecurity Event Information from Text. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8749–8757. [Google Scholar] [CrossRef]
- Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; Niu, S. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Nagoya, Japan, 27 November–1 December 2017; pp. 986–995. [Google Scholar]
- Just, R.; Jalali, D.; Ernst, M.D. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, San Jose, CA, USA, 21–25 July 2014. [Google Scholar] [CrossRef]
- DIG Minecraft. 2025. Available online: https://www.digminecraft.com/ (accessed on 14 May 2025).
- Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; Gardner, M. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2368–2378. [Google Scholar] [CrossRef]
- Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; Nakamura, S. Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 574–584. [Google Scholar] [CrossRef]
- Feng, S.; Wan, H.; Gunasekara, C.; Patel, S.; Joshi, S.; Lastras, L. doc2dial: A Goal-Oriented Document-Grounded Dialogue Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 8118–8128. [Google Scholar] [CrossRef]
- Wang, S.; Liu, J.; Song, S.; Cheng, J.; Fu, Y.; Guo, P.; Fang, K.; Zhu, Y.; Dou, Z. DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation. arXiv 2024, arXiv:2406.05654. [Google Scholar] [CrossRef]
- Campos, J.A.; Otegi, A.; Soroa, A.; Deriu, J.; Cieliebak, M.; Agirre, E. DoQA—Accessing Domain-Specific FAQs via Conversational QA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 7302–7314. [Google Scholar] [CrossRef]
- Segura-Bedmar, I.; Martínez, P.; Herrero-Zazo, M. SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 14–15 June 2013; pp. 341–350. [Google Scholar]
- DynaMed. 2025. Available online: https://www.dynamed.com/ (accessed on 14 May 2025).
- Shi, W.; Xu, R.; Zhuang, Y.; Yu, Y.; Zhang, J.; Wu, H.; Zhu, Y.; Ho, J.; Yang, C.; Wang, M.D. EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 22315–22339. [Google Scholar] [CrossRef]
- Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; Liu, B. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. Proc. AAAI Conf. Artif. Intell. 2018, 32, 730–738. [Google Scholar] [CrossRef]
- Zhang, X.; Chen, Y.; Hu, S.; Xu, Z.; Chen, J.; Khai Hao, M.; Han, X.; Leng Thai, Z.; Wang, S.; Liu, Z.; et al. ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. arXiv 2024, arXiv:2402.13718. [Google Scholar] [CrossRef]
- Mensink, T.; Uijlings, J.; Castrejon, L.; Goel, A.; Cadar, F.; Zhou, H.; Sha, F.; Araujo, A.; Ferrari, V. Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories. arXiv 2023, arXiv:2306.09224. [Google Scholar] [CrossRef]
- Sciavolino, C.; Zhong, Z.; Lee, J.; Chen, D. Simple Entity-Centric Questions Challenge Dense Retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6138–6148. [Google Scholar] [CrossRef]
- European Association for the Study of the Liver. EASL recommendations on treatment of hepatitis C: Final update of the series. J. Hepatol. 2020, 73, 1170–1218. [Google Scholar] [CrossRef]
- Narayan, S.; Cohen, S.B.; Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1797–1807. [Google Scholar] [CrossRef]
- Facebook Books Dataset. 2018. Available online: https://github.com/sisinflab/LinkedDatasets/tree/master/facebook_book (accessed on 14 May 2025).
- Aly, R.; Guo, Z.; Schlichtkrull, M.; Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Cocarascu, O.; Mittal, A. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. arXiv 2021, arXiv:2106.05707. [Google Scholar] [CrossRef]
- Park, J.; Min, S.; Kang, J.; Zettlemoyer, L.; Hajishirzi, H. FaVIQ: FAct Verification from Information-seeking Questions. arXiv 2021, arXiv:2107.02153. [Google Scholar] [CrossRef]
- Kim, J.; Park, S.; Kwon, Y.; Jo, Y.; Thorne, J.; Choi, E. FactKG: Fact Verification via Reasoning on Knowledge Graphs. arXiv 2023, arXiv:2305.06590. [Google Scholar] [CrossRef]
- Lee, N.; Ping, W.; Xu, P.; Patwary, M.; Fung, P.N.; Shoeybi, M.; Catanzaro, B. Factuality Enhanced Language Models for Open-Ended Text Generation. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
- Kalyan, A.; Kumar, A.; Chandrasekaran, A.; Sabharwal, A.; Clark, P. How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7318–7328. [Google Scholar] [CrossRef]
- Islam, P.; Kannappan, A.; Kiela, D.; Qian, R.; Scherrer, N.; Vidgen, B. FinanceBench: A New Benchmark for Financial Question Answering. arXiv 2023, arXiv:2311.11944. [Google Scholar] [CrossRef]
- Jiang, K.; Wu, D.; Jiang, H. FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 318–323. [Google Scholar] [CrossRef]
- Vu, T.; Iyyer, M.; Wang, X.; Constant, N.; Wei, J.; Wei, J.; Tar, C.; Sung, Y.H.; Zhou, D.; Le, Q.; et al. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 13697–13720. [Google Scholar] [CrossRef]
- Zong, Y.; Qiu, X. GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 8817–8825. [Google Scholar] [CrossRef]
- Su, Y.; Cai, D.; Wang, Y.; Baker, S.; Korhonen, A.; Collier, N.; Liu, X. Stylistic Dialogue Generation via Information-Guided Reinforcement Learning Strategy. arXiv 2020, arXiv:2004.02202. [Google Scholar] [CrossRef]
- Li, M.; Zhou, H.; Zhang, R. Benchmarking Large Language Models in Biomedical Triple Extraction. arXiv 2023, arXiv:2310.18463. [Google Scholar] [CrossRef]
- Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-aided Language Models. arXiv 2022, arXiv:2211.10435. [Google Scholar] [CrossRef]
- Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training Verifiers to Solve Math Word Problems. arXiv 2021, arXiv:2110.14168. [Google Scholar] [CrossRef]
- Zhou, Y.; Tan, C. Investigating the Effect of Natural Language Explanations on Out-of-Distribution Generalization in Few-shot NLI. In Proceedings of the Second Workshop on Insights from Negative Results in NLP, Punta Cana, Dominican Republic, 10 November 2021; pp. 117–124. [Google Scholar] [CrossRef]
- Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv 2020, arXiv:2101.00027. [Google Scholar] [CrossRef]
- Harvard Law Case Corpus. 2024. Available online: https://case.law/ (accessed on 14 May 2025).
- Luo, Y.; Shi, M.; Osama Khan, M.; Muneeb Afzal, M.; Huang, H.; Yuan, S.; Tian, Y.; Song, L.; Kouhana, A.; Elze, T.; et al. FairCLIP: Harnessing Fairness in Vision-Language Learning. arXiv 2024, arXiv:2403.19949. [Google Scholar] [CrossRef]
- Li, Y.; Li, Z.; Zhang, K.; Dan, R.; Jiang, S.; Zhang, Y. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. arXiv 2023, arXiv:2303.14070. [Google Scholar] [CrossRef] [PubMed]
- Ling, W.; Blunsom, P.; Grefenstette, E.; Hermann, K.M.; Kočiský, T.; Wang, F.; Senior, A. Latent Predictor Networks for Code Generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 599–609. [Google Scholar] [CrossRef]
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Ponde de Oliveira Pinto, H.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
- Liu, J.; Xia, C.S.; Wang, Y.; Zhang, L. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Nakamura, K.; Levy, S.; Tuan, Y.L.; Chen, W.; Wang, W.Y. HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 481–492. [Google Scholar] [CrossRef]
- IMDb. IMDb Non-Commercial Datasets. 2024. Available online: https://developer.imdb.com/non-commercial-datasets/ (accessed on 14 May 2025).
- Infineon Developer Community. Developer Community Forum Questions. 1999. Available online: https://community.infineon.com/ (accessed on 14 May 2025).
- Infineon Technologies. XENSIV™—Sensing the World: Sensor Solutions for Automotive, Industrial, Consumer and IoT Applications. Available online: https://www.infineon.com/cms/en/product/sensor/mems-microphones/ (accessed on 14 May 2025).
- Chen, Y.; Hu, H.; Luan, Y.; Sun, H.; Changpinyo, S.; Ritter, A.; Chang, M.W. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 14948–14968. [Google Scholar] [CrossRef]
- Wu, Z.; Parish, R.; Cheng, H.; Min, S.; Ammanabrolu, P.; Ostendorf, M.; Hajishirzi, H. InSCIt: Information-Seeking Conversations with Mixed-Initiative Interactions. Trans. Assoc. Comput. Linguist. 2023, 11, 453–468. [Google Scholar] [CrossRef]
- Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2016, 23, 304–310. [Google Scholar] [CrossRef]
- Steinberger, R.; Pouliquen, B.; Widiger, A.; Ignat, C.; Erjavec, T.; Tufiş, D.; Varga, D. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 22–28 May 2006. [Google Scholar]
- Petroni, F.; Piktus, A.; Fan, A.; Lewis, P.; Yazdani, M.; De Cao, N.; Thorne, J.; Jernite, Y.; Karpukhin, V.; Maillard, J.; et al. KILT: A Benchmark for Knowledge Intensive Language Tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 2523–2544. [Google Scholar] [CrossRef]
- Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, N.Q.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1525–1534. [Google Scholar] [CrossRef]
- Salemi, A.; Mysore, S.; Bendersky, M.; Zamani, H. LaMP: When Large Language Models Meet Personalization. arXiv 2023, arXiv:2304.11406. [Google Scholar] [CrossRef]
- Guha, N.; Nyarko, J.; Ho, D.E.; Ré, C.; Chilton, A.; Narayana, A.; Chohlas-Wood, A.; Peters, A.; Waldon, B.; Rockmore, D.N.; et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. arXiv 2023, arXiv:2308.11462. [Google Scholar] [CrossRef]
- Shuster, K.; Urbanek, J.; Dinan, E.; Szlam, A.; Weston, J. Deploying Lifelong Open-Domain Dialogue Learning. arXiv 2020, arXiv:2008.08076. [Google Scholar] [CrossRef]
- Ben Abacha, A.; Agichtein, E.; Pinter, Y.; Demner-Fushman, D. Overview of the Medical Question Answering Task at TREC 2017 LiveQA. In Proceedings of the Text REtrieval Conference, Gaithersburg, MD, USA, 15–17 November 2017. [Google Scholar]
- Lyft_2021. 2021. Available online: https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf (accessed on 14 May 2025).
- Yue, X.; Ni, Y.; Zhang, K.; Zheng, T.; Liu, R.; Zhang, G.; Stevens, S.; Jiang, D.; Ren, W.; Sun, Y.; et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. arXiv 2023, arXiv:2311.16502. [Google Scholar] [CrossRef]
- Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.W.; Galley, M.; Gao, J. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv 2023, arXiv:2310.02255. [Google Scholar] [CrossRef]
- MTsample. Available online: https://mtsamples.com/ (accessed on 14 May 2025).
- Abacha, A.B.; Mrabet, Y.; Sharp, M.; Goodwin, T.R.; Shooshan, S.E.; Demner-Fushman, D. Bridging the Gap Between Consumers’ Medication Questions and Trusted Answers. Stud. Health Technol. Inform. 2019, 264, 25–29. [Google Scholar] [CrossRef] [PubMed]
- Zhang, X.; Tian, C.; Yang, X.; Chen, L.; Li, Z.; Petzold, L.R. AlpaCare: Instruction-tuned Large Language Models for Medical Application. arXiv 2023, arXiv:2310.14558. [Google Scholar] [CrossRef]
- Pal, A.; Umapathi, L.K.; Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. In Proceedings of the Conference on Health, Inference, and Learning, Virtual, 7–8 April 2022. [Google Scholar]
- Jin, D.; Pan, E.; Oufattole, N.; Weng, W.H.; Fang, H.; Szolovits, P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Appl. Sci. 2021, 11, 6421. [Google Scholar] [CrossRef]
- Zhang, Y.; Dai, H.; Kozareva, Z.; Smola, A.; Song, L. Variational Reasoning for Question Answering with Knowledge Graph. Proc. AAAI Conf. Artif. Intell. 2018, 32, 6069–6076. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar] [CrossRef]
- Dolan, B.; Quirk, C.; Brockett, C. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. In Proceedings of the COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; pp. 350–356. [Google Scholar]
- Chen, D.; Dolan, W. Collecting Highly Parallel Data for Paraphrase Evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 190–200. [Google Scholar]
- Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5288–5296. [Google Scholar]
- Johnson, A.E.W.; Pollard, T.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.y.; Peng, Y.; Lu, Z.; Mark, R.G.; Berkowitz, S.J.; Horng, S. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv 2019, arXiv:1901.07042. [Google Scholar] [CrossRef]
- Minecraft Wiki. Available online: https://minecraft.wiki/ (accessed on 14 May 2025).
- Sen, P.; Aji, A.F.; Saffari, A. Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 1604–1619. [Google Scholar]
- Liu, Y.; Duan, H.; Zhang, Y.; Li, B.; Zhang, S.; Zhao, W.; Yuan, Y.; Wang, J.; He, C.; Liu, Z.; et al. MMBench: Is Your Multi-modal Model an All-around Player? arXiv 2023, arXiv:2307.06281. [Google Scholar] [CrossRef]
- Fang, Y.; Liang, X.; Zhang, N.; Liu, K.; Huang, R.; Chen, Z.; Fan, X.; Chen, H. Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models. arXiv 2023, arXiv:2306.08018. [Google Scholar] [CrossRef]
- Austin, J.; Odena, A.; Nye, M.; Bosma, M.; Michalewski, H.; Dohan, D.; Jiang, E.; Cai, C.; Terry, M.; Le, Q.; et al. Program Synthesis with Large Language Models. arXiv 2021, arXiv:2108.07732. [Google Scholar] [CrossRef]
- MovieLens. 1998. Available online: https://grouplens.org/datasets/movielens/ (accessed on 14 May 2025).
- Boecking, B.; Usuyama, N.; Bannur, S.; Castro, D.C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al. Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. arXiv 2022, arXiv:2204.09817. [Google Scholar] [CrossRef]
- Eric, M.; Goel, R.; Paul, S.; Sethi, A.; Agarwal, S.; Gao, S.; Kumar, A.; Goyal, A.; Ku, P.; Hakkani-Tur, D. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 422–428. [Google Scholar]
- Williams, A.; Nangia, N.; Bowman, S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 1112–1122. [Google Scholar] [CrossRef]
- Tao, W.; Wang, Y.; Shi, E.; Du, L.; Han, S.; Zhang, H.; Zhang, D.; Zhang, W. On the Evaluation of Commit Message Generation Models: An Experimental Study. arXiv 2021, arXiv:2107.05373. [Google Scholar] [CrossRef]
- Khashabi, D.; Chaturvedi, S.; Roth, M.; Upadhyay, S.; Roth, D. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 252–262. [Google Scholar] [CrossRef]
- Fu, C.; Chen, P.; Shen, Y.; Qin, Y.; Zhang, M.; Lin, X.; Yang, J.; Zheng, X.; Li, K.; Sun, X.; et al. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv 2023, arXiv:2306.13394. [Google Scholar] [CrossRef]
- Lin, X.V.; Wang, C.; Zettlemoyer, L.; Ernst, M.D. NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Agarwal, M.; Chakraborti, T.; Fu, Q.; Gros, D.; Lin, X.V.; Maene, J.; Talamadupula, K.; Teng, Z.; White, J. NeurIPS 2020 NLC2CMD Competition: Translating Natural Language to Bash Commands. arXiv 2021, arXiv:2103.02523. [Google Scholar] [CrossRef]
- Riedel, S.; Yao, L.; McCallum, A. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
- Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni, A.; Bachman, P.; Suleman, K. NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, BC, Canada, 3 August 2017; pp. 191–200. [Google Scholar] [CrossRef]
- Agrawal, H.; Desai, K.; Wang, Y.; Chen, X.; Jain, R.; Johnson, M.; Batra, D.; Parikh, D.; Lee, S.; Anderson, P. nocaps: Novel object captioning at scale. arXiv 2018, arXiv:1812.08658. [Google Scholar] [CrossRef]
- Bhattacharya, D.; Aronsohn, A.; Price, J.; Lo Re, V. Hepatitis C Guidance 2023 Update: AASLD-IDSA Recommendations for Testing, Managing, and Treating Hepatitis C Virus Infection. Clin. Infect. Dis. 2023, ciad319. [Google Scholar] [CrossRef]
- Lee, K.; Chang, M.W.; Toutanova, K. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6086–6096. [Google Scholar] [CrossRef]
- Marecek, L.; Anthony-Smith, M.; Mathis, A.H. Prealgebra 2e; OpenStax: Houston, TX, USA, 2020. [Google Scholar]
- OpenStreetMap Contributors. Planet Dump. 2017. Available online: https://planet.osm.org (accessed on 14 May 2025).
- Dong, Q.; Wan, X.; Cao, Y. ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase Generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 424–434. [Google Scholar] [CrossRef]
- PubMed Central (PMC) Full-Text Articles. Available online: https://www.ncbi.nlm.nih.gov/pmc/ (accessed on 14 May 2025).
- Li, Y.; Du, Y.; Zhou, K.; Wang, J.; Zhao, X.; Wen, J.R. Evaluating Object Hallucination in Large Vision-Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 292–305. [Google Scholar] [CrossRef]
- Smith, S.; Patwary, M.; Norick, B.; LeGresley, P.; Rajbhandari, S.; Casper, J.; Liu, Z.; Prabhumoye, S.; Zerveas, G.; Korthikanti, V.; et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. arXiv 2022, arXiv:2201.11990. [Google Scholar] [CrossRef]
- Lewis, P.; Wu, Y.; Liu, L.; Minervini, P.; Küttler, H.; Piktus, A.; Stenetorp, P.; Riedel, S. PAQ: 65 Million Probably-Asked Questions and What You Can Do with Them. Trans. Assoc. Comput. Linguist. 2021, 9, 1098–1115. [Google Scholar] [CrossRef]
- Wagner, P.; Strodthoff, N.; Bousseljot, R.D.; Kreiseler, D.; Lunze, F.I.; Samek, W.; Schaeffter, T. PTB-XL, a large publicly available electrocardiography dataset. Sci. Data 2020, 7, 154. [Google Scholar] [CrossRef] [PubMed]
- Strodthoff, N.; Mehari, T.; Nagel, C.; Aston, P.J.; Sundar, A.; Graff, C.; Kanters, J.K.; Haverkamp, W.; Dössel, O.; Loewe, A.; et al. PTB-XL+, a comprehensive electrocardiographic feature dataset. Sci. Data 2023, 10, 279. [Google Scholar] [CrossRef]
- Ge, T.; Hu, J.; Wang, L.; Wang, X.; Chen, S.Q.; Wei, F. In-context Autoencoder for Context Compression in a Large Language Model. arXiv 2023, arXiv:2307.06945. [Google Scholar] [CrossRef]
- Miceli Barone, A.V.; Sennrich, R. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv 2017, arXiv:1707.02275. [Google Scholar] [CrossRef]
- Bahrami, M.; Shrikanth, N.C.; Ruangwan, S.; Liu, L.; Mizobuchi, Y.; Fukuyori, M.; Chen, W.P.; Munakata, K.; Menzies, T. PyTorrent: A Python Library Corpus for Large-scale Language Models. arXiv 2021, arXiv:2110.01710. [Google Scholar] [CrossRef]
- Anantha, R.; Vakulenko, S.; Tu, Z.; Longpre, S.; Pulman, S.; Chappidi, S. Open-Domain Question Answering Goes Conversational via Question Rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 520–534. [Google Scholar] [CrossRef]
- Rogers, A.; Kovaleva, O.; Downey, M.; Rumshisky, A. Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8722–8731. [Google Scholar] [CrossRef]
- Pang, R.Y.; Parrish, A.; Joshi, N.; Nangia, N.; Phang, J.; Chen, A.; Padmakumar, V.; Ma, J.; Thompson, J.; He, H.; et al. QuALITY: Question Answering with Long Input Texts, Yes! arXiv 2021, arXiv:2112.08608. [Google Scholar] [CrossRef]
- Tafjord, O.; Gardner, M.; Lin, K.; Clark, P. QuaRTz: An Open-Domain Dataset of Qualitative Relationship Questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5941–5946. [Google Scholar] [CrossRef]
- Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.T.; Choi, Y.; Liang, P.; Zettlemoyer, L. QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2174–2184. [Google Scholar] [CrossRef]
- Hosking, T.; Lapata, M. Factorising Meaning and Form for Intent-Preserving Paraphrasing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 1405–1418. [Google Scholar] [CrossRef]
- Gupta, A.; Agarwal, A.; Singh, P.; Rai, P. A Deep Generative Framework for Paraphrase Generation. Proc. AAAI Conf. Artif. Intell. 2018, 32, 5149–5156. [Google Scholar] [CrossRef]
- Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 785–794. [Google Scholar] [CrossRef]
- ParticleMedia. RAGTruth. Available online: https://github.com/ParticleMedia/RAGTruth (accessed on 14 May 2025).
- Zhang, S.; Liu, X.; Liu, J.; Gao, J.; Duh, K.; Van Durme, B. ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension. arXiv 2018, arXiv:1810.12885. [Google Scholar] [CrossRef]
- Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y.; Smith, N.A. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 3356–3369. [Google Scholar] [CrossRef]
- Völske, M.; Potthast, M.; Syed, S.; Stein, B. TL;DR: Mining Reddit to Learn Automatic Summarization. In Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark, 7 September 2017; pp. 59–63. [Google Scholar] [CrossRef]
- Lin, B.Y.; Wu, Z.; Yang, Y.; Lee, D.H.; Ren, X. RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 1504–1515. [Google Scholar] [CrossRef]
- Ebner, S.; Xia, P.; Culkin, R.; Rawlins, K.; Van Durme, B. Multi-Sentence Argument Linking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8057–8077. [Google Scholar] [CrossRef]
- Lu, Y.; Liu, S.; Zhang, Q.; Xie, Z. RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model. arXiv 2023, arXiv:2308.05345. [Google Scholar] [CrossRef]
- Gliwa, B.; Mochol, I.; Biesek, M.; Wawer, A. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, Hong Kong, China, 4 November 2019; pp. 70–79. [Google Scholar] [CrossRef]
- Ordonez, V.; Kulkarni, G.; Berg, T. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Granada Congress and Exhibition Centre, Granada, Spain, 12–17 December 2011. [Google Scholar]
- Hudson, D.A.; Manning, C.D. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv 2019, arXiv:1902.09506. [Google Scholar] [CrossRef]
- Scoliosis Research Society. 1966. Available online: https://www.srs.org/ (accessed on 14 May 2025).
- Dunn, M.; Sagun, L.; Higgins, M.; Ugur Guney, V.; Cirik, V.; Cho, K. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv 2017, arXiv:1704.05179. [Google Scholar] [CrossRef]
- Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv 2022, arXiv:2212.10560. [Google Scholar] [CrossRef]
- Sap, M.; Rashkin, H.; Chen, D.; Le Bras, R.; Choi, Y. Social IQa: Commonsense Reasoning about Social Interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 4463–4473. [Google Scholar] [CrossRef]
- Kim, H.; Hessel, J.; Jiang, L.; West, P.; Lu, X.; Yu, Y.; Zhou, P.; Bras, R.; Alikhani, M.; Kim, G.; et al. SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12930–12949. [Google Scholar] [CrossRef]
- Pasupat, P.; Liang, P. Compositional Semantic Parsing on Semi-Structured Tables. arXiv 2015, arXiv:1508.00305. [Google Scholar] [CrossRef]
- Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
- Alt, C.; Gabryszak, A.; Hennig, L. TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1558–1569. [Google Scholar] [CrossRef]
- Berabi, B.; He, J.; Raychev, V.; Vechev, M.T. TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
- Centre for Research on the Epidemiology of Disasters (CRED); United Nations Office for Disaster Risk Reduction (UNDRR). The Human Cost of Disasters (2000–2019); UNDRR: Geneva, Switzerland, 2020. [Google Scholar]
- Kocetkov, D.; Li, R.; Ben Allal, L.; Li, J.; Mou, C.; Muñoz Ferrandis, C.; Jernite, Y.; Mitchell, M.; Hughes, S.; Wolf, T.; et al. The Stack: 3 TB of permissively licensed source code. arXiv 2022, arXiv:2211.15533. [Google Scholar] [CrossRef]
- Zhuang, Y.; Yu, Y.; Wang, K.; Sun, H.; Zhang, C. ToolQA: A Dataset for LLM Question Answering with External Tools. arXiv 2023, arXiv:2306.13304. [Google Scholar] [CrossRef]
- Adlakha, V.; Dhuliawala, S.; Suleman, K.; de Vries, H.; Reddy, S. TopiOCQA: Open-domain Conversational Question Answering with Topic Switching. Trans. Assoc. Comput. Linguist. 2022, 10, 468–483. [Google Scholar] [CrossRef]
- Voorhees, E.; Alam, T.; Bedrick, S.; Demner-Fushman, D.; Hersh, W.R.; Lo, K.; Roberts, K.; Soboroff, I.; Wang, L.L. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. arXiv 2020, arXiv:2005.04474. [Google Scholar] [CrossRef]
- Qian, H.; Liu, Z.; Zhang, P.; Mao, K.; Lian, D.; Dou, Z.; Huang, T. MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation. arXiv 2024, arXiv:2409.05591. [Google Scholar] [CrossRef]
- Honovich, O.; Scialom, T.; Levy, O.; Schick, T. Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 14409–14428. [Google Scholar] [CrossRef]
- Wang, X.; Wu, J.; Chen, J.; Li, L.; Wang, Y.F.; Wang, W.Y. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. arXiv 2019, arXiv:1904.03493. [Google Scholar] [CrossRef]
- Liu, M.; Pinckney, N.; Khailany, B.; Ren, H. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. arXiv 2023, arXiv:2309.07544. [Google Scholar] [CrossRef]
- Agrawal, A.; Lu, J.; Antol, S.; Mitchell, M.; Zitnick, C.L.; Batra, D.; Parikh, D. VQA: Visual Question Answering. arXiv 2015, arXiv:1505.00468. [Google Scholar] [CrossRef]
- Chang, Y.; Narang, M.; Suzuki, H.; Cao, G.; Gao, J.; Bisk, Y. WebQA: Multihop and Multimodal QA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Ernest N. Morial Convention Center, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Shang, L.; Lu, Z.; Li, H. Neural Responding Machine for Short-Text Conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 1577–1586. [Google Scholar] [CrossRef]
- Cohen, D.; Yang, L.; Croft, W.B. WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval. arXiv 2018, arXiv:1805.03797. [Google Scholar] [CrossRef]
- WikiEval. 2023. Available online: https://huggingface.co/datasets/explodinggradients/WikiEval (accessed on 14 May 2025).
- Asai, A.; Yu, X.; Kasai, J.; Hajishirzi, H. One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021. [Google Scholar]
- Sakaguchi, K.; Le Bras, R.; Bhagavatula, C.; Choi, Y. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8732–8740. [Google Scholar] [CrossRef]
- Maekawa, S.; Iso, H.; Gurajada, S.; Bhutani, N. Retrieval Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 5506–5521. [Google Scholar] [CrossRef]
- Tedeschi, S.; Conia, S.; Cecconi, F.; Navigli, R. Named Entity Recognition for Entity Linking: What Works and What's Next. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2584–2596. [Google Scholar] [CrossRef]
- Pilehvar, M.T.; Camacho-Collados, J. WiC: The Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 1267–1273. [Google Scholar] [CrossRef]
- Liu, A.; Swayamdipta, S.; Smith, N.A.; Choi, Y. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 6826–6847. [Google Scholar] [CrossRef]
- Asghar, N. Yelp Dataset Challenge: Review Rating Prediction. arXiv 2016, arXiv:1605.05362. [Google Scholar] [CrossRef]
- Yelp. Yelp Open Dataset. Available online: https://business.yelp.com/data/resources/open-dataset/ (accessed on 14 May 2025).
- Irwin, J.J.; Sterling, T.; Mysinger, M.M.; Bolstad, E.S.; Coleman, R.G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757–1768. [Google Scholar] [CrossRef] [PubMed]




| Index | Research Question | Goal |
|---|---|---|
| RQ1 | What thematic topics have been addressed by highly cited RAG studies? | Summarises the main topics in the field, outlining the state of knowledge and identifying gaps in the literature. |
| RQ2 | What innovative methods and approaches extend the standard RAG framework? | Provides an overview of current research, assisting researchers and engineers in identifying common methodologies, existing studies, and novel approaches. |
| RQ3 | What metrics are most frequently used to evaluate the effectiveness of RAG systems? | Identifies relevant metrics to support meaningful comparative analyses, essential for benchmarking and advancing the field. |
| RQ4 | What challenges and limitations are associated with RAG techniques? | Highlights research gaps and opportunities for proposing solutions or suggesting areas for further exploration. |

| Database | Query |
|---|---|
| ACM Digital Library | Title: (retrieval AND augmented AND generation) OR Abstract: (retrieval AND augmented AND generation) |
| IEEE Xplore | (“Document Title”: retrieval augmented generation) OR (“Publication Title”: retrieval augmented generation) OR (“Abstract”: retrieval augmented generation) |
| Scopus | TITLE-ABS-KEY (retrieval AND augmented AND generation) |
| ScienceDirect | Title, abstract, keywords: retrieval AND augmented AND generation |
| DBLP | retrieval augmented generation |
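
For illustration only, the DBLP row above is the easiest of the five searches to reproduce programmatically, since DBLP exposes a public publication-search API (https://dblp.org/search/publ/api). The sketch below is not part of the review protocol reported in this article: the endpoint and JSON layout are DBLP's, the year filter merely mirrors the review's date-based inclusion rule, and the citation-count criterion would still have to be checked against a separate source (e.g., Google Scholar or Scopus), because DBLP does not expose citation counts.

```python
import json
import urllib.parse
import urllib.request

# Illustrative sketch only: run the DBLP query from the table above via
# DBLP's public publication-search API and keep works from 2020 onwards.
API = "https://dblp.org/search/publ/api"

def dblp_search(query: str, hits: int = 100) -> list:
    """Return the raw 'hit' records for a DBLP full-text query."""
    params = urllib.parse.urlencode({"q": query, "format": "json", "h": hits})
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    # When there are no matches, DBLP omits the 'hit' key entirely.
    return data["result"]["hits"].get("hit", [])

if __name__ == "__main__":
    for hit in dblp_search("retrieval augmented generation"):
        info = hit["info"]
        # Mirror the review's publication-date criterion (January 2020 onwards);
        # the citation thresholds must be applied in a separate screening step.
        if int(info.get("year", 0)) >= 2020:
            print(info.get("year"), info.get("title"))
```

A sketch like this covers only the retrieval step of the search process; deduplication against the other four databases and the screening stages described in the methodology would follow downstream.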