Article

Retrieval-Augmented Generation vs. Baseline LLMs: A Multi-Metric Evaluation for Knowledge-Intensive Content

by Aparna Vinayan Kozhipuram 1,*, Samar Shailendra 2,* and Rajan Kadel 2

1 School of IT and Engineering (SITE), Melbourne Institute of Technology (MIT), Sydney, NSW 2000, Australia
2 School of IT and Engineering (SITE), Melbourne Institute of Technology (MIT), Melbourne, VIC 3000, Australia
* Authors to whom correspondence should be addressed.
Information 2025, 16(9), 766; https://doi.org/10.3390/info16090766
Submission received: 1 July 2025 / Revised: 25 August 2025 / Accepted: 1 September 2025 / Published: 4 September 2025

Abstract

(1) Background: The development of Generative Artificial Intelligence (GenAI) is transforming knowledge-intensive domains such as Education. However, Large Language Models (LLMs), which serve as the foundational components for GenAI tools, are trained on static datasets and often produce misleading, factually incorrect, or outdated responses. Our study explores the performance gains of Retrieval-Augmented LLMs over baseline LLMs while also identifying the trade-off opportunity between smaller-parameter LLMs augmented with user-specific data and larger-parameter LLMs. (2) Methods: We experimented with four different LLMs, each with a different number of parameters, to generate outputs. These outputs were then evaluated across seven lexical and semantic metrics to identify performance trends in Retrieval-Augmented Generation (RAG)-Augmented LLMs and analyze the impact of parameter size on LLM performance. (3) Results and Discussion: We synthesized 968 output combinations to identify these trends across LLMs of different sizes/parameter counts: TinyLlama 1.1B, Mistral 7B, Llama 3.1 8B, and Llama 1 13B. The results were grouped into two themes: percentage improvements of RAG-Augmented LLMs over baseline LLMs, and trade-off possibilities of RAG-Augmented smaller-parameter LLMs against larger-parameter LLMs. Our experiments show that RAG-Augmented LLMs achieve higher lexical and semantic scores than baseline LLMs. This positions RAG-Augmented LLMs as a compelling trade-off for reducing the number of parameters in LLMs and lowering overall resource demands. (4) Conclusions: The findings show that, by leveraging RAG-Augmentation, smaller-parameter LLMs can perform equivalently to or better than larger-parameter LLMs, particularly demonstrating strong lexical improvements. They reduce the risk of hallucination and keep the output more contextualized, making them a better choice for knowledge-intensive content in academic and research sectors.

1. Introduction

Within knowledge-intensive applications such as Education, Large Language Models (LLMs) facilitate interactive, real-time learning experiences, enhancing the accessibility and comprehension of complex subject matter. LLMs can process and analyze large volumes of text quickly, making them ideal for tasks like class assistance and content generation [1]. In addition, LLMs demonstrate remarkable capabilities in executing tasks that demand extensive memorization of factual data [2]. However, the pre-trained data in LLMs do not reflect recent events, and their memorization capabilities are constrained when dealing with less frequent subjects [3,4,5,6]. State-of-the-art LLMs encounter the well-known “hallucination” problem [7,8] and temporal degradation [9]. Retrieval-Augmented Generation (RAG) [10,11,12] has recently been recognized as an effective approach to address LLMs’ lack of up-to-date knowledge and has drawn considerable attention from both academic research and sectors involved in AI-driven solutions. The RAG system searches its knowledge sources to retrieve relevant information and provides grounded answers [11,13,14]. Numerous studies emphasize novel retrieval strategies (document relevance, context length, etc.) to improve LLM performance; however, they fail to provide a comprehensive comparative analysis of the impact that RAG can have.
To mitigate this gap, the key contribution of this research is as follows:
  • Analyze the text quality improvement of RAG-Augmented LLMs over baseline LLMs for knowledge-intensive content;
  • Evaluate the potential trade-off between RAG-Augmented smaller LLMs and larger-parameter baseline LLMs for performance efficiency and resource utilization.
This paper is structured as follows. Section 2 presents the background studies and the terms used; Section 3 presents the data sources, ground truth benchmark, evaluation metrics used, and the methodology. Section 4 presents the results obtained from the comparative analysis. Finally, Section 5 concludes the paper with possible directions for future work. The abbreviations used in the manuscript are listed in the Abbreviations.

2. Background

The rapid integration of Artificial Intelligence (AI) spanning multiple sectors has been transformative, particularly through the emergence of LLMs. In exploring this evolving landscape, we extensively reviewed the existing literature to identify research gaps, particularly in the comparative performance of LLMs and RAG frameworks.

2.1. Artificial Intelligence (AI) and Large Language Model (LLM)

Research in AI has been a persistent effort of scientists and engineers for more than half a century. LLMs are designed to process information efficiently and excel at tasks such as factual recall, pattern recognition, and language generation [15,16,17]. AI is increasingly being integrated into higher education, offering transformative opportunities while also presenting significant challenges. Natural Language Processing (NLP) [18] is a core subset of AI, primarily utilizing LLMs as tools to analyze and search information and to generate human language. LLMs are trained on extensive datasets, enabling them to perform various language-related tasks such as translation, summarization, and question answering. They can be general-purpose or task-specific and have shown promise in multiple applications, particularly in science and medicine.
Consequently, when LLMs are intended for deployment in less-resourced domains, customization becomes imperative to ensure optimal performance. In AI-based tutoring applications, a significant constraint of LLMs is their tendency to generate hallucinated responses, wherein the model generates factually inaccurate or misleading explanations, potentially affecting the accuracy and reliability of pedagogical assistance [19]. To enhance the efficacy of LLMs in particular tasks or fields, additional customization of general-purpose LLMs is necessary. As a result of this necessity, various methodologies have been developed to improve the performance of these LLMs, including fine-tuning, Parameter-Efficient Fine-Tuning (PEFT), and prompt engineering. Fine-tuning [20] of an LLM involves training all parameters of the pre-trained LLM on a specific dataset to improve performance on particular task(s). Even though fine-tuning leads to better accuracy on particular tasks, its notable computational and memory requirements cannot be ignored. Meanwhile, PEFT [21] aims to reduce the adjusted parameters in LLMs. Faster learning performance, along with the reduced memory requirement and computational cost, makes this strategy more efficient. However, the efficiency of PEFT depends on the complexity of the tasks and specific techniques employed, as it just updates a limited number of parameters compared to fine-tuning. Prompt engineering [21] avoids retraining network weights and instead involves the careful formulation of input prompts directed towards the LLM to produce desired output. This approach includes zero-shot prompting, few-shot prompting, and chain-of-thought prompting, each offering a way to guide the LLM’s response without direct modification of its parameters. However, training and fine-tuning of these LLMs remain computationally intensive and economically challenging, especially for smaller and medium-sized organizations.
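To make the distinction concrete, the short sketch below contrasts a zero-shot and a few-shot prompt for the same nutrition question; the prompt wording is purely illustrative and does not correspond to the prompts used in this study.

```python
# Illustrative zero-shot vs. few-shot prompts (hypothetical wording); prompt
# engineering changes only the input text, never the model weights.
zero_shot_prompt = "Question: What is the main function of dietary fiber?\nAnswer:"

few_shot_prompt = (
    "Question: Which vitamin is synthesized in the skin upon sun exposure?\n"
    "Answer: Vitamin D.\n\n"
    "Question: What is the main function of dietary fiber?\n"
    "Answer:"
)
# Both prompts would be sent unchanged to the same LLM; the worked example in
# few_shot_prompt is what distinguishes few-shot from zero-shot prompting.
```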

2.2. Emergence of Retrieval-Augmented Generation (RAG)

A drastic improvement in the quality of generated text across tasks such as summarization, question answering, and open-domain dialogue has been observed with recent advancements in large-scale LLMs. RAG [12,14,22,23] is one of the most promising directions for improving the factual accuracy and contextual relevance of generated responses [24]. RAG architectures consist of two components: a retriever that retrieves relevant external information based on the input query and a generator that produces text informed by both the query and the retrieved content. Recent advancements have led to the emergence of novel RAG variants that improve retrieval-generation alignment and contextual relevance. Self-RAG incorporates a feedback mechanism that enhances model quality and factuality through retrieval and self-reflection [25]. Similarly, the Hypothetical Document Embeddings (HyDE) method first generates a hypothetical answer based on the query and then retrieves documents aligned with that generated content, improving retrieval accuracy in ambiguous or open-ended scenarios [26]. Another framework that enhances the retrieval-generation interface is the Fusion-in-Decoder (FiD) architecture, which attends to all retrieved documents simultaneously during decoding, contributing to improved multi-passage integration; FiD-RAG combines this strategy with traditional RAG pipelines to improve factual consistency [27]. Unlike traditional RAG systems that retrieve from unstructured vector stores such as Facebook AI Similarity Search (FAISS) [28], GraphRAG [29] and the subsequent Graph Retrieval Augmented Fine-Tuning (GRAFT) [30] innovate by fusing graph-grounded retrieval with fine-tuning techniques tailored for domain adaptation. These advancements reflect a growing shift toward more adaptive and semantically guided retrieval techniques.
Beyond architectural improvements, RAG has also found increasing utility in domain-specific applications, particularly in the legal and healthcare sectors. In the legal domain, RAG models facilitate the retrieval of relevant case law, statutes, and legal precedents, enabling context-aware legal assistance and improving transparency in argumentation [31,32]. In healthcare, RAG has proven efficacy in retrieving medical literature, clinical guidelines, and evidence-based resources to support treatment recommendations and physicians’ voice recording for user confidence interaction, thereby enhancing the reliability of AI-based clinical tools [33]. The application of RAG in the context of education is an emerging field that leverages the capabilities of LLMs to elevate learning experiences in domains including computer science, cybersecurity, and general higher education [34,35,36]. Recent studies have explored various applications of RAG frameworks across multiple domains. For instance, Zhu et al. [37] focus on RAG technology significantly enhancing data handling, clinical decision support, and patient management in Electronic Health Records. This dual approach allows for improved accuracy and richness in processing complex medical data, which is essential for effective healthcare delivery. Similarly, RAG offers a novel approach to video retrieval, departing from traditional methods that rely on frame-level descriptions through a context-aware dynamic fragmenting strategy that adaptively adjusts segmentation thresholds according to the complexity and information density of the video [38]. While the integration of LLMs into educational settings offers clear benefits, it also raises important ethical concerns. In particular, concerns around transparency and learner dependency require attention. General-purpose LLMs often operate as black-box systems, offering limited insight into how responses are generated or whether they are grounded in verifiable sources. This lack of transparency can be problematic when used in instructional settings. To mitigate these risks, our approach incorporates RAG, which supports transparency by explicitly retrieving context from external documents and helps reduce dependency by anchoring responses in credible sources. Furthermore, by integrating RAG with LLMs, this research seeks to improve the overall performance in answer relevance, indicating a strategic choice to leverage RAG’s capabilities rather than focusing solely on fine-tuning or PEFT methods [39].

2.3. Evaluation Metrics

To evaluate the outputs of the LLMs quantitatively, a combination of lexical and semantic similarity scores, as shown in Table 1, is employed to assess both the surface form and the deeper semantic meaning of the generated output. A lexical similarity score measures the extent of overlap between the vocabularies of two texts, languages, or dialects. These metrics emphasize surface-level attributes such as word matching or n-gram overlap. The quantitative measure lies between 0 and 1, with 0 being the least match and 1 being the highest match. Even though they can be computed quickly, they are sensitive to factors like spelling variations, punctuation, synonyms, and word order. Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1/2/L are some of the most widely utilized lexical similarity scores.
In contrast, semantic similarity explores the underlying meaning of the text, offering deeper insights into text relationships. It measures how closely two pieces of text (such as words, phrases, or sentences) align in meaning. Scores based on Bidirectional Encoder Representations from Transformers (BERT)—Precision, Recall, and F1—serve as the semantic similarity metrics in this study. Further elaboration of each score is outlined below.
Bilingual Evaluation Understudy (BLEU) [40] precisely measures n-gram overlap, penalizing missing or incorrect segments; these scores contribute as a standard measure in machine translation to flag when the LLM omits key phrases. On the other hand, it can reward overly short outputs, since it ignores recall.
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [41] acts as a straightforward assessment of lexical similarity. It is widely used in summarization and question answering to determine how much of the reference content the LLM covers. ROUGE-1 evaluates the overlap of unigrams (single words) between the generated and reference texts. As our focus is to capture more contextual information, ROUGE-2 [42] acts as a good measure to evaluate the overlap of bigrams between the generated and reference summaries. At the same time, ROUGE-L (longest common subsequence) measures how much of the longest matching word sequence between the generated and reference texts is preserved, in the correct order, giving a balanced view of structure and content accuracy.
The lexical approach still operates on the surface, offering a middle-ground granularity, but it can miss semantic equivalence. To measure semantic precision, we used BERT Precision, computed from deep contextual embeddings, to check that the LLM’s answer is semantically correct relative to the reference. Complementing BERT Precision, BERT Recall measures how much of the reference meaning is captured by the answer. This ensures that the LLM not only stays accurate but also covers the full scope of the reference answer. The BERT F1 score is the harmonic mean of semantic precision and recall, capturing both completeness and accuracy with relatively low computational complexity compared to its counterparts (e.g., Sentence-BERT). This choice aligns with prior LLM evaluation studies and balances computational efficiency with robustness.
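As a concrete illustration, the sketch below computes all seven scores for one generated answer against a reference using the nltk, rouge_score, and bert_score Python packages; the example strings and the specific packages are illustrative assumptions rather than the exact toolchain of this study.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Dietary fiber promotes healthy digestion and helps regulate blood glucose."
candidate = "Fiber in the diet supports digestion and helps control blood sugar levels."

# Lexical metrics: BLEU (n-gram precision) and ROUGE-1/2/L (unigram, bigram, LCS recall).
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                 use_stemmer=True).score(reference, candidate)

# Semantic metrics: BERT Precision, Recall, and F1 in contextual embedding space.
P, R, F1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU={bleu:.3f}  ROUGE-1={rouge['rouge1'].recall:.3f}  "
      f"ROUGE-2={rouge['rouge2'].recall:.3f}  ROUGE-L={rouge['rougeL'].recall:.3f}")
print(f"BERT P={P.item():.3f}  R={R.item():.3f}  F1={F1.item():.3f}")
```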

2.4. Gaps in Existing Research

Numerous comparative studies have been carried out across different RAG-based LLMs, such as Llama [39], Claude Sonnet [43], Gemini [44], and GPT-4 [45,46], with a focus on capabilities such as noise robustness, knowledge gap detection, and external truth integration. However, a significant research gap persists in comparing RAG-Augmented LLMs with baseline LLMs with respect to parameter sizes. Such a comparison could be pivotal, as it would unveil the extent to which the integration of retrieval mechanisms affects the efficiency and scalability of baseline LLMs. Furthermore, it aids in comprehending the trade-off possibilities among different LLMs, which can optimize LLM performance. This research investigates these gaps by assessing the effectiveness of RAG across parameter sizes and whether augmenting a smaller LLM with a RAG system serves as a valuable trade-off to a larger baseline LLM. The following research questions guide this investigation:
  • RQ1: Does RAG consistently improve output quality across LLM sizes?
  • RQ2: How does LLM size influence the benefits gained from retrieval?
Our study systematically compares eight LLM configurations with different parameter sizes: RAG-Augmented TinyLlama, RAG-Augmented Mistral 7B [47], RAG-Augmented Llama 3.1 8B, RAG-Augmented Llama 1 13B, TinyLlama [48], Mistral 7B [48], Llama 3.1 8B, and Llama 1 13B. Therein, we lay a solid foundation to identify the performance trends. In this study, we focus on using small-parameter-size LLMs and observing their performance improvement with RAG. This provides a benchmark for the adoption of LLMs in resource-constrained environments.

3. Research Methodology

This paper examines the performance of RAG-Augmented small-parameter LLMs in comparison to large-parameter baseline models using multiple text similarity metrics. Our research methodology comprises three distinct phases: Foundation, Implementation, and Analysis; these are illustrated in Figure 1. The subsequent subsections provide the details about these phases.

3.1. Foundation

This subsection provides details on data creation and the ground truth benchmark to establish a foundational level of accuracy and reliability of the dataset. A ground truth benchmark provides a reference point against which the quality and validity of LLMs’ outputs can be evaluated. The following section details the methodology used for creating data and defining clear benchmarks, offering a preliminary framework to improve overall outcomes.

3.1.1. Data Creation

We emphasized the significance of a credible dataset that demonstrates the efficacy of the RAG framework. Data were retrieved from the academic publication “Human Nutrition: 2020 Edition” from the University of Hawai‘i at Mānoa, which is recognized as a reliable and comprehensive source of nutritional knowledge. Its open-access format ensures that the data are publicly available and appropriately licensed for research purposes. The “Human Nutrition: 2020 Edition” textbook provides comprehensive coverage of essential nutrients, dietary guidelines, and nutrition across the lifespan. Such foundational information remains pertinent for understanding basic nutritional principles and their application. This textbook serves as the Open Educational Resource (OER) for the FSHN 185 “The Science of Human Nutrition” course at the University of Hawai‘i at Mānoa. OER materials are often subject to peer review and are designed to meet educational standards, ensuring the reliability and quality of the information presented. While the data are sourced from 2020, the fundamental principles of human nutrition remain consistent over time. Thus, utilizing these data is appropriate for evaluating the performance of RAG LLMs in answering nutrition-related questions. We used the PDF format of the textbook to integrate into the RAG framework. Upon curation of the data, our focus was on formulating a set of queries directed to the LLMs, alongside the corresponding ground truth, for reference while evaluating the performance of the LLMs and RAG.

3.1.2. Ground Truth Benchmark

In this section, we elaborate on the development of a ground truth benchmark designed to evaluate the quality of automated responses generated from LLMs. Given the high costs associated with manually creating human-written queries and answers and the inability to use ground truth benchmarks from other domains, we adopted a semi-automatic approach to generate the benchmark. We extracted meaningful passages from the “Human Nutrition: 2020 Edition” textbook to formulate queries and answers based on these passages. To ensure the quality of the generated benchmark, we conducted a comparison of the queries and answers against the passages (i.e., the context from which the question and answers originated). These were reviewed by two human subject-matter experts for factual correctness and contextual alignment with the source material. The ground truth benchmark we established consists of a set of triples containing a query, an answer, and the passage. Finally, we utilized the formulated ground truth benchmark to evaluate different LLM outputs.
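For illustration only, one such triple can be stored as a simple JSON record; the field names and example content below are assumptions about the layout, not the actual benchmark entries.

```python
import json

# Hypothetical layout of a ground-truth triple (query, answer, source passage).
benchmark = [
    {
        "query": "What are the three classes of macronutrients?",
        "answer": "Carbohydrates, proteins, and lipids are the three classes of macronutrients.",
        "passage": "Macronutrients are nutrients that the body requires in large amounts: "
                   "carbohydrates, proteins, and lipids ...",
    },
]

with open("ground_truth.json", "w", encoding="utf-8") as f:
    json.dump(benchmark, f, indent=2)
```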

3.2. Implementation

The second tier of the methodology framework focuses on baseline LLM selection criteria, followed by detailed insights on the RAG-Augmentation mechanism. Finally, the evaluation metrics used to quantify the lexical and semantic similarity score of the LLM outputs are presented. Table 2 summarizes the evaluated LLMs, including model family, parameter count, minimum GPU memory (VRAM) requirement [49,50], and strengths and limitations. The following subsections provide the experimental setup and evaluation approach.

3.2.1. LLM Architecture

This subsection provides the strengths and limitations of baseline LLMs, alongside the parameter sizes and the technical architecture of the RAG mechanism implemented.
Baseline LLMs
While recognizing the relevance and widespread adoption of larger models such as GPT-3.5 and GPT-4, their closed-source nature restricts opportunities for direct experimentation and comparative analysis. Consequently, we selected four baseline LLMs of increasing parameter sizes that share similar transformer-based architectures (Table 2). The selection of these LLMs aligns with our objective of promoting ethical AI practices and enhancing accessibility in educational settings. We achieved this goal by selecting compact LLMs that can run locally, as opposed to large LLMs that rely on cloud computing. Accordingly, we selected TinyLlama 1.1B, Mistral 7B, Llama 3.1 8B, and Llama 1 13B.
RAG-Augmented LLMs
This study proposes implementing baseline LLMs with a mechanism referred to as RAG. A simple RAG mechanism combining retrieval mechanisms with generative LLMs is used to improve the accuracy and relevance of the outputs. RAG supports plain text, PDF files, and websites as external database sources. Therefore, for our study, we relied on the PDF version of the nutrition book.
The workflow of the RAG-Augmented LLM (Figure 2) is structured into three stages: Data Preprocessing and Indexing, Contextual Retrieval, and Content Generation. Each stage plays a distinct role in enabling effective retrieval-augmented generation and is examined in detail below to illustrate the complete operational workflow of the system.
(a)
Data Preprocessing and Indexing
Initially, the URL to the PDF version of the “Human Nutrition” book was incorporated into the RAG framework. For data processing, the text was tokenized into smaller chunks, such as sentences and words, to facilitate efficient indexing and retrieval. Each chunk was then converted into embeddings (vectors) that capture its semantic meaning using the sentence-transformer model “all-MiniLM-L6-v2” [51]. The all-MiniLM-L6-v2 model was selected for its strong trade-off between accuracy and speed, achieving nearly the same retrieval performance as BGE-Large and instructor-tuned variants while being faster and significantly smaller in size [52]. This efficiency made it more suitable for large-scale, multi-query RAG evaluations without compromising output quality. These embeddings were stored in the vector database along with metadata and the original text, ensuring compatibility with the vector-based search engine utilized in the retrieval stage [53].
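A minimal sketch of this stage is given below, assuming the textbook PDF is available locally and using pypdf, sentence-transformers, and FAISS; the fixed 200-word chunking rule and the file name are illustrative assumptions.

```python
import numpy as np
import faiss
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

# 1. Extract raw text from the textbook PDF (file name is an assumption).
pages = [page.extract_text() or "" for page in PdfReader("human_nutrition_2020.pdf").pages]
text = "\n".join(pages)

# 2. Tokenize into chunks; a naive fixed-size word window stands in here for the
#    sentence/word-level chunking described above.
words = text.split()
chunks = [" ".join(words[i:i + 200]) for i in range(0, len(words), 200)]

# 3. Embed each chunk with all-MiniLM-L6-v2 and store the vectors in a FAISS index.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))
# Chunk text and metadata would be kept alongside, keyed by vector position.
```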
(b)
Contextual Retrieval
The FAISS [28] vector database was utilized to efficiently manage and retrieve the vast amount of preprocessed data. The FAISS search engine organizes these vectors into an index that ensures quick and accurate retrieval based on a query. The indexing algorithm ensured that similar documents were grouped, enhancing retrieval accuracy. This configuration enabled proficient searching and retrieval of relevant documents. Once a query is input, the top five passages are retrieved, ranked by their similarity scores [54].
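Continuing the preprocessing sketch above, retrieving the top five passages reduces to embedding the query and searching the FAISS index; the query text is illustrative.

```python
import numpy as np

def retrieve(query: str, k: int = 5):
    """Return the k most similar chunks and their similarity scores."""
    q_vec = embedder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q_vec, dtype="float32"), k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])]

top_passages = retrieve("What are the functions of dietary fiber?")
```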
(c)
Content Generation
The retrieved top five passages are forwarded to the LLMs to generate responses augmented with external information. Here, we implement a LangChain-style RAG pipeline, wherein a query is simultaneously directed to both the retriever and the LLM. The retriever fetches relevant context and combines it with the query to create a well-structured prompt for the LLM. The LLM then formulates a response based on the given prompt, which is passed through the output parser to return a clean string, ensuring the final answer is plain text. RAG-Augmentation enhanced the LLM’s ability to generate responses that are more accurate and contextually relevant. The modified architecture leveraged the strengths of both the LLM and the RAG framework, resulting in a more robust and reliable language model.
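A LangChain-style assembly of this stage might look like the sketch below; the Ollama wrapper and the model name are assumptions standing in for whichever local LLM backend is used, the prompt wording is illustrative, and the retrieve() helper comes from the retrieval sketch above.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_community.llms import Ollama  # assumption: any LangChain-compatible local LLM works here

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
llm = Ollama(model="mistral")  # assumption: Mistral 7B served locally

def format_context(query: str) -> str:
    # Join the top-5 retrieved passages (see the retrieval sketch) into one context block.
    return "\n\n".join(passage for passage, _ in retrieve(query))

rag_chain = (
    {"context": RunnableLambda(format_context), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()  # return the final answer as a clean plain-text string
)

answer = rag_chain.invoke("What are the functions of dietary fiber?")
```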
For each of the four baseline LLMs and their RAG-Augmented counterparts, we generated outputs. Specifically, each LLM was evaluated using 11 carefully selected queries, designed to span a range of knowledge-intensive tasks relevant to the educational domain, including factual recall, conceptual explanation, and domain-specific reasoning. The number 11 was chosen to provide sufficient diversity without introducing evaluation fatigue or overwhelming variability, thereby striking a practical balance between depth and breadth of testing. For every LLM, the set of 11 queries was executed 11 times, generating 121 unique outputs per LLM to account for output variability and ensure reproducibility. To maintain consistency and comparability, identical prompts were used across all LLMs. In total, 968 outputs were collected (8 LLMs × 11 queries × 11 iterations). These outputs were subsequently evaluated using seven distinct evaluation metrics, forming the base for the study’s analysis and findings.
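The overall collection procedure can be summarized by a simple nested loop; the model names and the run_model() stub below are placeholders for the actual baseline and RAG-Augmented pipelines.

```python
# Sketch of the collection loop: 8 LLM configurations x 11 queries x 11 runs = 968 outputs.
MODEL_NAMES = [
    "tinyllama", "mistral-7b", "llama3.1-8b", "llama1-13b",
    "rag-tinyllama", "rag-mistral-7b", "rag-llama3.1-8b", "rag-llama1-13b",
]
queries = [f"query-{i}" for i in range(11)]  # placeholder for the 11 benchmark queries

def run_model(name: str, query: str) -> str:
    """Stub: dispatch the query to the named baseline or RAG-Augmented pipeline."""
    return f"[{name}] answer to: {query}"

outputs = [
    {"model": name, "query": q, "run": r, "answer": run_model(name, q)}
    for name in MODEL_NAMES
    for q in queries
    for r in range(11)  # 11 repetitions per query to capture output variability
]
assert len(outputs) == 8 * 11 * 11 == 968
```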

3.2.2. Evaluation Criteria

To evaluate each LLM quantitatively, we utilized the lexical and semantic similarity score (Table 1). The combination of these metrics was used to obtain a surface-level evaluation, as well as the exterior and deeper semantic meaning.

Lexical Similarity Score

In measuring the lexical similarity scores, we utilized BLEU and ROUGE-1/2/L, as listed in Table 1, to measure the extent of overlap between the vocabularies of two texts.

Semantic Similarity Score

In contrast, the semantic similarity scores utilized BERT Precision, Recall, and F1 score, as listed in Table 1, to explore the underlying meaning of the text, offering deeper insights into text relationships.
Together, these two metric families provide a multi-faceted evaluation that detects both superficial and substantive differences in LLM outputs and properly assesses the impact of RAG on reducing hallucinations and improving answer quality.

3.3. Analysis

In this paper, we measured and compared the performance of RAG-Augmented LLMs and baseline LLMs. We calculated the mean of the evaluation metrics (Table 1) for each LLM output. To assess the reliability of these estimates, we also computed a 90% confidence interval, indicating the range within which the true mean is expected to lie with 90% confidence.
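A minimal sketch of this aggregation, assuming the per-output metric values for one LLM are collected in an array, computes the mean and a two-sided 90% confidence interval with a t-distribution:

```python
import numpy as np
from scipy import stats

def mean_with_90ci(values):
    """Mean and 90% confidence interval of one metric over the 121 outputs of an LLM."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)      # standard error of the mean
    t_crit = stats.t.ppf(0.95, df=n - 1)       # two-sided 90% CI uses the 0.95 quantile
    return mean, (mean - t_crit * sem, mean + t_crit * sem)

# Example with illustrative (not actual) BLEU values:
mean, ci = mean_with_90ci(np.random.default_rng(0).uniform(0.05, 0.08, size=121))
```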

4. Results and Discussion

This section outlines the results of the lexical and semantic similarity metrics for the text generated by the baseline LLMs and RAG-Augmented LLMs for the user queries. We ran these tests on Python v3.12.3 using an Intel i9 processor with 64 GB RAM and an Nvidia RTX 4090 GPU, as well as Google Colab Pro (4 CPUs, 13 GB RAM with an Nvidia T4 Tensor Core GPU). For these results, a single query was executed 11 times to capture 11 LLM outputs, each evaluated using the seven evaluation metrics. We then performed the same analysis with 11 distinct queries. Overall, for each evaluation metric, we gathered 121 results per LLM and computed their mean, facilitating a statistical analysis of LLM performance. This type of analysis provides a comprehensive understanding of LLM performance and lays the groundwork for the development of more effective and reliable LLMs. We also used research work and publications from the last month, i.e., later than the training cutoff date of the LLMs. The results have been categorized into two sections.
The first section compares the baseline LLMs with RAG-Augmented LLMs, highlighting the LLM performance improvement when using RAG. In the second section, a comparative analysis lists the trade-off possibilities of a small RAG-Augmented LLM compared to larger-parameter baseline LLMs. By evaluating four LLMs in both baseline and RAG-Augmented form, with 121 outputs per configuration, our study comprises a comprehensive assessment of 968 LLM outputs evaluated across seven evaluation metrics.
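The percentage improvements reported throughout this section follow the usual relative-gain formula, (RAG score − baseline score) / baseline score × 100; for example, applying it to the BLEU means later listed in Table 4 gives roughly 685%, consistent with the reported 684.80% up to rounding of the displayed values. A one-line helper makes this explicit:

```python
def pct_improvement(rag_score: float, baseline_score: float) -> float:
    """Relative gain of a RAG-Augmented LLM over its baseline, in percent."""
    return (rag_score - baseline_score) / baseline_score * 100

# Illustrative check against the BLEU means in Table 4 (values rounded to five decimals):
print(round(pct_improvement(0.06431, 0.00819), 1))  # ~685.2, matching 684.80% up to rounding
```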

4.1. Comparative Analysis: RAG-Augmented LLM vs. Baseline LLM

In this section, we analyze how RAG-Augmented LLMs compare to baseline LLMs using the lexical and semantic similarity scores listed in Table 1.
Figure 3 presents the absolute scores achieved by each LLM, whereas Table 3 summarizes the percentage improvement achieved through RAG-Augmentation on lexical and semantic scores. Higher lexical similarity scores signify that the RAG-generated text is more fluent and has captured the important aspects of the original text. At the same time, a higher semantic similarity score indicates that RAG-Augmented text is semantically closer to the actual text, i.e., it conveys the context and meaning better than the baseline LLM. We observed that RAG-Augmented LLMs consistently enhanced every lexical metric, including BLEU and ROUGE-1/2/L. Llama 3.1 8B achieved the highest benefit for the BLEU score, indicating that the baseline LLM often paraphrases the content, whereas the RAG-Augmented LLM reuses nouns directly from the retrieved passages instead of paraphrasing them. Similarly, Llama 1 13B, which possesses enough capacity to recall many terms unaided, showed a small lexical gap, yet RAG-Augmentation still uplifted the BLEU score. In the case of Mistral 7B, the baseline LLM tended to prioritize concise, on-point phrasing, leading to relatively high lexical overlap. Nonetheless, retrieval provided marginal gains, inserting the last missing nouns and nudging the BLEU score upwards. It is important to note that RAG can still improve LLM performance, with generally better gains observed for lower-parameter LLMs. Considering the ROUGE-L score, Llama 3.1 8B once again benefited from RAG-Augmentation by extracting longer, sequential segments from the retrieved passages to refine the LLM output through restructuring of the text. Similar patterns were observed across TinyLlama, Mistral 7B, and Llama 1 13B. In conclusion, these findings suggest that retrieval provides significant percentage improvements when the baseline LLM exhibits weak lexical performance.
Secondly, we conducted a comparative analysis of semantic similarity scores across the LLMs. RAG-Augmentation consistently improved the semantic similarity scores—BERT Recall and F1 score—across all LLMs (Figure 4). Among them, Mistral 7B achieved the highest gain in semantic similarity. However, the magnitude of this improvement remained relatively modest. This suggests that both the baseline and RAG-Augmented LLM outputs already capture the deep meaning of the content, while retrieval refines the phrasing rather than enriching the meaning. Similar trends were observed for TinyLlama 1.1B, Llama 3.1 8B, and Llama 1 13B (Table 3). These findings imply that while RAG contributes to semantic enhancements, its most significant role may lie in improving lexical precision rather than meaning reconstruction when the baseline LLM already exhibits strong semantic understanding.
In summary, retrieval augmentation yields advantages in applications where lexical precision matters, such as quotable answers, closed-book exams, and fact-checking. Importantly, the magnitude of the gains depends on the baseline LLM performance; weaker-performing LLMs benefit most from RAG-Augmentation, as retrieval supplies nouns matching the reference texts. Conversely, as baseline LLM performance increases, the relative gains from retrieval diminish, offering only incremental improvements. With the analysis of lexical and semantic similarity, we can conclude with 90% confidence that there is a significant improvement in the performance of the baseline LLM when augmented with RAG.

4.2. Comparing RAG-Augmented Small LLMs with Large-Scale Baseline LLMs

In this section, we compare smaller RAG-Augmented LLMs with larger-parameter LLMs using the evaluation metrics mentioned in Table 1 to explore the impact of, and trade-off between, using RAG vs. increasing the LLM parameters. A comparative analysis was conducted considering three cases:
  • Baseline Llama 3.1 8B to RAG-Augmented Mistral 7B;
  • Baseline Llama 1 13B to RAG-Augmented Mistral 7B;
  • Baseline Llama 1 13B to RAG-Augmented Llama 3.1 8B.
  • Case #1: Baseline Llama 3.1 8B to RAG-Augmented Mistral 7B
In this comparison, the RAG-Augmented Mistral 7B was evaluated against a different baseline LLM—Llama 3.1 8B—to explore the potential trade-off possibilities across parameter scale. Our comparative analysis, as shown in Figure 5, indicates that RAG-Augmented Mistral 7B acts as an effective trade-off to Llama 3.1 8B. The performance improvement of RAG-Augmented Mistral 7B over Llama 3.1 8B is presented in Table 4.
RAG-Augmented Mistral-7B retrieves nearly verbatim phrases from the source text, resulting in high n-gram overlap, thereby significantly elevating the BLEU score. Similarly, the ROUGE score signifies an increase in one, two, and long-sequence overlap as the RAG retrieves larger segments. On the contrary, the baseline Llama-3.1 8B demonstrated minimal direct word overlap with the reference text, resulting in a low lexical similarity score. While the semantic similarity exhibited modest gains with RAG, this improvement primarily reflects refinement in phrasing rather than significant changes in meaning, indicating that the LLM output already conveyed the core concept. In conclusion, if the primary focus is semantic fidelity alone, RAG-Augmentation offers only moderate benefits, while its impact on lexical accuracy is substantial. Therefore, for tasks requiring high lexical precision, RAG-Augmented Mistral 7B presents a compelling and resource-efficient alternative to larger Llama 3.1 8B LLM.
  • Case #2: Baseline Llama 1 13B to RAG-Augmented Mistral 7B
In this comparison, the RAG-Augmented Mistral 7B was evaluated against a different baseline LLM—Llama 1 13B—to explore the potential trade-off possibilities across parameter scale. This comparison reveals that RAG-Augmented Mistral 7B outperformed Llama 1 13B in both lexical and semantic similarity scores, as shown in Figure 6 and Table 5. When compared to Llama 1 13B, RAG-Augmented Mistral 7B showed a notable improvement in BLEU and ROUGE scores, indicating a stronger n-gram overlap. The improvements in BERT Recall and F1 score were more modest, at 4.6% and 8.4%, respectively, indicating that both LLMs capture the core concepts of the target responses; nonetheless, RAG-Augmented Mistral 7B refines the responses, resulting in more polished and coherent outputs.
  • Case #3: Baseline Llama 1 13B to RAG-Augmented Llama 3.1 8B
In this comparison, the RAG-Augmented Llama 3.1 8B was evaluated against the same baseline LLM—Llama 1 13B—to explore the trade-off possibilities across parameter scale. We also found that RAG-Augmented Llama 3.1 8B gained improvements compared to baseline Llama 1 13B, with improvements in both lexical and semantic similarity scores, as shown in Figure 7. This indicates that RAG-Augmentation serves as a real trade-off to larger-parameter baseline LLMs by increasing the n-gram overlap through reuse of phrases from the retrieved contents, as well as by integrating contextually accurate information (Table 6). Nevertheless, there was a slight decrease in BERT Precision for RAG-Augmented Llama 3.1 8B, reflecting the strong semantic capacity already embedded in the larger Llama 1 13B due to its substantial internalized knowledge and contextual understanding. RAG-Augmented Llama 3.1 8B can still be considered a viable trade-off to Llama 1 13B, especially if the goal is to reduce resource requirements while maintaining strong performance.
It is also interesting to note that with RAG-Augmented LLMs, the lexical score significantly improved due to improved context provided by the RAG, even though the output from both RAG-Augmented and baseline LLMs had closer semantic similarity, as indicated by the improvement in BERT scores.
Overall, we can provide the following insights based on the two analyses: for LLMs under 10 billion parameters, it is advisable to consistently incorporate retrieval, as the increase in quality surpasses that of transitioning to the subsequent dense tier. In scenarios where the GPU budget is constrained, it is recommended to select Mistral 7B combined with RAG instead of any standard 13-billion-parameter LLM, as it represents the optimal cost-to-quality ratio according to our evaluation. Compared with a larger model such as Llama 1 13B, which requires 13 GB of VRAM, RAG-Augmented Mistral 7B and Llama 3.1 8B achieve comparable performance with a lower memory requirement (7–8 GB), demonstrating a practical quality–compute trade-off. For minimal resource requirements with acceptable quality, utilize RAG-Augmented Mistral 7B as a balanced approach between efficiency and high quality; for high-resource capacity with near-maximum quality, implement Llama 1 13B while maintaining retrieval for tasks where recall is critical. It is interesting to note that retrieval enables smaller LLMs to ascend in output quality without necessitating an increase in LLM parameters, in turn limiting resource requirements.

5. Conclusions and Future Directions

In this paper, we examined the performance of RAG-Augmented LLMs in comparison to baseline LLMs across a range of parameter sizes for knowledge-intensive content, while also exploring the potential trade-off between RAG-Augmented smaller LLMs and larger-parameter LLMs. Our findings demonstrate that RAG-Augmentation significantly enhances the performance of baseline LLMs across both lexical and semantic similarity scores, with notable trade-off opportunities across LLM sizes. Lexical similarity metrics such as BLEU and ROUGE-1/2/L show consistent improvement, particularly in smaller LLMs where retrieval supplements missing vocabulary and structure. For instance, BLEU scores improved by 27.70% for TinyLlama, 60.50% for Mistral 7B, and 216.90% for Llama 3.1 8B, while ROUGE-L gains reached 30.50% for TinyLlama, 28% for Mistral 7B, and 145.10% for Llama 3.1 8B, highlighting RAG’s effectiveness in enhancing lexical precision and sequence alignment. RAG-Augmented LLMs benefit from retrieval’s ability to fill lexical gaps and enhance factual grounding, ensuring that responses are both contextually rich and supported by verifiable information. While semantic improvements, measured through BERT-based metrics, remain modest, they indicate that RAG primarily enhances surface-level accuracy and clarity. Overall, RAG proves especially effective for smaller LLMs, offering a cost-efficient alternative to larger architectures without sacrificing output quality, making it an ideal solution for knowledge-intensive applications.
Future work should explore domain-specific retrieval techniques and efficiency optimizations for real-time applications. Additionally, integrating human evaluation and cross-lingual analysis can further validate and expand the applicability of RAG-Augmented LLMs; incorporating human evaluation in particular is identified as a priority for future research. This study uses a single domain-specific source, which may limit generalizability; future research should incorporate broader benchmarks across multiple domains to strengthen robustness. Different RAG models (e.g., GraphRAG, HyDE) and evaluation scores can also be used to gain more insights into the performance and potential improvements in contextual text generation.

Author Contributions

Conceptualization, A.V.K., S.S., and R.K.; methodology, A.V.K., S.S., and R.K.; software, A.V.K. and S.S.; validation, A.V.K., S.S., and R.K.; formal analysis, A.V.K., S.S., and R.K.; investigation, A.V.K., S.S., and R.K.; resources, A.V.K.; data curation, A.V.K.; writing—original draft preparation, A.V.K.; writing—review and editing, S.S. and R.K.; supervision, S.S. and R.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI      Artificial Intelligence
BERT    Bidirectional Encoder Representations from Transformers
BLEU    Bilingual Evaluation Understudy
FAISS   Facebook AI Similarity Search
FiD     Fusion-in-Decoder
GenAI   Generative Artificial Intelligence
GRAFT   Graph Retrieval Augmented Fine-Tuning
HyDE    Hypothetical Document Embeddings
LLM     Large Language Model
NLP     Natural Language Processing
OER     Open Educational Resource
PEFT    Parameter-Efficient Fine-Tuning
RAG     Retrieval-Augmented Generation
ROUGE   Recall-Oriented Understudy for Gisting Evaluation

References

  1. Yan, L.; Sha, L.; Zhao, L.; Li, Y.; Martinez-Maldonado, R.; Chen, G.; Li, X.; Jin, Y.; Gašević, D. Practical and ethical challenges of large language models in education: A systematic scoping review. Br. J. Educ. Technol. 2024, 55, 90–112. [Google Scholar] [CrossRef]
  2. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  3. Gerritse, E.J.; Hasibi, F.; de Vries, A.P. Entity-aware transformers for entity search. In Proceedings of the 45th International ACM Sigir Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1455–1465. [Google Scholar]
  4. Kandpal, N.; Deng, H.; Roberts, A.; Wallace, E.; Raffel, C. Large language models struggle to learn long-tail knowledge. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 15696–15707. [Google Scholar]
  5. Mallen, A.; Asai, A.; Zhong, V.; Das, R.; Khashabi, D.; Hajishirzi, H. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 9802–9822. [Google Scholar] [CrossRef]
  6. Sun, K.; Xu, Y.; Zha, H.; Liu, Y.; Dong, X.L. Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 311–325. [Google Scholar] [CrossRef]
  7. Dhingra, B.; Cole, J.R.; Eisenschlos, J.M.; Gillick, D.; Eisenstein, J.; Cohen, W.W. Time-Aware Language Models as Temporal Knowledge Bases. Trans. Assoc. Comput. Linguist. 2022, 10, 257–273. [Google Scholar] [CrossRef]
  8. Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; Weston, J. Retrieval Augmentation Reduces Hallucination in Conversation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3784–3803. [Google Scholar] [CrossRef]
  9. Kasai, J.; Sakaguchi, K.; Takahashi, Y.; Le Bras, R.; Asai, A.; Yu, X.; Radev, D.; Smith, N.A.; Choi, Y.; Inui, K. Realtime qa: What’s the answer right now? Adv. Neural Inf. Process. Syst. 2023, 36, 49025–49043. [Google Scholar]
  10. Chen, Z.; Gu, Z.; Cao, L.; Fan, J.; Madden, S.; Tang, N. Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes. In Proceedings of the CIDR, Amsterdam, The Netherlands, 8–11 January 2023; pp. 1–7. [Google Scholar]
  11. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  12. de Luis Balaguer, M.A.; Benara, V.; de Freitas Cunha, R.L.; Estevão Filho, R.d.M.; Hendry, T.; Holstein, D.; Marsman, J.; Mecklenburg, N.; Malvar, S.; Nunes, L.O.; et al. RAG vs. Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. arXiv 2024, arXiv:2401.08406. [Google Scholar] [CrossRef]
  13. Chen, W.; Hu, H.; Chen, X.; Verga, P.; Cohen, W. MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5558–5570. [Google Scholar] [CrossRef]
  14. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997. [Google Scholar]
  15. Jiang, Y.; Li, X.; Luo, H.; Yin, S.; Kaynak, O. Quo vadis artificial intelligence? Discov. Artif. Intell. 2022, 2, 4. [Google Scholar] [CrossRef]
  16. Telenti, A.; Auli, M.; Hie, B.L.; Maher, C.; Saria, S.; Ioannidis, J.P. Large language models for science and medicine. Eur. J. Clin. Investig. 2024, 54, e14183. [Google Scholar] [CrossRef]
  17. Jeyaram, R.; Ward, R.N.; Santolini, M. Large language models recover scientific collaboration networks from text. Appl. Netw. Sci. 2024, 9, 64. [Google Scholar] [CrossRef]
  18. Peris, C.; Dupuy, C.; Majmudar, J.; Parikh, R.; Smaili, S.; Zemel, R.; Gupta, R. Privacy in the time of language models. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 1291–1292. [Google Scholar]
  19. Ho, H.T.; Ly, D.T.; Nguyen, L.V. Mitigating Hallucinations in Large Language Models for Educational Application. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Danang, Vietnam, 3–6 November 2024; pp. 1–4. [Google Scholar]
  20. Howard, J.; Ruder, S. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 328–339. [Google Scholar] [CrossRef]
  21. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. arXiv 2019, arXiv:1902.00751. [Google Scholar] [CrossRef]
  22. Soudani, H.; Kanoulas, E.; Hasibi, F. Data augmentation for conversational ai. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Singapore, 13–17 May 2023; pp. 5220–5223. [Google Scholar]
  23. Mosbach, M.; Pimentel, T.; Ravfogel, S.; Klakow, D.; Elazar, Y. Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 12284–12314. [Google Scholar] [CrossRef]
  24. Juvekar, K.; Purwar, A. Introducing a new hyper-parameter for RAG: Context Window Utilization. arXiv 2024, arXiv:2407.19794. [Google Scholar] [CrossRef]
  25. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-rag: Self-reflective retrieval augmented generation. arXiv 2023, arXiv:2310.11511. [Google Scholar]
  26. Gao, L.; Ma, X.; Lin, J.; Callan, J. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 1762–1777. [Google Scholar]
  27. Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; Merlo, P., Tiedemann, J., Tsarfaty, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 874–880. [Google Scholar] [CrossRef]
  28. Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2024, arXiv:2401.08281. [Google Scholar] [CrossRef]
  29. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2024, arXiv:2404.16130. [Google Scholar] [CrossRef]
  30. Clemedtson, A.; Shi, B. GraphRAFT: Retrieval Augmented Fine-Tuning for Knowledge Graphs on Graph Databases. arXiv 2025, arXiv:2504.05478. [Google Scholar]
  31. Hindi, M.; Mohammed, L.; Maaz, O.; Alwarafy, A. Enhancing the precision and interpretability of retrieval-augmented generation (rag) in legal technology: A survey. IEEE Access 2025, 13, 46171–46189. [Google Scholar] [CrossRef]
  32. Wiratunga, N.; Abeyratne, R.; Jayawardena, L.; Martin, K.; Massie, S.; Nkisi-Orji, I.; Weerasinghe, R.; Liret, A.; Fleisch, B. CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question Answering. In Proceedings of the 32nd International Conference, ICCBR 2024, Merida, Mexico, 1–4 July 2024. [Google Scholar]
  33. Stewart Kirubakaran, S.; Jasper Wilsie Kathrine, G.; Grace Mary Kanaga, E.; Mahimai Raja, J.; Ruban Gino Singh, A.; Yuvaraajan, E. A RAG-based Medical Assistant Especially for Infectious Diseases. In Proceedings of the 2024 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 24–26 April 2024; pp. 1128–1133. [Google Scholar] [CrossRef]
  34. Wong, L. Gaita: A RAG System for Personalized Computer Science Education. Master’s Thesis, Johns Hopkins University, Baltimore, MD, USA, 2024. [Google Scholar]
  35. Modran, H.A.; Bogdan, I.C.; Ursuțiu, D.; Samoilă, C.; Modran, P.L. LLM intelligent agent tutoring in higher education courses using a RAG approach. In Proceedings of the International Conference on Interactive Collaborative Learning; Springer: Cham, Switzerland, 2024; pp. 589–599. [Google Scholar]
  36. Zhao, C.; Agrawal, G.; Kumarage, T.; Tan, Z.; Deng, Y.; Chen, Y.C.; Liu, H. Ontology-Aware RAG for Improved Question-Answering in Cybersecurity Education. arXiv 2024, arXiv:2412.14191. [Google Scholar]
  37. Zhu, Y.; Ren, C.; Wang, Z.; Zheng, X.; Xie, S.; Feng, J.; Zhu, X.; Li, Z.; Ma, L.; Pan, C. Emerge: Integrating rag for improved multimodal ehr predictive modeling. arXiv 2024, arXiv:2406.00036. [Google Scholar]
  38. Chen, W.; Zhou, M.; Fan, X.; Zhou, L.; Zhu, S.; Cai, T. Application of Retrieval-Augmented Generation in video. In Proceedings of the 2024 12th International Conference on Information Systems and Computing Technology (ISCTech), Xi’an, China, 8–11 November 2024; pp. 1–5. [Google Scholar]
  39. Li, X. Application of RAG model based on retrieval enhanced generation technique in complex query processing. Adv. Comput. Signals Syst. 2024, 8. Available online: https://api.semanticscholar.org/CorpusID:272955388 (accessed on 23 August 2025). [CrossRef]
  40. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  41. Ng, J.P.; Abrecht, V. Better Summarization Evaluation with Word Embeddings for ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; Màrquez, L., Callison-Burch, C., Su, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 1925–1930. [Google Scholar] [CrossRef]
  42. Lin, C.Y.; Och, F. Looking for a few good metrics: ROUGE and its evaluation. In Proceedings of the Ntcir Workshop, Tokyo, Japan, 2–4 June 2004; pp. 1–8. [Google Scholar]
  43. Kurokawa, R.; Ohizumi, Y.; Kanzawa, J.; Kurokawa, M.; Sonoda, Y.; Nakamura, Y.; Kiguchi, T.; Gonoi, W.; Abe, O. Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s “Diagnosis Please” cases. Jpn. J. Radiol. 2024, 42, 1399–1402. [Google Scholar] [CrossRef] [PubMed]
  44. Islam, R.; Ahmed, I. Gemini-the most powerful LLM: Myth or Truth. In Proceedings of the 2024 5th Information Communication Technologies Conference (ICTC), Nanjing, China, 10–12 May 2024; pp. 303–308. [Google Scholar]
  45. Phan, H.; Acharya, A.; Chaturvedi, S.; Sharma, S.; Parker, M.; Nally, D.; Jannesari, A.; Pazdernik, K.; Halappanavar, M.; Munikoti, S.; et al. RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension. arXiv 2024, arXiv:2407.07321. [Google Scholar] [CrossRef]
  46. Xu, P.; Ping, W.; Wu, X.; Xu, C.; Liu, Z.; Shoeybi, M.; Catanzaro, B. ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities. arXiv 2024, arXiv:2407.14482. [Google Scholar] [CrossRef]
  47. Tsai, H.C.; Huang, Y.F.; Kuo, C.W. Comparative analysis of automatic literature review using mistral large language model and human reviewers. Res. Square, 2024; ahead of print. Available online: https://sciety.org/articles/activity/10.21203/rs.3.rs-4022248/v1 (accessed on 23 August 2025).
  48. Can, E.; Uller, W.; Vogt, K.; Doppler, M.C.; Busch, F.; Bayerl, N.; Ellmann, S.; Kader, A.; Elkilany, A.; Makowski, M.R.; et al. Large language models for simplified interventional radiology reports: A comparative analysis. Acad. Radiol. 2025, 32, 888–898. [Google Scholar] [CrossRef]
  49. Kim, T.; Wang, Y.; Chaturvedi, V.; Gupta, L.; Kim, S.; Kwon, Y.; Ha, S. LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), Jeju, Republic of Korea, 3–9 August 2024; pp. 6324–6332. [Google Scholar] [CrossRef]
  50. Thor, W.M. How To Calculate GPU VRAM Requirements for an Large-Language Model. ApX Machine Learning Blog Post. Last Updated 15 July 2025. Available online: https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm (accessed on 3 June 2025).
  51. Yin, C.; Zhang, Z. A Study of Sentence Similarity Based on the All-minilm-l6-v2 Model with “Same Semantics, Different Structure” After Fine Tuning. In Proceedings of the 2024 2nd International Conference on Image, Algorithms and Artificial Intelligence (ICIAAI 2024), Singapore, 9–11 August 2024; Atlantis Press: Dordrecht, The Netherlands, 2024; pp. 677–684. [Google Scholar]
  52. Wawrzik, F.; Plaue, M.; Vekariya, S.; Grimm, C. Customized Information and Domain-centric Knowledge Graph Construction with Large Language Models. arXiv 2024, arXiv:2409.20010. [Google Scholar] [CrossRef]
  53. Hakdağlı, Ö. Hybrid Question-Answering System: A FAISS and BM25 Approach for Extracting Information from Technical Document. Orclever Proc. Res. Dev. 2024, 5, 226–237. [Google Scholar] [CrossRef]
  54. Moreira, G.D.S.P.; Ak, R.; Schifferer, B.D.; Xu, M.; Osmulski, R.; Oldridge, E. Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG. arXiv 2024, arXiv:2409.07691. [Google Scholar]
Figure 1. Research methodology framework.
Figure 2. RAG-Augmented LLM workflow.
Figure 3. Lexical similarity score comparison across baseline LLMs and RAG-Augmented LLMs.
Figure 4. Semantic similarity score comparison across baseline LLMs and RAG-Augmented LLMs.
Figure 5. Comparison of Llama 3.1 8B to RAG-Augmented Mistral 7B.
Figure 6. Comparison of Llama 1 13B to RAG-Augmented Mistral 7B.
Figure 7. Comparison of baseline Llama 1 13B to RAG-Augmented Llama 3.1 8B.
Table 1. Lexical and semantic evaluation metrics used in the study.
Category | Metric | Description
Lexical | BLEU | Measures n-gram precision (up to 4-grams) between candidate and reference texts; emphasizes exact token overlap.
Lexical | ROUGE-1 | Calculates unigram recall (proportion of individual words in the reference that appear in the candidate).
Lexical | ROUGE-2 | Computes bigram recall; captures local fluency by assessing short-phrase overlaps.
Lexical | ROUGE-L | Evaluates the longest common subsequence recall; rewards in-order matches, indicating structural alignment.
Semantic | BERT Precision | Semantic precision based on cosine similarity between generated and reference tokens in BERT embedding space.
Semantic | BERT Recall | Measures the extent to which the candidate text semantically covers the reference in the embedding space.
Semantic | BERT F1 | Harmonic mean of BERT Precision and Recall, reflecting overall semantic similarity.
Table 2. A summary of baseline Large Language Models (LLMs) used in the study.
LLM | Parameters | Min VRAM | Strengths | Limitations
TinyLlama | 1.1 B | 1 GB | Lightweight and optimized for edge deployment | Limited language understanding and generation capacity
Mistral 7B | 7.3 B | 7 GB | High performance with efficient inference | Smaller parameter size may limit expressiveness in complex tasks
Llama 3.1 8B | 8 B | 8 GB | Enhanced reasoning capabilities and modern architecture | Relatively smaller size compared to high-end LLMs
Llama 1 13B | 13 B | 13 GB | Strong few-shot learning performance | Based on older architecture with fewer optimizations
Table 3. Improvement of RAG-Augmented LLMs over their respective baseline LLMs.
Category | Evaluation Metric | Mistral | TinyLlama | Llama 3.1 8B | Llama 1 13B
Lexical | BLEU Score | 60.50% | 27.70% | 216.90% | 80.50%
Lexical | ROUGE-L | 28.00% | 30.50% | 145.10% | 18.20%
Semantic | BERT Recall | 2.70% | 2.30% | 1.80% | 0.70%
Semantic | BERT F1 Score | 2.80% | 1.80% | 1.20% | 0.45%
Table 4. Performance improvement of RAG-Augmented Mistral 7B over Llama 3.1 8B.
Category | Metric | RAG-Augmented Mistral 7B | Baseline Llama 3.1 8B | Improvement (%)
Lexical | BLEU | 0.06431 | 0.00819 | 684.80
Lexical | ROUGE-1 | 0.28419 | 0.04915 | 478.20
Lexical | ROUGE-2 | 0.12181 | 0.02272 | 436.20
Lexical | ROUGE-L | 0.22720 | 0.04370 | 419.90
Semantic | BERT Precision | 0.85767 | 0.76887 | 11.50
Semantic | BERT Recall | 0.89479 | 0.87057 | 2.80
Semantic | BERT F1 | 0.87551 | 0.81923 | 6.90
Table 5. Performance improvement of RAG-Augmented Mistral 7B over baseline Llama 1 13B.
Category | Metric | RAG-Augmented Mistral 7B | Baseline Llama 1 13B | Improvement (%)
Lexical | BLEU | 0.06431 | 0.01722 | 273.50
Lexical | ROUGE-1 | 0.28419 | 0.08463 | 235.80
Lexical | ROUGE-2 | 0.12181 | 0.03551 | 243.00
Lexical | ROUGE-L | 0.22720 | 0.07529 | 201.80
Semantic | BERT Precision | 0.85767 | 0.78000 | 10.00
Semantic | BERT Recall | 0.89479 | 0.85584 | 4.60
Semantic | BERT F1 | 0.87551 | 0.80730 | 8.40
Table 6. Performance improvement of RAG-Augmented Llama 3.1 8B over baseline Llama 1 13B.
Category | Metric | RAG-Augmented Llama 3.1 8B | Baseline Llama 1 13B | Improvement (%)
Lexical | BLEU | 0.02597 | 0.01722 | 50.84
Lexical | ROUGE-1 | 0.12596 | 0.08463 | 48.84
Lexical | ROUGE-2 | 0.05229 | 0.03551 | 47.25
Lexical | ROUGE-L | 0.10712 | 0.07529 | 42.28
Semantic | BERT Precision | 0.77903 | 0.78000 | −0.12
Semantic | BERT Recall | 0.88599 | 0.85584 | 3.52
Semantic | BERT F1 | 0.82919 | 0.80730 | 2.71
