Article

LLM Selection and Vector Database Tuning: A Methodology for Enhancing RAG Systems

Department of Information Systems, Kielce University of Technology, 7 Tysiąclecia Państwa Polskiego Ave., 25-314 Kielce, Poland
Appl. Sci. 2025, 15(20), 10886; https://doi.org/10.3390/app152010886
Submission received: 22 September 2025 / Revised: 30 September 2025 / Accepted: 9 October 2025 / Published: 10 October 2025

Abstract

With the increasing popularity of large language models (LLMs), retrieval-augmented generation (RAG) systems are gaining importance, enabling the use of internal company data to generate precise and relevant responses. The aim of this study was to develop a comprehensive methodology for measuring and optimizing RAG systems, focusing on analyzing the impact of key parameters such as chunk size, vector embedding models, and LLM selection on system effectiveness. Experiments were conducted on a RAG system using a large biographical dataset, stored in a Qdrant vector database, allowing for in-depth analysis in the context of long text data. The results indicated that optimizing RAG systems necessitates considering various factors, including LLM context window size, computational power, and processing costs. The selection of optimal parameters and LLM is a trade-off between response quality, computational cost, and hardware limitations. This study provides practical guidance for engineers and researchers working on improving RAG-based systems, enabling informed decisions regarding RAG system configuration in various business contexts.

1. Introduction

In recent years, large language models (LLMs) have revolutionized business approaches, introducing new possibilities in process automation, data analysis, and customer interaction. Thanks to advanced natural language processing algorithms, LLMs can generate answers to questions, create personalized recommendations, and support users in decision-making. In various fields, such as medicine, law, and education, LLMs offer tools that can significantly streamline these processes by providing precise and relevant information.
However, their effectiveness is often limited by the lack of access to internal company data, which prevents the full utilization of these technologies in the context of specific organizational needs and data. This limitation hinders the ability of organizations to leverage LLMs for tasks requiring specific internal knowledge, leading to inaccurate or irrelevant responses, reduced efficiency, and missed opportunities for data-driven decision-making [1,2,3].
The solution to this problem is the application of retrieval-augmented generation (RAG) technology, which enables the integration of external data sources with the company’s internal information resources. RAG combines the text generation capabilities of language models with access to external and internal data sources, allowing for the use of specific organizational information resources to generate more precise and relevant information [4,5]. RAG optimization can significantly improve the accuracy and efficiency of systems, providing contextually relevant responses based on unique company data [6].
Despite the proliferation of RAG optimization guides and frameworks, a critical gap remains, as these resources often address the selection and tuning of components (e.g., chunk size, embedding models, LLMs) in isolation. There is a lack of a unified, comprehensive methodology that systematically evaluates and optimizes the combined impact of vector database configuration (via chunking and embedding selection) and LLM selection on an RAG system’s effectiveness, particularly in practical settings involving long-form internal documents. Addressing this research gap, this article presents a novel and comprehensive methodology for measuring and optimizing RAG systems. The work focuses on analyzing the synergistic impact of key parameters, including chunk size, vector embedding models, and LLM selection, on overall system effectiveness. The study is based on an experimental RAG system using a large biographical dataset, allowing for in-depth analysis and optimization in the context of long text data. The article details the system architecture, the response generation process, and the evaluation methodology, including the use of LLMs to evaluate results, thereby providing practical guidance on selecting optimal, integrated parameters (chunk size, vector embedding models, LLMs) for improved accuracy and efficiency, particularly within the context of long text data and internal company knowledge.
The structure of this article is as follows: Section 2 presents a review of related work in the field of retrieval-augmented generation (RAG) systems. Section 3 describes the dataset used in the study. Section 4 presents the architecture of the research environment. Section 5 discusses the methodology for measuring the effectiveness of the RAG system. Section 6 describes the methodology for optimizing the parameters of the RAG system. Section 7 analyzes the results of the experiments conducted. Section 8 summarizes the key findings of the research. Section 9 discusses the limitations of the work, and Section 10 proposes directions for future research.

2. Related Work

In recent years, we have observed a dynamic development of retrieval-augmented generation (RAG) technology, which is reflected in the growing number of scientific publications. A key element of these systems is vector databases, which enable efficient storage and retrieval of semantic information. In the era of large datasets, the optimization of vector database algorithms becomes crucial [7]. Researchers are analyzing various approaches to approximate nearest neighbor search (ANNS), such as hashing, tree-based, graph-based, and quantization methods. Hashing methods offer speed at the cost of accuracy, while tree-based methods provide better accuracy but are more computationally demanding [8]. In the context of specific applications, such as the analysis of atmospheric motion vector data, precise processing and evaluation of vector data in large-scale data processing systems are crucial [9].
Vector database management techniques constitute another significant area of research, focusing on the challenges associated with feature vector processing, such as the ambiguous concept of semantic similarity, large vector sizes, and the difficulty in handling hybrid queries. These studies provide an overview of query processing, storage and indexing, query optimization, and existing vector database systems, including vector quantization and data compression techniques [10]. The issue of software testing in the context of vector databases cannot be overlooked, where generating test data, defining patterns of correct results, and evaluating tests are crucial for ensuring greater reliability and trust in vector database systems [11]. In practical applications, such as question-answering systems in science, vector databases are used to create reliable information systems that provide verifiable answers, e.g., by using RAG techniques to retrieve information from scientific publications [12].
Optimizing vector databases is crucial, but a comprehensive view of RAG systems, considering the interactions and optimization of other components such as large language models (LLMs), is equally important. Research shows that simple instructions with task demonstrations significantly improve the performance of LLMs in knowledge extraction [13]. Various RAG methods are analyzed, examining the impact of vector databases, embedding models, and LLMs. These studies emphasize the importance of balancing context quality with similarity-based ranking methods, as well as understanding the trade-offs between similarity scores, token consumption, execution time, and hardware utilization [2]. These works provide practical guidance on the implementation, optimization, and evaluation of RAG, including the selection of appropriate retriever and generator models [14].
Research also focuses on analyzing the impact of various components and configurations in RAG systems, such as language model size, prompt design, document fragment size, knowledge base size, search step, query expansion techniques, and multilingual knowledge bases [15]. In the context of online applications, the importance of filtering less relevant documents using small LLMs and decomposing generation tasks into smaller subtasks is emphasized. The effectiveness of prompt engineering and fine-tuning is also highlighted as crucial for improving response quality [16,17]. Techniques for fine-tuning LLMs in closed-domain applications are analyzed, comparing the effectiveness of various optimization methods and providing valuable information on adapting LLMs to specific domain requirements [18].
While the existing literature, including recent surveys and tuning frameworks (e.g., [2,14]), offers detailed insights into individual RAG components, such as vector database search algorithms or LLM prompting techniques, it often lacks an integrated perspective. Specifically, a systematic methodology for simultaneously tuning and evaluating the retriever–generator dependency, where the vector database parameters (chunking strategy, embedding model) are jointly optimized with the LLM selection for a given internal data scenario, remains underdeveloped. This article addresses this deficiency by presenting a comprehensive methodology that enables the evaluation and improvement of RAG systems by analyzing the combined impact of key parameters, such as chunk size, vector embedding models, and LLM selection, on system effectiveness. This unique approach, focusing on the practical, interconnected aspects of RAG implementation in a corporate environment with long text data, constitutes a significant contribution to the development of this field, offering valuable guidance for engineers and researchers working on improving RAG-based systems.

3. Research Dataset

In this study, a retrieval-augmented generation (RAG) system utilizing a large biographical dataset was developed to present and analyze the methodology for measuring and optimizing RAG systems in the context of vector databases and large language models (LLMs).
The study used a publicly available biographical dataset in CSV format from Kaggle, specifically the PII (Personally Identifiable Information) External Dataset, comprising 4434 rows [19]. Each row contains fictional person data generated by LLMs. Two key columns were utilized: “biographical essay” (content) and “full name” (metadata). The average length of the biographical essays is 352 tokens (ranging from 96 to 607) and 1975 characters (ranging from 498 to 3769). Example snippets of the data are provided in Appendix A, Table A1.
The use of a fictional biographical dataset provides a controlled environment for evaluating the RAG methodology. This approach allows for systematic variation in parameters and evaluation of results without the complexities and privacy concerns associated with real-world, sensitive data.

4. Research Environment

Figure 1 illustrates the architecture of the RAG system, comprising Python 3.12.2 programs and external components:
  • RAG Ingestor. A Python program used to load data from CSV files into the vector database.
  • RAG Assistant. A Python program providing an API through which user queries are passed. Vector embedding models used: bge-small-en-v1.5 [20], bge-base-en-v1.5 [21], bge-large-en-v1.5 [22].
  • RAG Evaluator. A Python program enabling system evaluation based on question sets and expected answers.
  • Vector DB. Qdrant vector database [23] in the Qdrant Cloud environment [24] storing biographical data and returning query results.
  • Answer LLM. LLM models: Gemini-2.0-flash [25], Gemini-2.0-flash-lite [25], Mistral-saba-24b [26], Gemma2-9b-it [27] from Google AI Studio [28] and Groq Cloud [29], generating answers based on the Vector DB query results.
  • Evaluate LLM. The Gemini-2.0-flash model from Google AI Studio, evaluating Vector DB and Answer LLM responses.
  • Relational DB. PostgreSQL relational database [30], storing the Vector DB results, subsequently passed to the Answer LLM. This database also stores experiment results.
To transform raw data into searchable vectors and generate answers, the RAG system employs a structured data processing flow. Key steps include chunk extraction, where biographical essays are divided into chunks (10, 25, 50, or 100 tokens) with proportional overlap, followed by chunk indexing using the selected embedding model. The chunk retrieval process utilizes a hybrid search, combining vector similarity with metadata filtering to ensure complete essays are retrieved for the relevant persons found.
A detailed description of the data processing flow, including the specific chunking and retrieval logic, is provided in Appendix A. The configuration options for the RAG programs are summarized in Appendix A, Table A2.
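To make the retrieval step more concrete, the following is a minimal sketch of the hybrid search described above, assuming the qdrant-client Python library; payload keys such as full_name and text, and the reassembly logic, are illustrative rather than the exact implementation used in this study.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

def hybrid_retrieve(client: QdrantClient, collection: str,
                    query_vector: list[float], k_limit: int = 5) -> dict[str, str]:
    """Vector search over chunks, then metadata filtering to reassemble complete essays."""
    # Step 1: similarity search over chunk embeddings (cosine similarity), limited by k-limit.
    hits = client.search(collection_name=collection, query_vector=query_vector, limit=k_limit)

    essays: dict[str, str] = {}
    for hit in hits:
        full_name = hit.payload["full_name"]  # person found via the matching chunk (assumed key)
        if full_name in essays:
            continue
        # Step 2: metadata query returning all chunks stored for that person.
        points, _ = client.scroll(
            collection_name=collection,
            scroll_filter=Filter(must=[FieldCondition(key="full_name",
                                                      match=MatchValue(value=full_name))]),
            limit=1000,
        )
        # Combine the person's chunks into a single essay passed on to the answer LLM.
        essays[full_name] = " ".join(p.payload["text"] for p in points)
    return essays
```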
RAG Ingestor retrieves data from CSV files and loads it into collections in the Vector DB. When loading data, it is necessary to specify: the collection name, vector model, and chunk size for text data. This allows for the creation of various collections with different chunk sizes and/or vector models. Example collections are presented in Table 1.
The RAG assistant provides an API through which user queries are passed and responses are returned. To provide a response, the user query is first passed to the vector DB, and then the received results are passed to the answer LLM to generate the final response. Listing 1 shows an example prompt that is passed to the answer LLM to obtain the final response. In this prompt, the vector_text_result parameter is populated with the response from the vector DB, and the user_query parameter is populated with the question from the questions.csv file.
The RAG evaluator retrieves user questions from the questions.csv file, which contains a set of questions and expected answers for both the vector DB and the answer LLM. Table 2 presents examples of questions and answers subject to evaluation.
Listing 1. Prompt for the answer LLM.
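As an illustration of how the RAG assistant can assemble the final prompt from the vector DB result and the user question, a minimal sketch is shown below; the wording is illustrative and does not reproduce the exact prompt of Listing 1.

```python
# Illustrative template only; the actual Listing 1 prompt wording differs.
PROMPT_TEMPLATE = (
    "Answer the user question using only the context below.\n"
    "Answer briefly, e.g., with a full name.\n\n"
    "Context:\n{vector_text_result}\n\n"
    "Question: {user_query}\n"
)

def build_answer_prompt(vector_text_result: str, user_query: str) -> str:
    """Fill the template with the vector DB response and the question from questions.csv."""
    return PROMPT_TEMPLATE.format(vector_text_result=vector_text_result, user_query=user_query)
```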
The Python programs for vectorization, evaluation, and RAG system orchestration (the RAG ingestor, the RAG assistant, and the RAG evaluator) were executed on a local computer running Windows 11 Pro with an Intel(R) Core(TM) i5-9300H CPU @ 2.40 GHz (4 cores/8 threads) and 16.0 GB of RAM. All core LLM and vector DB tests, however, were conducted in cloud environments, specifically Qdrant Cloud [24], Google AI Studio [28], and Groq Cloud [29]. The vectorization phase requires high computational power, especially when the bge-large-en-v1.5 model is employed; this phase was performed on the defined compute instance.

5. Measurement Methodology

The evaluation of the accuracy of responses from the vector DB is based on the assumption that if, for a given question from the questions.csv file (Table 2), documents (biographical essays) are returned that contain information consistent with the “vector expected answer”, the response is considered correct; otherwise, it is considered incorrect. To automate this process, the evaluation LLM based on the Gemini-2.0-flash model was used in the RAG evaluator. Table 3 presents example responses with their accuracy assessment, and Listing 2 shows the prompt that is sent to the evaluation LLM. In this prompt, the vector_text_result parameter contains the response from the Vector DB, and the vector_expected_answer parameter contains the vector expected answer from the questions.csv file.
When evaluating the final response of the RAG system, it is also necessary to consider the answer LLM, which is responsible for paraphrasing the results returned by the vector DB, i.e., providing the final user response. The evaluation of the accuracy of the answer LLM response is performed analogously to the evaluation of the vector DB response. If the answer LLM response contains information consistent with the “LLM expected answer” from the questions.csv file (Table 2), it is considered correct; otherwise, it is considered incorrect. To automate this process, the evaluation LLM was also used. Table 4 presents example responses from the answer LLM with their accuracy assessment, and Listing 3 shows the prompt that is sent to the evaluation LLM.
Listing 2. Prompt for evaluation of the Vector DB results.
Listing 3. Prompt for evaluation of LLM results.
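A minimal sketch of this automated evaluation step is given below, assuming a generic evaluate_llm callable wrapping the Gemini-2.0-flash judge; the prompt wording is illustrative and does not reproduce Listings 2 and 3. The evaluation of the answer LLM follows the same pattern with the llm_expected_answer field.

```python
from typing import Callable

def judge_vector_result(evaluate_llm: Callable[[str], str],
                        vector_text_result: str, vector_expected_answer: str) -> int:
    """Ask the evaluation LLM whether the retrieved documents contain the expected answer."""
    prompt = (
        "Decide whether the retrieved documents contain information consistent with the "
        "expected answer. Reply with 1 (correct) or 0 (incorrect).\n\n"
        f"Retrieved documents:\n{vector_text_result}\n\n"
        f"Expected answer:\n{vector_expected_answer}\n"
    )
    reply = evaluate_llm(prompt)
    return 1 if reply.strip().startswith("1") else 0

def accuracy(scores: list[int]) -> float:
    """Accuracy in %, as reported in Tables 6 and 7 (300 questions per experiment)."""
    return 100.0 * sum(scores) / len(scores)
```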

6. Optimization Methodology

In the vector database domain, at the collection loading stage (Figure 2: Step 1–3), the parameters influencing search effectiveness are the chunk size and the vector embedding model. These parameters are crucial because they cannot be changed after the collection is loaded. Therefore, tests should be conducted for various collection variants, differing in chunk size and vector embedding models, using prepared files with questions and expected answers. Tests are conducted to determine the optimal chunk size that balances semantic coherence and retrieval precision, as well as to evaluate the impact of different vector embedding models on the accuracy of similarity-based searches.
In the context of tuning vector database search (Figure 2: Step 4), the k-limit parameter can be manipulated to achieve an optimal number of results in terms of accuracy, while avoiding excessive size that could negatively impact system response time. For example, increasing the k-limit from 5 to 10 can improve effectiveness. The k-limit parameter can be selected in various ways, e.g., depending on the collection size. However, it should be noted that in RAG systems, the k-limit parameter is constrained by the context window of the answer LLM model, which generates the final response. The maximum k-limit can be calculated using Formula (1):
max-k-limit = floor((llm-ctx-win-tokens − (prompt-tokens + question-tokens)) / largest-doc-tokens)    (1)
where
  • max-k-limit is the maximum allowable number of retrieved documents (the upper bound for the k-limit parameter);
  • llm-ctx-win-tokens is the maximum number of tokens accepted by the model;
  • prompt-tokens is the number of tokens reserved for the prompt;
  • question-tokens is the number of tokens reserved for the user question;
  • largest-doc-tokens is the number of tokens in the largest document loaded into the vector database.
This formula calculates the maximum k-limit by dividing the remaining context window tokens (after accounting for the prompt and the question) by the size of the largest document, ensuring that all retrieved documents fit within the LLM’s context window. Table 5 presents example k-limit calculations for LLM models with different context windows, based on the biographical dataset. Assuming that 50 tokens are reserved for the prompt and 50 for the question, and that the largest document contains 607 tokens, the maximum k-limit for the Gemma2-9b-it model with the smallest context window is 13. In this case, it is recommended to maintain a reserve and set the k-limit to a lower value, e.g., 10. A reserve may be necessary if documents grow larger (e.g., to 800 tokens) or if the prompt and question sizes increase. For the Mistral-saba-24b model, the maximum k-limit is 53, and for Gemini-2.0-flash and Gemini-2.0-flash-lite, it is 1727. Thus, for the latter two, costs, rather than the context window size, are the limitation.

On the Google AI Studio platform, the cost of 1 million tokens is $0.10 for Gemini-2.0-flash and $0.075 for Gemini-2.0-flash-lite [31]. On the Groq Cloud platform, for a developer account, token limits per minute apply: 500,000 for Mistral-saba-24b and 30,000 for Gemma2-9b-it [32]. Therefore, it is recommended to restrict the k-limit to reasonable values in the range of 5 to 20. Due to the costs and limits associated with using LLM models, it is worth considering several models that meet the requirements, which can result in savings. For example, choosing the Gemini-2.0-flash-lite model instead of Gemini-2.0-flash results in a 25% cost reduction. This cost difference becomes highly relevant in production environments: when processing 1,000,000 user queries, assuming an average of 5000 input tokens per query (context + prompt + question), the total input volume is 5 billion tokens. The input cost for Gemini-2.0-flash would be $500.00, compared to $375.00 for Gemini-2.0-flash-lite, confirming the 25% savings in a high-volume scenario. It is therefore crucial to test various LLM models and select the most suitable one. The k-limit itself is optimized through iterative testing, starting with a conservative value (e.g., 5) and gradually increasing it until diminishing returns in accuracy are observed, while monitoring the impact on response time.
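For reference, the calculation in Formula (1) can be expressed as a short helper function; the values below reproduce the Table 5 examples (50 tokens reserved for the prompt, 50 for the question, and a largest document of 607 tokens).

```python
def max_k_limit(llm_ctx_win_tokens: int, prompt_tokens: int,
                question_tokens: int, largest_doc_tokens: int) -> int:
    """Maximum number of retrieved documents that fit in the LLM context window (Formula (1))."""
    return (llm_ctx_win_tokens - (prompt_tokens + question_tokens)) // largest_doc_tokens

print(max_k_limit(8192, 50, 50, 607))       # Gemma2-9b-it            -> 13
print(max_k_limit(32768, 50, 50, 607))      # Mistral-saba-24b        -> 53
print(max_k_limit(1048576, 50, 50, 607))    # Gemini-2.0-flash(-lite) -> 1727
```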
Thanks to the ability to configure chunk size, vector model, k-limit, and LLM model in the RAG system, a wide range of experiments and optimizations (Figure 2: Step 5) can be conducted. Specifically, an examination can be conducted into the following:
1. The impact of different chunk sizes on the quality of vector search results.
2. The impact of different vector embedding models on the quality of vector search results.
3. The impact of changing the k-limit on the results.
4. The effectiveness of different LLM models in generating responses based on vector database results.
In summary, the research methodology is based on creating a configurable experimental environment that enables systematic measurement and optimization of the RAG system in the context of biographical data. The use of a publicly available dataset, modular software architecture, and the ability to precisely configure key parameters allow for comprehensive research and the acquisition of valuable insights regarding the optimization of RAG systems.
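As an illustration of the experimental grid implied by these configuration options, the following minimal sketch enumerates the 48 vector search experiments reported in Section 7; the surrounding evaluation harness is assumed, not shown.

```python
from itertools import product

# Configuration values taken from Section 7.
VECTOR_MODELS = ["bge-small-en-v1.5", "bge-base-en-v1.5", "bge-large-en-v1.5"]
CHUNK_SIZES = [10, 25, 50, 100]
K_LIMITS = [5, 10, 15, 20]

# 3 vector models x 4 chunk sizes x 4 k-limit values = 48 experiments.
experiments = [
    {"vector_model": model, "chunk_size": size, "k_limit": k}
    for model, size, k in product(VECTOR_MODELS, CHUNK_SIZES, K_LIMITS)
]
print(len(experiments))  # 48
```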

7. Results and Analysis

The purpose of the experiments conducted was to compare different RAG system configurations to select optimal operating parameters for the biographical dataset. The accuracy metric, expressed as a percentage, was used to evaluate the effectiveness of responses from both the vector DB and the answer LLM.
In the first phase, the correctness of document retrieval in the vector DB was measured. Each individual experiment involved running 300 questions, selected using a stratified random sampling method to ensure that the evaluation set is representative of the full range of biographical topics contained within the PII external dataset [19]. This large sample size provides the statistical basis for the reported results; the result for each configuration is reported as the final observed accuracy from this single, comprehensive run. A total of 48 experiments were conducted with various configuration variants, including the following:
1. Vector models: bge-small-en-v1.5, bge-base-en-v1.5, bge-large-en-v1.5.
2. Chunk sizes: 10, 25, 50, 100.
3. k-limit values: 5, 10, 15, 20.
The results presented in Table 6 and Figure 3 indicate that the vector model bge-large-en-v1.5 achieved the best results with a chunk size of 25, obtaining 99.33% accuracy for k-limit 15 and 20, and 96.67% for k-limit 10. The bge-small-en-v1.5 model also showed very good results for a chunk size of 25, achieving 97.67% accuracy for k-limit 15 and 20, and 96.00% for k-limit 10. The minimal observed performance difference (less than 2 percentage points) between the top bge-large and bge-small configurations suggests a high degree of empirical stability across these large sample runs. It is worth noting that the bge-base-en-v1.5 model achieved the best results for a chunk size of 50, which differs from the other vector models. It should be remembered that the bge-small-en-v1.5 model requires significantly less computational power than bge-large-en-v1.5, especially when loading data into the vector database collection. In the case of frequent collection loading, it is worth considering the configuration bge-small-en-v1.5, chunk size 25, and k-limit 15 or k-limit 10 with a smaller context window.
In the second phase, the correctness of the final RAG system response, generated by the answer LLM model based on the vector DB results, was measured. Each experiment involved passing 300 questions along with the vector DB results to the answer LLM, and the obtained response was compared with the expected response (“llm_expected_answer”) from the questions.csv file. A total of 14 experiments were conducted with various RAG configurations, covering both LLM models and vector search parameters, including the following:
1. LLM models: Gemini-2.0-flash, Gemini-2.0-flash-lite, Mistral-saba-24b, Gemma2-9b-it.
2. Vector search configurations (vector model/chunk size/k-limit): small/25/10, small/25/15, large/25/10, large/25/15.
The results presented in Table 7 and Figure 4 indicate that for the vector search configuration small/25/10, the LLM model Gemini-2.0-flash achieved the best result (95.33% accuracy), surpassing the Gemini-2.0-flash-lite model by 0.33 percentage points. The Mistral-saba-24b model obtained a result of 89.67%, which is 5.66 percentage points lower than the Gemini-2.0-flash model, and the Gemma2-9b-it model achieved 90.33%, which is 5 percentage points lower. In the large/25/10 configuration, the Gemma2-9b-it model achieved 91.67%, again 5 percentage points lower than the Gemini-2.0-flash model. It is worth noting that the least costly vector search configuration, small/25/10 with the Gemma2-9b-it model (90.33%), is 8.34 percentage points lower than the best configuration, large/25/15 with the Gemini-2.0-flash model (98.67%).

If the goal is to select the model with the highest accuracy, the vector search configuration large/25/15 with the Gemini-2.0-flash model (98.67%) should be chosen. If the goal is to reduce LLM costs by 25%, it is worth considering the Gemini-2.0-flash-lite model (97.67%); the observed difference of 1 percentage point between the top configurations suggests that Gemini-2.0-flash-lite offers high reliability at lower operating costs. However, if the goal is to find the least computationally expensive configuration, the small/25/10 configuration with the Gemma2-9b-it model (90.33%) should be considered. If the system must run in an on-premise environment where only open models such as Mistral-saba-24b and Gemma2-9b-it are available, the best choice is the large/25/10 configuration with the Gemma2-9b-it model (91.67%); with lower computational power, the small/25/10 configuration with the Gemma2-9b-it model (90.33%) should be chosen.

8. Conclusions

The iterative optimization methodology, detailed in the previous sections, yielded clear, actionable guidance on selecting RAG configuration based on varying project priorities. These core takeaways, distilled from the experimental trade-offs between accuracy, computational cost, and hardware constraints, are summarized in Table 8.
The experiments conducted in this article provided valuable insights into the optimization of RAG systems, particularly in the context of biographical data. The research demonstrated that the selection of optimal parameters, such as chunk size, vector model, and LLM model, is crucial for system effectiveness. However, as the experiments showed, this selection is not straightforward and depends on many factors that need to be considered.
First and foremost, the context window size of the LLM model plays a significant role in determining the maximum number of chunks that can be passed to the model to generate a response. In the case of models with a limited context window, such as Gemma2-9b-it, it is necessary to use a smaller k-limit parameter, which may affect the quality of the responses. Conversely, models with a larger context window, such as Gemini-2.0-flash, allow for the processing of more information, potentially leading to more precise responses.
Another important factor is the computational power required for data processing. Larger vector models, such as bge-large-en-v1.5, offer better vector embedding quality but require more computational power, especially when loading data into the vector database. In the case of frequent data loading, it may be more efficient to use smaller models, such as bge-small-en-v1.5, which offer a compromise between quality and computational efficiency.
The costs associated with using LLM models cannot be overlooked either. Models with higher response generation quality, such as Gemini-2.0-flash, come with higher per-token processing costs. With a limited budget, it is worth considering lower-cost models, such as Gemini-2.0-flash-lite, which offer similar response quality at lower costs.

Furthermore, organizations deploying RAG must account for the total cost of ownership (TCO), including several often-overlooked expenses. These include the significant investment in specialized human resources, such as ML engineers and data scientists, required for the initial development cycle, parameter tuning, and subsequent system deployment. The TCO also covers long-term maintenance costs, which encompass ongoing operational expenses such as frequent data re-loading (re-embedding and re-indexing) to ensure knowledge freshness, as well as the costs of monitoring and scaling the infrastructure. It is important to recognize that optimizing for low LLM operating costs (e.g., Gemini-2.0-flash-lite) may require a trade-off involving higher initial development and specialized tuning investment. Additionally, in on-premise environments where open models are preferred, the choice of LLM model may be limited to models such as Mistral-saba-24b or Gemma2-9b-it, which also affects the final configuration of the RAG system.
In summary, the selection of optimal parameters and LLM models in RAG systems is a multidimensional process that requires considering many factors, such as context window size, computational power, processing costs, and specific application requirements. The experiments conducted provide practical guidance that can help in making informed decisions regarding the configuration of RAG systems in various business contexts. Optimizing RAG systems is a trade-off between response quality, computational cost, and hardware limitations. It is worth emphasizing that the iterative nature of the optimization process is essential to fine-tune the RAG system to achieve optimal performance for specific datasets and use cases.

9. Limitations

While the study provided valuable insights, it is not without limitations that should be considered when interpreting the results. The experiments focused primarily on a single dataset of biographical data, and the fixed-size chunking strategy employed, optimized with overlap for the narrative structure of the biographical essays, may not be fully generalizable to other data types. For example, for technical documents (e.g., manuals), structural chunking should be employed instead of fixed chunk sizes to maintain the integrity of tables and sections. For medical data [33,34], it is crucial to use specialized embedding models and segmentation that keeps the clinical context (e.g., symptoms and diagnosis) within a single segment, minimizing the risk of losing critical information.
The study’s scope also limits the assessment of certain RAG components. The selection of the Qdrant vector database as the sole platform for storing and searching vectors limits the possibility of comparison with other available solutions, such as Pinecone, Weaviate, or Milvus, which may offer different indexing and optimization mechanisms. Similarly, limiting the study to selected LLM models (Gemini-2.0-flash, Gemini-2.0-flash-lite, Mistral-saba-24b, Gemma2-9b-it) does not allow for a full assessment of the potential of other models that may better handle specific tasks or data types.
Additionally, the fixed chunk sizes (10–100 tokens) employed, while effective for short biographical essays (max 607 tokens [35]), pose a limitation when scaling to longer, highly structured enterprise documents (e.g., technical reports over 1000 tokens). In such cases, fixed splitting often fragments semantically coherent units, which can negatively impact retrieval precision. Furthermore, the current study restricts the answer LLM output to a brief response (answer briefly, e.g., full name [36]), which is cost-optimal but prevents system application in complex business scenarios requiring detailed explanations or summaries from multiple sources. It should also be added that the tests were conducted in a specific computing environment, which could also have influenced the results.

10. Future Work

Based on the limitations identified, future work should focus on advancing token management and architectural scalability. This includes proposing flexible output token control for task-specific detail; developing adaptive chunking based on document type, i.e., investigating non-fixed segmentation strategies (e.g., semantic or structural chunking) [34] that adjust chunk size according to the document type (e.g., technical, legal), which is critical for preserving contextual coherence; and refining Formula (1) [14] for models with large context windows to account for variable-length (non-uniform) documents, maximizing the efficient use of the LLM context window.
In future research, it is worth considering several directions that can contribute to the further development and improvement of RAG systems. First and foremost, the scope of experiments should be expanded to include other data types to assess how different parameters and models handle diverse information. In particular, research on the application of RAG in specialized fields such as medicine or law can provide valuable insights. A recently published systematic review [37] demonstrates the significant performance improvements of retrieval-augmented generation (RAG) in biomedicine, emphasizing the importance of tailored clinical development guidelines for effective implementation.

It is also necessary to conduct systematic comparisons of various vector databases to identify best practices and optimal configurations for different applications. Comparative studies, such as those conducted by [8], can help identify key factors affecting the performance of vector databases in the RAG context. It is also worth investigating the impact of various data augmentation techniques, such as translation, paraphrasing, or synonym generation, on vector embedding quality and search effectiveness. These techniques, described in works such as [38], can significantly improve the ability of RAG systems to handle linguistic and semantic diversity. More advanced LLM response evaluation metrics that consider various aspects of response quality, such as coherence, fluency, and context consistency, should also be considered. Research on new LLM response evaluation metrics, such as that conducted by [39], can help develop more reliable and comprehensive evaluation methods. It is also important to investigate the impact of various data chunking strategies, such as semantic chunking or document structure-based chunking, on search result quality. Works such as [40] analyze the impact of different chunking strategies on RAG system performance.

Future research should also address aspects related to data security and privacy, especially in the context of business applications where data may be sensitive or confidential. Research on privacy protection techniques in RAG systems, such as that conducted by [41], can help develop secure and compliant solutions. It is also necessary to develop methods for the automatic selection of optimal RAG system parameters, which can significantly facilitate the deployment and maintenance of these systems. Automatic parameter learning techniques, described in works such as [42], can help optimize RAG systems without the need for manual tuning. Finally, it is worth exploring the possibility of integrating RAG systems with other tools and technologies, such as knowledge management systems, chatbots, or analytics platforms, to create comprehensive AI-based solutions.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Implementation Details

Appendix A.1. Dataset Snippets

Table A1. Example snippets of biographical data from the dataset used in the study.

Full Name | Biographical Essay
Aaliyah Popova | My name is Aaliyah Popova, and I am a jeweler with 13 years of experience. I remember a very unique and challenging project I had to work on last year. A customer approached me with a precious family heirloom—a diamond necklace that had been passed down through generations. (…)
Mieko Mitsubishi | As Mieko Mitsubishi, an account manager at a prominent software company, I encountered a unique challenge that required my utmost dedication and expertise to solve. The issue arose when one of our valued clients, a leading multinational corporation, reported inconsistencies in their data analytics platform. (…)
Arina Sun | My name is Arina Sun, and I’m a dental hygienist. I have the privilege of assisting individuals in maintaining excellent oral health. One memorable incident that stands out in my career occurred during a routine checkup. I noticed an abnormality in a patient’s X-ray. Upon closer examination, I discovered a small cavity forming between two molars. (…)
Ashok Ma | Hi, I’m Ashok Ma, a software developer at a tech company. Recently, I encountered a challenging issue that required my expertise. (…) I delved into the codebase, analyzing logs, and conducting rigorous testing to isolate the issue. I identified a memory leak in a specific module that caused the application to crash under certain conditions. (…)
Shen Ma | As a seasoned barber with 18 years of experience, I can vividly recall a job-related project that stands out as a testament to my dedication to the craft. I’m Shen Ma, and you can reach me at (48) 96198-7145 or via email at shenma@aol.gov. My passion for hair styling and grooming has been a driving force throughout my career. (…)

Appendix A.2. Detailed Data Processing Flow

The RAG system employs the following structured data processing flow:
  • Content ingestion. Reading the “biographical essay” and “full name” columns from the dataset CSV file, where each row represents data for one person.
  • Chunk extraction. Dividing biographical essays into chunks of a specified number of tokens, maintaining sentence integrity. Chunk sizes vary by experiment: 10, 25, 50, or 100 tokens. Each chunk is assigned a payload containing the text content of the “biographical essay” and metadata in the form of the person’s “full name”. Additionally, to preserve semantic continuity across chunk boundaries, an overlap mechanism was applied (a minimal code sketch of this step is provided after this list). The overlap size was proportional to the chunk size, specifically
    - 50% (5 tokens) for chunk size 10;
    - 40% (10 tokens) for chunk size 25;
    - 30% (15 tokens) for chunk size 50;
    - 20% (20 tokens) for chunk size 100.
    This ensured that important contextual information was not lost during chunking and improved the effectiveness of vector search and LLM response generation.
  • Chunk indexing. Converting chunks into multi-dimensional vectors representing their semantic meaning. Embedding models are used to convert data into vector representations in a multi-dimensional space. Each vector contains a payload with the “biographical essay” text and “full name” metadata.
  • Vector DB collection creation. Each vector DB collection is created with a specific chunk size and vector embedding model (e.g., chunk size 25, vector model bge-small-en-v1.5). Each chunk is stored with a payload containing the text representation of the data (document content and “full name” metadata). Each chunk’s representation in the vector database is a point containing: a unique identifier, vector representation, and payload with document content and “full name” metadata.
  • Chunk retrieval. Converting the user query into a vector representation using the same embedding model as the collection. The query vector is passed to the vector DB. Vector search: comparing the query vector with chunk vectors using cosine similarity. The number of returned results depends on the k-limit parameter (e.g., k-limit = 5 returns 5 chunks). For each returned chunk, the person’s “full name” is read. A metadata query is performed to find the remaining chunks for that person. Chunks are combined into a single entity for each person. Combining vector search with metadata filtering (hybrid search). Returning responses in the form of biographical essays with metadata, in a number consistent with the k-limit parameter.
  • Retrieved chunks evaluation. Biographical essays are evaluated by the Evaluate LLM, based on the expected vector response, using accuracy as the metric. The vector DB search results are stored in the Relational DB for further use in the Answer LLM response generation process.
  • Answer generation. Biographical essays for returned persons are passed to the Answer LLM, which generates a coherent and contextually relevant response to the user query.
  • LLM response evaluation. Responses returned by the answer LLM are evaluated by the evaluation LLM, based on the expected final response (LLM), using accuracy as the metric. This evaluation ensures the accuracy of the final responses before being returned to the user.
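A minimal sketch of the chunk extraction step with proportional overlap is shown below; it uses a whitespace token approximation, whereas the actual implementation uses model tokens and preserves sentence integrity.

```python
# Overlap sizes from Appendix A.2: 50%, 40%, 30%, and 20% of the chunk size.
OVERLAP_BY_CHUNK_SIZE = {10: 5, 25: 10, 50: 15, 100: 20}

def chunk_essay(essay: str, full_name: str, chunk_size: int) -> list[dict]:
    """Split a biographical essay into overlapping chunks carrying 'full_name' metadata."""
    overlap = OVERLAP_BY_CHUNK_SIZE[chunk_size]
    tokens = essay.split()                      # simplification: whitespace tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        text = " ".join(tokens[start:start + chunk_size])
        chunks.append({"text": text, "full_name": full_name})
    return chunks
```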

Appendix A.3. RAG System Configuration Parameters

The configuration options for all programs are presented in Table A2. For the RAG ingestor program, elements relevant to data loading are configured, such as collection name, chunk size, and vector embedding model. For the RAG assistant and RAG Evaluator, the collection name, which determines the vector embedding model, is specified, and the LLM model and k-limit are reconfigurable.
Table A2. Program configuration.

Parameter | RAG Ingestor | RAG Assistant | RAG Evaluator
Collection name | Yes | Yes | Yes
Chunk size | Yes | Not applicable | Not applicable
Vector model | Yes | Same as collection | Same as collection
LLM model | Not applicable | Yes | Yes
k-limit | Not applicable | Yes | Yes
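For illustration, the configuration options in Table A2 can be combined as follows; the parameter names and values are hypothetical examples rather than the exact configuration files used by the three programs.

```python
# Hypothetical configuration mirroring Table A2.
INGESTOR_CONFIG = {
    "collection_name": "biographical_dataset_small_25",
    "chunk_size": 25,
    "vector_model": "bge-small-en-v1.5",
}

ASSISTANT_CONFIG = {
    "collection_name": "biographical_dataset_small_25",  # the vector model follows the collection
    "llm_model": "gemini-2.0-flash",
    "k_limit": 10,
}

EVALUATOR_CONFIG = {**ASSISTANT_CONFIG, "questions_file": "questions.csv"}
```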

References

  1. Gupta, S.; Ranjan, R.; Singh, S.N. A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. arXiv 2024, arXiv:2410.12837. [Google Scholar]
  2. Şakar, T.; Emekci, H. Maximizing RAG efficiency: A comparative analysis of RAG methods. Nat. Lang. Process. 2025, 31, 1–25. [Google Scholar] [CrossRef]
  3. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997. [Google Scholar] [CrossRef]
  4. Seker, S.E. Editorial: Large language models in work and business. Front. Artif. Intell. 2024, 7, 1516832. [Google Scholar] [CrossRef]
  5. Bharathi Mohan, G.; Prasanna Kumar, R.; Vishal Krishh, P.; Keerthinathan, A.; Lavanya, G.; Meghana, M.K.U.; Sulthana, S.; Doss, S. An analysis of large language models: Their impact and potential applications. Knowl. Inf. Syst. 2024, 66, 5047–5070. [Google Scholar] [CrossRef]
  6. Ferraris, A.F.; Audrito, D.; Caro, L.D.; Poncibò, C. The architecture of language: Understanding the mechanics behind LLMs. Camb. Forum AI Law Gov. 2025, 1, e11. [Google Scholar] [CrossRef]
  7. Zhu, Z. Strategies for Improving Vector Database Performance through Algorithm Optimization. Sci. J. Technol. 2025, 7, 138–144. [Google Scholar] [CrossRef]
  8. Han, Y.; Liu, C.; Wang, P. A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge. arXiv 2023, arXiv:2310.11703. [Google Scholar] [CrossRef]
  9. Lee, E.; Todling, R.; Karpowicz, B.M.; Jin, J.; Sewnath, A.; Park, S.K. Assessment of Geo-Kompsat-2A Atmospheric Motion Vector Data and Its Assimilation Impact in the GEOS Atmospheric Data Assimilation System. Remote Sens. 2022, 14, 5287. [Google Scholar] [CrossRef]
  10. Pan, J.J.; Wang, J.; Li, G. Vector Database Management Techniques and Systems. In Proceedings of the Companion of the 2024 International Conference on Management of Data, SIGMOD/PODS ’24, Santiago, Chile, 9–14 June 2024; pp. 597–604. [Google Scholar] [CrossRef]
  11. Wang, S.; Zhao, Y.; Xie, Y.; Liu, Z.; Hou, X.; Zou, Q.; Wang, H. Towards Reliable Vector Database Management Systems: A Software Testing Roadmap for 2030. arXiv 2025, arXiv:2502.20812. [Google Scholar] [CrossRef]
  12. Ljajić, A.; Košprdić, M.; Bašaragin, B.; Medvecki, D.; Cassano, L.; Milošević, N. Scientific QA System with Verifiable Answers. arXiv 2024, arXiv:2407.11485. [Google Scholar] [CrossRef]
  13. Polat, F.; Tiddi, I.; Groth, P. Testing prompt engineering methods for knowledge extraction from text. Semant. Web 2024, 16, SW-243719. [Google Scholar] [CrossRef]
  14. Finardi, P.; Avila, L.; Castaldoni, R.; Gengo, P.; Larcher, C.; Piau, M.; Costa, P.; Caridá, V. The Chronicles of RAG: The Retriever, the Chunk and the Generator. arXiv 2024, arXiv:2401.07883. [Google Scholar] [CrossRef]
  15. Li, S.; Stenzel, L.; Eickhoff, C.; Bahrainian, S.A. Enhancing Retrieval-Augmented Generation: A Study of Best Practices. arXiv 2025, arXiv:2501.07391. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Peng, Y.; Tu, C.; Zhang, Z.; Yan, H.; Chen, C.; Ma, H.; Yang, J.; Zhang, Y.; Liao, R. Exploring Efficient Optimization Techniques in Online Retrieval-Augmented Generation Application. In Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries, JCDL ’24, Hong Kong, China, 17–20 June 2024; pp. 1–11. [Google Scholar] [CrossRef]
  17. Pawlik, L. How the Choice of LLM and Prompt Engineering Affects Chatbot Effectiveness. Electronics 2025, 14, 888. [Google Scholar] [CrossRef]
  18. Al-Zuraiqi, A.; Greer, D. Evaluating Alignment Techniques for Enhancing LLM Performance in a Closed-Domain Application: A RAG Bench-Marking Study. In Proceedings of the 2024 International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 18–20 December 2024; pp. 1170–1175. [Google Scholar] [CrossRef]
  19. PII|External Dataset. Available online: https://www.kaggle.com/datasets/alejopaullier/pii-external-dataset (accessed on 19 March 2025).
  20. BAAI/bge-small-en-v1.5 · Hugging Face. 2025. Available online: https://huggingface.co/BAAI/bge-small-en-v1.5 (accessed on 19 March 2025).
  21. BAAI/bge-base-en-v1.5 · Hugging Face. 2025. Available online: https://huggingface.co/BAAI/bge-base-en-v1.5 (accessed on 19 March 2025).
  22. BAAI/bge-large-en-v1.5 · Hugging Face. 2025. Available online: https://huggingface.co/BAAI/bge-large-en-v1.5 (accessed on 19 March 2025).
  23. Qdrant—Vector Database. Available online: https://qdrant.tech/ (accessed on 19 March 2025).
  24. Qdrant Cloud. Available online: https://cloud.qdrant.io/ (accessed on 19 March 2025).
  25. Gemini Models|Gemini API. Available online: https://ai.google.dev/gemini-api/docs/models (accessed on 19 March 2025).
  26. GroqCloud—Mistral Saba 24B. Available online: https://console.groq.com/docs/model/mistral-saba-24b (accessed on 21 March 2025).
  27. google/gemma-2-9b-it · Hugging Face. 2025. Available online: https://huggingface.co/google/gemma-2-9b-it (accessed on 19 March 2025).
  28. Google AI Studio. Available online: https://aistudio.google.com/ (accessed on 19 March 2025).
  29. GroqCloud. Available online: https://console.groq.com (accessed on 19 March 2025).
  30. PostgreSQL Global Development Group. PostgreSQL. 2025. Available online: https://www.postgresql.org/ (accessed on 20 March 2025).
  31. Gemini Developer API Pricing|Gemini API|Google AI for Developers. Available online: https://ai.google.dev/gemini-api/docs/pricing (accessed on 20 March 2025).
  32. GroqCloud—Rate Limits. Available online: https://console.groq.com/docs/rate-limits (accessed on 20 March 2025).
  33. Soman, S.; Roychowdhury, S. Observations on Building RAG Systems for Technical Documents. arXiv 2024, arXiv:2404.00657. [Google Scholar] [CrossRef]
  34. Sobhan, S.; Haque, M.A. LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation. arXiv 2025, arXiv:2506.23136. [Google Scholar] [CrossRef]
  35. Chunking in RAG: Strategies for Optimal Text Splitting. 2025. Available online: https://www.chitika.com/understanding-chunking-in-retrieval-augmented-generation-rag-strategies-techniques-and-applications/ (accessed on 29 September 2025).
  36. Retrieval Augmented Generation (RAG)—Nextra. Available online: https://www.promptingguide.ai/techniques/rag (accessed on 29 September 2025).
  37. Liu, S.; McCoy, A.B.; Wright, A. Improving large language model applications in biomedicine with retrieval-augmented generation: A systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inform. Assoc. 2025, 32, 605–615. [Google Scholar] [CrossRef] [PubMed]
  38. Soudani, H.; Kanoulas, E.; Hasibi, F. Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP 2024, Tokyo, Japan, 1–4 December 2024; pp. 12–22. [Google Scholar] [CrossRef]
  39. Sottana, A.; Liang, B.; Zou, K.; Yuan, Z. Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks. arXiv 2023, arXiv:2310.13800. [Google Scholar] [CrossRef]
  40. Zhong, Z.; Liu, H.; Cui, X.; Zhang, X.; Qin, Z. Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation. arXiv 2024, arXiv:2406.00456. [Google Scholar] [CrossRef]
  41. Koga, T.; Wu, R.; Chaudhuri, K. Privacy-Preserving Retrieval-Augmented Generation with Differential Privacy. arXiv 2025, arXiv:2412.04697. [Google Scholar] [CrossRef]
  42. Fu, J.; Qin, X.; Yang, F.; Wang, L.; Zhang, J.; Lin, Q.; Chen, Y.; Zhang, D.; Rajmohan, S.; Zhang, Q. AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation. arXiv 2024, arXiv:2406.19251. [Google Scholar] [CrossRef]
Figure 1. RAG system architecture.
Figure 2. RAG optimization methodology flow.
Figure 3. Comparison of the vector DB search accuracy across different configurations (results from Table 6).
Figure 4. Comparison of the Answer LLM response accuracy across different configurations (results from Table 7).
Table 1. Example vector DB collections.

Collection Name | Vector Model | Chunk Size
Biographical dataset small 10 | bge-small-en-v1.5 | 10
Biographical dataset small 25 | bge-small-en-v1.5 | 25
Biographical dataset small 50 | bge-small-en-v1.5 | 50
Biographical dataset small 100 | bge-small-en-v1.5 | 100
Biographical dataset base 50 | bge-base-en-v1.5 | 50
Biographical dataset large 100 | bge-large-en-v1.5 | 100
Table 2. Example questions and expected answers.

Question | Vector Expected Answer | LLM Expected Answer
Who is a jeweler with 13 years of experience? | Aaliyah Popova is a jeweler with 13 years of experience. | Aaliyah Popova
Who solved inconsistencies in a data analytics platform? | Mieko Mitsubishi solved inconsistencies in a data analytics platform. | Mieko Mitsubishi
Who discovered a small cavity forming between two molars? | Arina Sun discovered a small cavity forming between two molars. | Arina Sun
Who solved a memory leak issue in a software application? | Ashok Ma solved a memory leak issue in a software application. | Ashok Ma
Who is a seasoned barber with 18 years of experience? | Shen Ma is a seasoned barber with 18 years of experience. | Shen Ma
Table 3. Evaluation of the vector DB results for two example questions.

Question 1 | Who is a jeweler with 13 years of experience?
Vector DB expected answer | Aaliyah Popova is a jeweler with 13 years of experience.
Vector DB result, k-limit = 5 | Full name: Aaliyah Popova, Document: My name is Aaliyah Popova, and I am a jeweler with 13 years of experience. (…) Full name: Anna Pan, Document: My name is Anna Pan and I have been working as a jeweler for 13 years. (…) Full name: Kyle Fournier, Document: My name is Kyle Fournier, a jeweler with 17 years of experience in crafting exquisite pieces of adornment. (…) Full name: Yuze Peters, Document: Hi there, my name is Yuze Peters and I’m a jeweler. I have been working in this field for over 15 years. (…) Full name: Raphael Lambert, Document: With a heart filled with memories and a mind ablaze with experiences, I, Raphael Lambert. (…)
LLM evaluation | 1—correct
Question 2 | Who resolved a complex commercial dispute?
Vector DB expected answer | Baha Hoffman resolved a complex commercial dispute.
Vector DB result, k-limit = 5 | Full name: Hiroshi Smirnov, Document: My name is Hiroshi Smirnov. (…) One incident that stands out is the resolution of a high-profile commercial dispute between two multinational corporations. (…) Full name: Yoshie Zhu, Document: My name is Yoshie Zhu, and I work as a banker. In one particularly challenging case, I was tasked with solving a complex financial dispute between two multinational corporations. (…) Full name: Chih-Cheng Du, Document: Chih-Cheng Dufuour, a seasoned urologist renowned for his expertise. (…) Dr. Dufuour sought to develop a comprehensive approach to managing complex urinary tract infections (UTIs) (…) Full name: Peter Fournier, Document: As Peter Fournier, an attorney with one year of experience. (…) One of the most memorable projects I worked on was a complex contract negotiation for a major real estate transaction. (…) Full name: Angel Popova, Document: As an attorney with 19 years of experience, I have handled a wide range of complex legal matters. One project that stands out in my mind involved a high-profile corporate merger. My name is Angel Popova. (…)
LLM evaluation | 0—incorrect
Table 4. Evaluation of the answer LLM results for two example questions.

Question 1 | Who is a jeweler with 13 years of experience?
LLM expected answer | Aaliyah Popova
LLM result, k-limit = 5 | Anna Pan and Aaliyah Popova
LLM evaluation | 1—correct
Question 2 | Who resolved a complex commercial dispute?
LLM expected answer | Baha Hoffman
LLM result, k-limit = 5 | Hiroshi Smirnov
LLM evaluation | 0—incorrect
Table 5. Maximum k-limit for LLM models.

LLM Model | llm-ctx-win-tokens | prompt-tokens | question-tokens | largest-doc-tokens | max-k-limit
Gemini-2.0-flash | 1,048,576 | 50 | 50 | 607 | 1727
Gemini-2.0-flash-lite | 1,048,576 | 50 | 50 | 607 | 1727
Mistral-saba-24b | 32,768 | 50 | 50 | 607 | 53
Gemma2-9b-it | 8192 | 50 | 50 | 607 | 13
Table 6. The vector DB search results (accuracy in %).

Vector Model | Chunk Size | Vector Model/Chunk Size | k-Limit 5 | k-Limit 10 | k-Limit 15 | k-Limit 20
bge-small-en-v1.5 | 10 | small/10 | 64.33 | 75.00 | 80.67 | 80.67
bge-small-en-v1.5 | 25 | small/25 | 88.00 | 96.00 | 97.67 | 97.67
bge-small-en-v1.5 | 50 | small/50 | 84.33 | 90.33 | 95.67 | 95.67
bge-small-en-v1.5 | 100 | small/100 | 66.33 | 87.67 | 90.00 | 90.00
bge-base-en-v1.5 | 10 | base/10 | 59.33 | 65.00 | 72.00 | 72.00
bge-base-en-v1.5 | 25 | base/25 | 78.67 | 88.33 | 89.33 | 89.33
bge-base-en-v1.5 | 50 | base/50 | 78.67 | 90.00 | 96.67 | 96.67
bge-base-en-v1.5 | 100 | base/100 | 80.67 | 85.00 | 85.00 | 85.00
bge-large-en-v1.5 | 10 | large/10 | 65.67 | 89.33 | 91.00 | 91.67
bge-large-en-v1.5 | 25 | large/25 | 86.67 | 96.67 | 99.33 | 99.33
bge-large-en-v1.5 | 50 | large/50 | 86.67 | 95.00 | 97.00 | 97.00
bge-large-en-v1.5 | 100 | large/100 | 67.33 | 84.67 | 88.00 | 88.00
Table 7. The answer LLM response results (accuracy in %).

Vector Model/Chunk Size/k-Limit | Gemini-2.0-Flash | Gemini-2.0-Flash-Lite | Mistral-saba-24b | Gemma2-9b-it
small/25/10 | 95.33 | 95.00 | 89.67 | 90.33
small/25/15 | 97.00 | 96.33 | 91.33 | N/A
large/25/10 | 96.67 | 96.00 | 90.67 | 91.67
large/25/15 | 98.67 | 97.67 | 91.67 | N/A
Note: N/A values represent configurations that exceed the context window limitations of the Gemma2-9b-it model.
Table 8. Practical guidance for RAG system configuration.

Constraint/Goal | Vector Model/Chunk Size/k-Limit | LLM | Justification and Business Impact
Maximized accuracy (cost secondary) | large/25/15 | Gemini-2.0-flash | Achieves the highest empirically validated accuracy (98.67%). Requires maximum computational power for embedding and the highest token cost for inference.
Cost-optimized accuracy (high volume) | small/25/15 | Gemini-2.0-flash-lite | Offers high empirical accuracy (97.67%) with significantly lower computational demands for embedding and 25% cost reduction for LLM inference. Ideal for scalable, budget-conscious deployments.
Resource constrained (on-premise/small model) | small/25/10 | Gemma2-9b-it | Lowest computational complexity and zero variable LLM costs. Necessary for limited hardware or data residency requirements. The limited context window restricts the max k-limit.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
