Design and Performance Evaluation of LLM-Based RAG Pipelines for Chatbot Services in International Student Admissions
Abstract
1. Introduction
- Domain-specific RAG application and evaluation: By applying RAG technology to the real and under-explored domain of graduate school admissions, we developed a chatbot that addresses fragmented information access in the admissions process. Whereas many RAG papers focus on benchmarks and lack evaluations grounded in real service scenarios, this study builds a dataset specialized for college admissions Q&A to verify practical feasibility through real-world performance comparison and evaluation.
- Evaluation refinement: We systematically experiment with RAG pipelines assembled from various sub-technologies (commercial and open-source LLMs, splitters, retrievers, etc.) and precisely compare their performance. Unlike previous studies [25] that relied only on qualitative feedback, this study introduces a dual evaluation method that uses both an LLM-generated and a human-tagged answer set. Against both answer sets, the system is evaluated with RAG pipeline performance metrics (answer relevancy, faithfulness, context recall, and context precision) and heuristic natural language processing metrics (ROUGE, BLEU, METEOR, and SemScore). Comparing the evaluation results across the two answer sets shows which retrieval and generation combination produces the highest-quality answers and provides insights for real-world environments.
- Deployability analysis from a real-use perspective: The system uses both proprietary and open-source models to balance performance and latency. We experimentally evaluate combinations of retrievers and chunking strategies and show that an open-source model, when properly optimized, can match or surpass commercial LLMs such as GPT-4o. To examine efficiency, we measure the response time of each RAG stage at the pipeline level and compare the results dataset by dataset, weighing accuracy (RAG answer quality) against system efficiency (latency). We find that although optimized open-source LLMs can replace the commercial model in terms of answer quality, each open-source specialized model trades off some efficiency. This provides practical guidance for public institutions with limited budgets and computing resources, including national universities and international student support centers. Finally, by discussing real-world applicability, limitations, and improvement strategies for future research, this study offers a scalable and adaptable solution for educational institutions seeking to improve digital support services for international applicants.
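The dual evaluation described above scores each generated answer against two reference sets, an LLM-generated one and a human-tagged one. As an illustrative, dependency-free sketch of the heuristic side of that comparison, the snippet below uses a simplified unigram ROUGE-1 F1; this is a toy re-implementation, not the official metric packages used in the study.

```python
# Simplified unigram ROUGE-1 F1 for dual-reference evaluation (illustrative only;
# the study uses the official ROUGE/BLEU/METEOR/SemScore implementations).

def _tokens(text: str) -> list[str]:
    return text.lower().split()

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference answer and a generated answer."""
    ref, cand = _tokens(reference), _tokens(candidate)
    overlap = sum(min(ref.count(w), cand.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def dual_reference_scores(llm_ref: str, human_ref: str, candidate: str) -> dict:
    """Score one generated answer against both reference answer sets."""
    return {
        "vs_llm_generated": rouge1_f1(llm_ref, candidate),
        "vs_human_tagged": rouge1_f1(human_ref, candidate),
    }
```

Comparing the two scores per question mirrors the paper's observation that human-tagged references tend to yield lower heuristic scores than LLM-generated ones.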
2. Materials and Methods
- i. Dataset collection
- ii. Load and splitting
- iii. Embedding and vector store
- iv. Retriever
- v. Language model (LLM)
- vi. Dual reference QA datasets
- vii. Evaluation
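The splitting, embedding, and retrieval stages listed above can be sketched in a dependency-free form. The actual system uses LangChain's splitters, the sentence-transformers/all-MiniLM-L6-v2 embedding model, and a FAISS vector store; the character-window splitter and bag-of-words "embedding" below are deliberate simplifications for illustration.

```python
# Toy sketch of the split -> embed -> retrieve stages (not the LangChain/FAISS
# implementation used in the paper).
from collections import Counter
import math

def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Fixed-size character windows with overlap, standing in for a recursive splitter."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a dense sentence embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Return the top-k chunks by similarity to the query (dense-retriever analogue)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]
```

The retrieved chunks are then passed as context to the LLM together with the prompt template, which is the generation stage of the pipeline.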
3. Results and Discussion
3.1. Highlighting RAGAS Metrics
3.1.1. Performance Comparison of RAG Pipelines and LLMs with the LLM-Generated QA Dataset
3.1.2. Performance Comparison of RAG Pipelines and LLMs with the Human-Tagged QA Dataset
3.2. Challenges of Human-Curated QA in Retrieval-Augmented Pipelines
3.3. Measuring Latency of GPT-4o and LLAMA3
3.4. Improving the Performance of Ollama Models
- GPU: NVIDIA Quadro RTX 4000 (8 GB VRAM)
- CPU: Intel Core i9-10900X @ 3.70 GHz
- RAM: 16 GB
- OS: Windows 11 Pro
- Framework: Ollama (local deployment, version 0.6.8), LangChain (version 0.3.25), FAISS (version 1.9.0.post1)

- GPU: Intel(R) UHD Graphics 770 (32 GB VRAM)
- CPU: 13th Gen Intel(R) Core(TM) i7-13700
- RAM: 64 GB
- OS: Windows 11 Pro
- Framework: Ollama (local deployment, version 0.6.8), LangChain (version 0.3.25), FAISS (version 1.9.0.post1)
4. Conclusions and Future Works
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
AI | Artificial Intelligence
BLEU | Bilingual Evaluation Understudy
BM25 | Best Matching 25
CPU | Central Processing Unit
DB | Database
FAQ | Frequently Asked Questions
FAISS | Facebook AI Similarity Search
GPT | Generative Pre-trained Transformer
GPU | Graphics Processing Unit
LLAMA | Large Language Model Meta AI
LLM | Large Language Model
MMR | Maximal Marginal Relevance
METEOR | Metric for Evaluation of Translation with Explicit Ordering
NLP | Natural Language Processing
OS | Operating System
RAG | Retrieval-Augmented Generation
RAGAS | Retrieval-Augmented Generation Assessment Score
RAM | Random Access Memory
ROUGE | Recall-Oriented Understudy for Gisting Evaluation
References
- Adeshola, I.; Adepoju, A.P. The opportunities and challenges of ChatGPT in education. Interact. Learn. Environ. 2024, 32, 6159–6172. [Google Scholar] [CrossRef]
- Peyton, K.; Unnikrishnan, S.; Mulligan, B. A review of university chatbots for student support: FAQs and beyond. Discover Education 2025, 4, 21. [Google Scholar] [CrossRef]
- Huang, X. Chatbot: Design, Architecture, and Applications. Bachelor’s Thesis (Senior Capstone Thesis), University of Pennsylvania, Philadelphia, PA, USA, 3 May 2021. [Google Scholar]
- Nobre, G.X.; Moraes, G.; Franco, W.; Moreira, L.O. A Chatbot Approach to Automating FAQ Responses in an Undergraduate Course Domain. Undergraduate Thesis (TCC), Federal University of Ceará, Fortaleza, Brazil, 2019; p. 11. [Google Scholar]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2024, 1, 1–58. [Google Scholar] [CrossRef]
- Li, J.; Cheng, X.; Zhao, X.; Nie, J.-Y.; Wen, J.-R. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 6449–6464. [Google Scholar]
- Hu, Y.; Lu, Y. RAG and RAU: A Survey on Retrieval-Augmented Language Models in Natural Language Processing. arXiv 2024, arXiv:2404.19543. [Google Scholar] [CrossRef]
- Neupane, S.; Hossain, E.; Keith, J.; Tripathi, H.; Ghiasi, F.; Amiri Golilarz, N.; Amirlatifi, A.; Mittal, S.; Rahimi, S. From Questions to Insightful Answers: Building an Informed Chatbot for University Resources. arXiv 2024, arXiv:2405.08120. [Google Scholar] [CrossRef]
- Swacha, J.; Gracel, M. Retrieval-Augmented Generation (RAG) Chatbots for Education: A Survey of Applications. Appl. Sci. 2025, 15, 4234. [Google Scholar] [CrossRef]
- Superbi, J.; Pereira, H.; Santos, E.; Lattari, L.; Castro, B. Enhancing Large Language Model Performance on ENEM Math Questions Using Retrieval-Augmented Generation. In Proceedings of the Brazilian e-Science Workshop, Brasília-DF, Brazil, 14 October 2024. [Google Scholar]
- Xiong, G.; Jin, Q.; Wang, X.; Zhang, M.; Lu, Z.; Zhang, A. Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions. arXiv 2024, arXiv:2310.19988. [Google Scholar]
- Radeva, I.; Doychev, A.; Georgiev, I.; Tontchev, N. CRP-RAG: A Retrieval-Augmented Generation Framework for Supporting Complex Logical Reasoning and Knowledge Planning. Electronics 2024, 14, 47. [Google Scholar]
- Xu, K.; Zeng, Z.; Sun, C.; Liu, J. Web Application for Retrieval-Augmented Generation: Implementation and Testing. Electronics 2023, 13, 1361. [Google Scholar]
- Amugongo, L.M.; Mascheroni, P.; Brooks, S.G.; Doering, S.; Seidel, J. Retrieval Augmented Generation for Large Language Models in Healthcare: A Systematic Review. PLoS Digit. Health 2025, 4, e0000877. [Google Scholar] [CrossRef] [PubMed]
- Meng, F.; Zhang, Y.; Wang, J. Using the Retrieval-Augmented Generation to Improve the Question-Answering System in Human Health Risk Assessment. Electronics 2024, 14, 386. [Google Scholar] [CrossRef]
- Hsain, A.; El Housni, H. Large Language Model-powered Chatbots for Internationalizing Student Support in Higher Education. arXiv 2024, arXiv:2403.14702. [Google Scholar] [CrossRef]
- Nobre, G.X.; Moraes, G.; Franco, W.; Moreira, L.O. A Chatbot Approach to Automating FAQ Responses in an Educational Setting. In Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education (ITiCSE ‘19), Aberdeen, UK, 15–17 July 2019; p. 590. [Google Scholar]
- Silkhi, H.; Bakkas, B.; Housni, K. Comparative Analysis of Rule-Based Chatbot Development Tools for Education Orientation: A RAD Approach. In Proceedings of the 7th International Conference on Networking, Intelligent Systems and Security (NISS 2024), Meknes, Morocco, 18–19 April 2024; ACM: New York, NY, USA, 2024; pp. 1–7. [Google Scholar]
- Han, R.; Zhang, Y.; Qi, P.; Xu, Y.; Wang, J.; Liu, L.; Wang, W.Y.; Min, B.; Castelli, V. RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval-Augmented Question Answering. arXiv 2024, arXiv:2407.13998. [Google Scholar]
- Saad-Falcon, J.; Khattab, O.; Potts, C.; Zaharia, M. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. In Proceedings of the 2024 NAACL-HLT (Long Papers), Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 338–354. [Google Scholar]
- Es, S.; James, J.; Espinosa Anke, L.; Schockaert, S. RAGAs: Automated Evaluation of Retrieval-Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julians, Malta, 17–22 March 2024; Aletras, N., De Clercq, O., Eds.; Association for Computational Linguistics: St. Julians, Malta, 2024; pp. 150–158. [Google Scholar]
- Ru, D.; Qiu, L.; Hu, X.; Zhang, T.; Shi, P.; Chang, S.; Cheng, J.; Wang, C.; Sun, S.; Li, H.; et al. RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation. In Proceedings of the NeurIPS 2024 Datasets and Benchmarks Track; Curran Associates: Red Hook, NY, USA, 2024. [Google Scholar]
- Hou, Y.; Pascale, A.; Carnerero-Cano, J.; Tchrakian, T.; Marinescu, R.; Daly, E.; Padhi, I.; Sattigeri, P. WikiContradict: A Benchmark for Evaluating LLMs on Real-World Contradictory Knowledge. Adv. Neural Inf. Process. Syst. (NeurIPS 2024) 2024, 37, 109701–109747. [Google Scholar]
- Kamalloo, E.; Jafari, A.; Zhang, X.; Thakur, N.; Lin, J. HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution. arXiv 2023, arXiv:2307.16883. [Google Scholar]
- Nguyen, L.; Quan, T. URAG: Implementing a Unified Hybrid RAG for Precise Answers in University Admission Chatbots—A Case Study at HCMUT. Appl. Sci. 2025, 15, 1012. [Google Scholar]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
- Carbonell, J.; Goldstein, J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 335–336. [Google Scholar]
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. arXiv 2020, arXiv:2004.04906. [Google Scholar] [CrossRef]
- Thakur, N.; Reimers, N.; Daxenberger, J.; Gurevych, I. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv 2021, arXiv:2104.08663. [Google Scholar] [CrossRef]
- Ren, Y.; Yu, M.; Xiong, C.; Srinivasan, K.; Wang, Y.; Lin, J. Multi-Vector Retriever: Learning Diverse Representations for Efficient and Effective Dense Retrieval. arXiv 2023, arXiv:2305.14625. [Google Scholar]
- OpenAI. GPT-4o Technical Report. Available online: https://openai.com/index/gpt-4o/ (accessed on 28 June 2025).
- Gruber, J.B.; Weber, M. Rollama: An R package for using generative large language models through Ollama. arXiv 2024, arXiv:2404.07654. [Google Scholar] [CrossRef]
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Aidahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
- OpenChat. Available online: https://huggingface.co/openchat/openchat_3.5 (accessed on 28 June 2025).
- Zephyr. Available online: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta (accessed on 28 June 2025).
- Neural Chat. Available online: https://huggingface.co/Intel/neural-chat-7b-v3 (accessed on 28 June 2025).
- RAGAS. Retrieval-Augmented Generation Evaluation. Available online: https://github.com/explodinggradients/ragas (accessed on 28 June 2025).
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, Proceedings of the ACL-04 Workshop, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics: Los Angeles, CA, USA, 2004; pp. 74–81. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Lavie, A.; Agarwal, A. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proceedings of the ACL Workshop on Statistical Machine Translation, Prague, Czech Republic, 23 June 2007; pp. 228–231. [Google Scholar]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
- Yu, M.; Dong, A.; Wang, J.; Zhang, M.; Sun, Z.; Hase, P.; Kandpal, N.; Reid, M.; Wang, X.; Klein, D.; et al. Agentic RAG: Orchestrating Modular Agents with Language Graphs. arXiv 2023, arXiv:2312.12947. [Google Scholar]
- Yan, S.-Q.; Gu, J.-C.; Zhu, Y.; Ling, Z.-H. Corrective RAG: Feedback-Driven Retrieval-Augmented Generation. arXiv 2024, arXiv:2402.11751. [Google Scholar]
- LangGraph. Available online: https://langchain-ai.github.io/langgraph/ (accessed on 28 June 2025).
Category | Item | Configuration/Description
---|---|---
Hardware Environment | GPU | NVIDIA Quadro RTX 4000 (8 GB) (Nvidia Corporation, Santa Clara, CA, USA)
Hardware Environment | CPU | Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz (Intel Corporation, Santa Clara, CA, USA)
Hardware Environment | RAM | 16 GB
Hardware Environment | OS | Windows 11 Pro
Programming Environment | Python | Python 3.10
Programming Environment | Main Libraries | LangChain, FAISS, HuggingFaceEmbeddings, PyPDFLoader, OpenAI, Ollama, PromptTemplate, RAGAS, RAGAS evaluation, pandas, text_splitter
LLM #1 (Cloud-based) | Model | GPT-4o (OpenAI)
LLM #1 (Cloud-based) | Context Length | 128 k
LLM #1 (Cloud-based) | Hyperparameters | temperature: 0.2, max_tokens: 2048–8192
LLM #2 (Local) | Model | llama3:latest (Ollama)
LLM #2 (Local) | Context Length | 8192
LLM #2 (Local) | Hyperparameters | temperature: 0.2, max_tokens: 512–1024
Embedding Model | Model | sentence-transformers/all-MiniLM-L6-v2
Embedding Model | Vector Dimension | 384 dimensions
Vector DB | Backend | FAISS
Retriever | top-k | 5
Prompt | Structure | “You are an AI assistant helping international students with admission-related questions. …”
Answer Relevancy, Faithfulness, Context Recall, and Context Precision are RAG evaluation (non-LLM-based) metrics; ROUGE, BLEU, METEOR, and SemScore are heuristic (traditional NLP) metrics.

Algo. Name | LLM | Text Split Methods | Retrieval Methods | Answer Relevancy | Faithfulness | Context Recall | Context Precision | ROUGE | BLEU | METEOR | SemScore
---|---|---|---|---|---|---|---|---|---|---|---
Algo-1 | OpenAI (GPT-4o) | Recursive Character Split | MMR | 0.819 ± 0.082 (95% CI [0.736, 0.901]) | 0.807 ± 0.073 (95% CI [0.733, 0.88]) | 0.578 ± 0.11 (95% CI [0.467, 0.689]) | 0.498 ± 0.09 (95% CI [0.407, 0.589]) | 0.37 ± 0.055 (95% CI [0.315, 0.426]) | 0.179 ± 0.049 (95% CI [0.13, 0.229]) | 0.483 ± 0.054 (95% CI [0.428, 0.538]) | 0.901 ± 0.01 (95% CI [0.891, 0.912])
Algo-2 | OpenAI (GPT-4o) | Recursive Character Split | Dense Retriever | 0.817 ± 0.082 (95% CI [0.734, 0.899]) | 0.829 ± 0.074 (95% CI [0.755, 0.904]) | 0.663 ± 0.109 (95% CI [0.553, 0.772]) | 0.555 ± 0.092 (95% CI [0.463, 0.648]) | 0.345 ± 0.054 (95% CI [0.291, 0.4]) | 0.151 ± 0.049 (95% CI [0.101, 0.201]) | 0.463 ± 0.055 (95% CI [0.407, 0.518]) | 0.892 ± 0.011 (95% CI [0.88, 0.903])
Algo-3 | OpenAI (GPT-4o) | Recursive Character Split | Hybrid (BM25 + Dense) | 0.8 ± 0.085 (95% CI [0.714, 0.885]) | 0.853 ± 0.068 (95% CI [0.785, 0.921]) | 0.691 ± 0.106 (95% CI [0.584, 0.797]) | 0.442 ± 0.078 (95% CI [0.364, 0.521]) | 0.357 ± 0.058 (95% CI [0.298, 0.415]) | 0.167 ± 0.051 (95% CI [0.115, 0.219]) | 0.478 ± 0.058 (95% CI [0.42, 0.536]) | 0.894 ± 0.011 (95% CI [0.882, 0.906])
Algo-4 | OpenAI (GPT-4o) | Semantic Chunking | MMR | 0.767 ± 0.093 (95% CI [0.674, 0.86]) | 0.756 ± 0.081 (95% CI [0.674, 0.838]) | 0.527 ± 0.114 (95% CI [0.413, 0.642]) | 0.416 ± 0.093 (95% CI [0.322, 0.51]) | 0.318 ± 0.056 (95% CI [0.262, 0.375]) | 0.14 ± 0.046 (95% CI [0.094, 0.186]) | 0.429 ± 0.06 (95% CI [0.368, 0.49]) | 0.886 ± 0.011 (95% CI [0.874, 0.897])
Algo-5 | OpenAI (GPT-4o) | Semantic Chunking | Dense Retriever | 0.801 ± 0.085 (95% CI [0.715, 0.886]) | 0.864 ± 0.068 (95% CI [0.796, 0.932]) | 0.715 ± 0.105 (95% CI [0.609, 0.821]) | 0.559 ± 0.089 (95% CI [0.469, 0.648]) | 0.349 ± 0.058 (95% CI [0.29, 0.407]) | 0.152 ± 0.047 (95% CI [0.105, 0.2]) | 0.458 ± 0.057 (95% CI [0.4, 0.515]) | 0.892 ± 0.011 (95% CI [0.88, 0.903])
Algo-6 | OpenAI (GPT-4o) | Semantic Chunking | Hybrid (BM25 + Dense) | 0.253 ± 0.1 (95% CI [0.153, 0.353]) | 0.277 ± 0.105 (95% CI [0.171, 0.383]) | 0.458 ± 0.114 (95% CI [0.343, 0.573]) | 0.285 ± 0.087 (95% CI [0.197, 0.372]) | 0.2 ± 0.058 (95% CI [0.142, 0.259]) | 0.079 ± 0.044 (95% CI [0.034, 0.123]) | 0.209 ± 0.064 (95% CI [0.144, 0.274]) | 0.86 ± 0.011 (95% CI [0.849, 0.872])
Algo-7 | Ollama (llama3) | Recursive Character Split | MMR | 0.558 ± 0.111 (95% CI [0.446, 0.67]) | 0.803 ± 0.053 (95% CI [0.75, 0.856]) | 0.544 ± 0.111 (95% CI [0.432, 0.655]) | 0.487 ± 0.092 (95% CI [0.395, 0.579]) | 0.216 ± 0.034 (95% CI [0.181, 0.251]) | 0.057 ± 0.022 (95% CI [0.035, 0.08]) | 0.381 ± 0.04 (95% CI [0.34, 0.422]) | 0.865 ± 0.008 (95% CI [0.857, 0.873])
Algo-8 | Ollama (llama3) | Recursive Character Split | Dense Retriever | 0.658 ± 0.103 (95% CI [0.554, 0.761]) | 0.842 ± 0.053 (95% CI [0.789, 0.895]) | 0.638 ± 0.11 (95% CI [0.528, 0.749]) | 0.557 ± 0.091 (95% CI [0.465, 0.649]) | 0.214 ± 0.031 (95% CI [0.183, 0.245]) | 0.053 ± 0.019 (95% CI [0.034, 0.072]) | 0.388 ± 0.038 (95% CI [0.35, 0.426]) | 0.862 ± 0.007 (95% CI [0.854, 0.869])
Algo-9 | Ollama (llama3) | Recursive Character Split | Hybrid (BM25 + Dense) | 0.68 ± 0.1 (95% CI [0.58, 0.781]) | 0.844 ± 0.055 (95% CI [0.788, 0.9]) | 0.701 ± 0.105 (95% CI [0.595, 0.806]) | 0.427 ± 0.077 (95% CI [0.349, 0.505]) | 0.213 ± 0.03 (95% CI [0.183, 0.244]) | 0.047 ± 0.017 (95% CI [0.029, 0.065]) | 0.386 ± 0.038 (95% CI [0.347, 0.425]) | 0.861 ± 0.007 (95% CI [0.854, 0.868])
Algo-10 | Ollama (llama3) | Semantic Chunking | MMR | 0.478 ± 0.113 (95% CI [0.364, 0.591]) | 0.78 ± 0.059 (95% CI [0.721, 0.84]) | 0.513 ± 0.114 (95% CI [0.398, 0.628]) | 0.424 ± 0.09 (95% CI [0.333, 0.514]) | 0.192 ± 0.033 (95% CI [0.159, 0.226]) | 0.052 ± 0.021 (95% CI [0.031, 0.074]) | 0.373 ± 0.042 (95% CI [0.331, 0.416]) | 0.857 ± 0.007 (95% CI [0.849, 0.865])
Algo-11 | Ollama (llama3) | Semantic Chunking | Dense Retriever | 0.631 ± 0.106 (95% CI [0.525, 0.737]) | 0.837 ± 0.047 (95% CI [0.789, 0.884]) | 0.673 ± 0.11 (95% CI [0.563, 0.783]) | 0.562 ± 0.089 (95% CI [0.473, 0.651]) | 0.212 ± 0.029 (95% CI [0.182, 0.242]) | 0.059 ± 0.018 (95% CI [0.04, 0.077]) | 0.399 ± 0.037 (95% CI [0.361, 0.437]) | 0.861 ± 0.007 (95% CI [0.854, 0.869])
Algo-12 | Ollama (llama3) | Semantic Chunking | Hybrid (BM25 + Dense) | 0.531 ± 0.109 (95% CI [0.422, 0.641]) | 0.823 ± 0.057 (95% CI [0.765, 0.88]) | 0.77 ± 0.096 (95% CI [0.674, 0.867]) | 0.477 ± 0.079 (95% CI [0.398, 0.557]) | 0.202 ± 0.031 (95% CI [0.17, 0.233]) | 0.045 ± 0.016 (95% CI [0.028, 0.061]) | 0.359 ± 0.037 (95% CI [0.322, 0.397]) | 0.857 ± 0.007 (95% CI [0.85, 0.865])
Answer Relevancy, Faithfulness, Context Recall, and Context Precision are RAG evaluation (non-LLM-based) metrics; ROUGE, BLEU, METEOR, and SemScore are heuristic (traditional NLP) metrics.

Algo. Name | LLM | Text Split Methods | Retrieval Methods | Answer Relevancy | Faithfulness | Context Recall | Context Precision | ROUGE | BLEU | METEOR | SemScore
---|---|---|---|---|---|---|---|---|---|---|---
Algo-1 | OpenAI (GPT-4o) | Recursive Character Split | MMR | 0.731 ± 0.101 (95% CI [0.629, 0.833]) | 0.843 ± 0.068 (95% CI [0.774, 0.911]) | 0.451 ± 0.099 (95% CI [0.351, 0.551]) | 0.557 ± 0.088 (95% CI [0.468, 0.645]) | 0.209 ± 0.032 (95% CI [0.176, 0.241]) | 0.034 ± 0.014 (95% CI [0.02, 0.048]) | 0.247 ± 0.031 (95% CI [0.216, 0.278]) | 0.852 ± 0.006 (95% CI [0.846, 0.859])
Algo-2 | OpenAI (GPT-4o) | Recursive Character Split | Dense Retriever | 0.724 ± 0.1 (95% CI [0.623, 0.825]) | 0.823 ± 0.078 (95% CI [0.745, 0.901]) | 0.555 ± 0.098 (95% CI [0.457, 0.654]) | 0.62 ± 0.09 (95% CI [0.529, 0.71]) | 0.209 ± 0.034 (95% CI [0.174, 0.244]) | 0.035 ± 0.013 (95% CI [0.021, 0.048]) | 0.256 ± 0.037 (95% CI [0.218, 0.294]) | 0.853 ± 0.006 (95% CI [0.846, 0.86])
Algo-3 | OpenAI (GPT-4o) | Recursive Character Split | Hybrid (BM25 + Dense) | 0.752 ± 0.095 (95% CI [0.657, 0.847]) | 0.886 ± 0.055 (95% CI [0.83, 0.941]) | 0.57 ± 0.098 (95% CI [0.471, 0.668]) | 0.488 ± 0.08 (95% CI [0.407, 0.568]) | 0.213 ± 0.031 (95% CI [0.181, 0.244]) | 0.036 ± 0.013 (95% CI [0.022, 0.05]) | 0.268 ± 0.033 (95% CI [0.235, 0.301]) | 0.854 ± 0.006 (95% CI [0.847, 0.861])
Algo-4 | OpenAI (GPT-4o) | Semantic Chunking | MMR | 0.651 ± 0.111 (95% CI [0.54, 0.763]) | 0.803 ± 0.067 (95% CI [0.735, 0.87]) | 0.496 ± 0.107 (95% CI [0.389, 0.603]) | 0.595 ± 0.087 (95% CI [0.507, 0.683]) | 0.176 ± 0.029 (95% CI [0.146, 0.205]) | 0.027 ± 0.012 (95% CI [0.015, 0.039]) | 0.214 ± 0.033 (95% CI [0.18, 0.247]) | 0.844 ± 0.007 (95% CI [0.837, 0.851])
Algo-5 | OpenAI (GPT-4o) | Semantic Chunking | Dense Retriever | 0.76 ± 0.096 (95% CI [0.664, 0.856]) | 0.853 ± 0.067 (95% CI [0.786, 0.921]) | 0.63 ± 0.102 (95% CI [0.528, 0.733]) | 0.654 ± 0.088 (95% CI [0.565, 0.743]) | 0.207 ± 0.032 (95% CI [0.174, 0.239]) | 0.035 ± 0.013 (95% CI [0.022, 0.048]) | 0.26 ± 0.035 (95% CI [0.224, 0.295]) | 0.85 ± 0.007 (95% CI [0.843, 0.857])
Algo-6 | OpenAI (GPT-4o) | Semantic Chunking | Hybrid (BM25 + Dense) | 0.062 ± 0.06 (95% CI [0.001, 0.122]) | 0.064 ± 0.062 (95% CI [0.001, 0.127]) | 0.317 ± 0.102 (95% CI [0.215, 0.419]) | 0.358 ± 0.089 (95% CI [0.269, 0.447]) | 0.036 ± 0.01 (95% CI [0.025, 0.046]) | 0.001 ± 0.001 (95% CI [-0.001, 0.003]) | 0.034 ± 0.013 (95% CI [0.021, 0.048]) | 0.817 ± 0.004 (95% CI [0.812, 0.822])
Algo-7 | Ollama (llama3) | Recursive Character Split | MMR | 0.491 ± 0.118 (95% CI [0.372, 0.609]) | 0.846 ± 0.051 (95% CI [0.795, 0.897]) | 0.393 ± 0.094 (95% CI [0.298, 0.488]) | 0.519 ± 0.091 (95% CI [0.427, 0.611]) | 0.127 ± 0.019 (95% CI [0.108, 0.146]) | 0.021 ± 0.008 (95% CI [0.012, 0.029]) | 0.209 ± 0.026 (95% CI [0.182, 0.235]) | 0.832 ± 0.006 (95% CI [0.826, 0.838])
Algo-8 | Ollama (llama3) | Recursive Character Split | Dense Retriever | 0.519 ± 0.117 (95% CI [0.402, 0.636]) | 0.863 ± 0.053 (95% CI [0.809, 0.916]) | 0.509 ± 0.096 (95% CI [0.412, 0.606]) | 0.593 ± 0.087 (95% CI [0.506, 0.681]) | 0.147 ± 0.025 (95% CI [0.122, 0.172]) | 0.029 ± 0.013 (95% CI [0.016, 0.043]) | 0.241 ± 0.031 (95% CI [0.21, 0.272]) | 0.837 ± 0.006 (95% CI [0.83, 0.843])
Algo-9 | Ollama (llama3) | Recursive Character Split | Hybrid (BM25 + Dense) | 0.463 ± 0.118 (95% CI [0.344, 0.581]) | 0.84 ± 0.051 (95% CI [0.788, 0.892]) | 0.551 ± 0.102 (95% CI [0.448, 0.654]) | 0.47 ± 0.085 (95% CI [0.385, 0.555]) | 0.15 ± 0.025 (95% CI [0.125, 0.176]) | 0.03 ± 0.012 (95% CI [0.017, 0.043]) | 0.245 ± 0.029 (95% CI [0.215, 0.274]) | 0.839 ± 0.005 (95% CI [0.833, 0.845])
Algo-10 | Ollama (llama3) | Semantic Chunking | MMR | 0.44 ± 0.12 (95% CI [0.319, 0.56]) | 0.826 ± 0.055 (95% CI [0.771, 0.881]) | 0.474 ± 0.108 (95% CI [0.366, 0.583]) | 0.561 ± 0.094 (95% CI [0.467, 0.656]) | 0.128 ± 0.02 (95% CI [0.107, 0.148]) | 0.02 ± 0.008 (95% CI [0.011, 0.028]) | 0.208 ± 0.024 (95% CI [0.183, 0.232]) | 0.832 ± 0.006 (95% CI [0.825, 0.838])
Algo-11 | Ollama (llama3) | Semantic Chunking | Dense Retriever | 0.409 ± 0.119 (95% CI [0.29, 0.529]) | 0.755 ± 0.066 (95% CI [0.689, 0.821]) | 0.594 ± 0.103 (95% CI [0.491, 0.697]) | 0.661 ± 0.088 (95% CI [0.572, 0.749]) | 0.13 ± 0.019 (95% CI [0.111, 0.15]) | 0.018 ± 0.007 (95% CI [0.01, 0.026]) | 0.226 ± 0.025 (95% CI [0.2, 0.251]) | 0.831 ± 0.005 (95% CI [0.825, 0.836])
Algo-12 | Ollama (llama3) | Semantic Chunking | Hybrid (BM25 + Dense) | 0.373 ± 0.116 (95% CI [0.257, 0.49]) | 0.776 ± 0.064 (95% CI [0.712, 0.841]) | 0.637 ± 0.104 (95% CI [0.532, 0.741]) | 0.475 ± 0.083 (95% CI [0.391, 0.559]) | 0.134 ± 0.023 (95% CI [0.11, 0.158]) | 0.022 ± 0.009 (95% CI [0.013, 0.032]) | 0.2 ± 0.023 (95% CI [0.177, 0.224]) | 0.833 ± 0.005 (95% CI [0.827, 0.838])
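Among the retrieval methods compared above, MMR (Maximal Marginal Relevance, after Carbonell and Goldstein) re-ranks candidates to balance relevance to the query against redundancy with chunks already selected. A minimal sketch over precomputed similarity scores follows; the toy values and dictionary-based similarity lookups are illustrative, not the LangChain implementation used in the experiments.

```python
# Toy MMR re-ranking: greedily pick the document maximizing
# lam * sim(query, d) - (1 - lam) * max over selected s of sim(d, s).

def mmr(query_sim: dict, doc_sim: dict, k: int = 2, lam: float = 0.7) -> list[str]:
    """query_sim maps doc -> sim(query, doc); doc_sim maps (doc_a, doc_b) -> sim."""
    selected, candidates = [], set(query_sim)
    while candidates and len(selected) < k:
        def score(d):
            # Penalize similarity to anything already chosen (redundancy term).
            redundancy = max((doc_sim.get((d, s), doc_sim.get((s, d), 0.0))
                              for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam near 1 this reduces to plain relevance ranking; lower lam trades relevance for diversity, which is why MMR can retrieve broader context than a pure dense retriever.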
Category | Specification |
---|---|
CPU | 13th Gen Intel Core i7-13700 |
GPU | Intel UHD Graphics 770 (32 GB VRAM) |
RAM | 64 GB |
Operating System | Windows 11 Pro |
Python Version | Python 3.13.0 |
Execution Environment | VS Code |
OpenAI Model Access | GPT-4o via API |
Ollama Models | Local deployment (LLaMA3, OpenChat, neural-chat, zephyr) |
Algo. Name | Embedding Time (s) | Retrieval Time (s) | Generation Time (s) | Total Latency (s) |
---|---|---|---|---|
GPT-4o | ||||
Algo-1 | 0.0219 | 0.0079 | 3.004 | 3.0338 |
Algo-2 | 0.0485 | 0.0286 | 2.3772 | 2.4542 |
Algo-3 | 0.0231 | 0.0101 | 2.3699 | 2.403 |
Algo-4 | 0.0174 | 0.0098 | 2.1486 | 2.1758 |
Algo-5 | 0.0204 | 0.0079 | 2.3437 | 2.3721 |
Algo-6 | 0.04014 | 0.03461 | 1.25614 | 1.33089 |
LLAMA3 | ||||
Algo-7 | 0.0447 | 0.0302 | 1.5059 | 1.5809 |
Algo-8 | 0.0273 | 0.0174 | 1.221 | 1.2657 |
Algo-9 | 0.03666 | 0.02672 | 1.28233 | 1.34571 |
Algo-10 | 0.0174 | 0.0066 | 1.3191 | 1.3431 |
Algo-11 | 0.019 | 0.0053 | 1.2576 | 1.2819 |
Algo-12 | 0.03443 | 0.02681 | 1.49136 | 1.5526 |
Algo. Name | Embedding Time (s) | Retrieval Time (s) | Generation Time (s) | Total Latency (s) |
---|---|---|---|---|
GPT-4o | ||||
Algo-1 | 0.0218 | 0.0105 | 2.9996 | 3.0318 |
Algo-2 | 0.02 | 0.0084 | 3.3575 | 3.3858 |
Algo-3 | 0.017 | 0.0092 | 4.0149 | 4.0412 |
Algo-4 | 0.0226 | 0.0095 | 2.9077 | 2.9398 |
Algo-5 | 0.0235 | 0.0125 | 2.6675 | 2.7035 |
Algo-6 | 0.02271 | 0.01816 | 1.37057 | 1.41144 |
LLAMA3 | ||||
Algo-7 | 0.0447 | 0.0302 | 1.5059 | 1.5809 |
Algo-8 | 0.0436 | 0.029 | 1.6173 | 1.6899 |
Algo-9 | 0.03445 | 0.02515 | 1.70024 | 1.75984 |
Algo-10 | 0.0188 | 0.0056 | 1.672 | 1.6964 |
Algo-11 | 0.0197 | 0.0034 | 1.7423 | 1.7654 |
Algo-12 | 0.03464 | 0.0259 | 1.79282 | 1.85336 |
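The stage-wise latencies reported in the tables above can be collected by timing each RAG stage with a monotonic clock and summing the parts into total latency. The sketch below shows one way to instrument a pipeline; the stage functions are hypothetical placeholders, not the actual system's calls.

```python
# Per-stage latency instrumentation for a RAG pipeline (illustrative).
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds) via a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def measure_pipeline(embed_fn, retrieve_fn, generate_fn, query):
    """Time the embedding, retrieval, and generation stages separately."""
    q_vec, t_embed = timed(embed_fn, query)
    docs, t_retrieve = timed(retrieve_fn, q_vec)
    answer, t_generate = timed(generate_fn, query, docs)
    return answer, {
        "embedding_s": t_embed,
        "retrieval_s": t_retrieve,
        "generation_s": t_generate,
        "total_s": t_embed + t_retrieve + t_generate,
    }
```

Averaging these per-query dictionaries over a dataset yields the embedding/retrieval/generation/total columns reported above, where generation dominates total latency for every pipeline.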
Algo. Name | LLM | Text Split Methods | Retrieval Methods | Answer Relevancy | Faithfulness | Context Recall | Context Precision | Average |
---|---|---|---|---|---|---|---|---|
algo-13 | llama3 | Recursive | MultiQuery (dense retriever) | 0.594 ± 0.117 (95% CI [0.477, 0.712]) | 0.874 ± 0.053 (95% CI [0.82, 0.927]) | 0.725 ± 0.078 (95% CI [0.646, 0.804]) | 0.295 ± 0.051 (95% CI [0.244, 0.346]) | 0.622475 |
algo-14 | llama3 | Recursive | MultiVectorRetriever | 0.435 ± 0.119 (95% CI [0.316, 0.554]) | 0.86 ± 0.051 (95% CI [0.808, 0.911]) | 0.404 ± 0.104 (95% CI [0.3, 0.508]) | 0.604 ± 0.093 (95% CI [0.511, 0.698]) | 0.5762 |
algo-15 | OpenChat | Recursive | Dense Retriever | 0.669 ± 0.109 (95% CI [0.559, 0.779]) | 0.724 ± 0.081 (95% CI [0.642, 0.805]) | 0.51 ± 0.097 (95% CI [0.413, 0.607]) | 0.613 ± 0.091 (95% CI [0.522, 0.705]) | 0.6293 |
algo-16 | OpenChat | Recursive | MultiQuery (dense retriever) | 0.728 ± 0.101 (95% CI [0.627, 0.829]) | 0.796 ± 0.076 (95% CI [0.719, 0.872]) | 0.781 ± 0.075 (95% CI [0.706, 0.856]) | 0.264 ± 0.054 (95% CI [0.21, 0.318]) | 0.6426 |
algo-17 | zephyr | Recursive | Dense Retriever | 0.525 ± 0.114 (95% CI [0.411, 0.64]) | 0.818 ± 0.06 (95% CI [0.757, 0.879]) | 0.508 ± 0.098 (95% CI [0.41, 0.606]) | 0.589 ± 0.093 (95% CI [0.496, 0.683]) | 0.6108 |
algo-18 | Neural-chat | Recursive | Dense Retriever | 0.509 ± 0.118 (95% CI [0.39, 0.627]) | 0.62 ± 0.077 (95% CI [0.543, 0.697]) | 0.546 ± 0.093 (95% CI [0.452, 0.64]) | 0.603 ± 0.094 (95% CI [0.508, 0.697]) | 0.5698 |
algo-19 | openchat:7b-v3.5-q6_K | Recursive | Dense Retriever | 0.769 ± 0.092 (95% CI [0.677, 0.861]) | 0.788 ± 0.076 (95% CI [0.711, 0.865]) | 0.535 ± 0.102 (95% CI [0.433, 0.638]) | 0.596 ± 0.093 (95% CI [0.502, 0.69]) | 0.6726 |
algo-20 | openchat:7b-v3.5-q6_K | Recursive | MultiQuery (dense retriever) | 0.755 ± 0.095 (95% CI [0.66, 0.85]) | 0.792 ± 0.078 (95% CI [0.714, 0.87]) | 0.696 ± 0.086 (95% CI [0.61, 0.783]) | 0.571 ± 0.081 (95% CI [0.49, 0.652]) | 0.7041 |
algo-21 | openchat:7b-v3.5-q6_K | Semantic Chunking | MultiQuery (dense retriever) | 0.712 ± 0.103 (95% CI [0.608, 0.815]) | 0.73 ± 0.082 (95% CI [0.648, 0.813]) | 0.616 ± 0.1 (95% CI [0.516, 0.717]) | 0.51 ± 0.078 (95% CI [0.431, 0.588]) | 0.6424 |
algo-22 | openchat:7b-v3.5-0106-fp16 | Recursive | MultiQuery (dense retriever) | 0.864 ± 0.066 (95% CI [0.798, 0.93]) | 0.843 ± 0.066 (95% CI [0.777, 0.909]) | 0.617 ± 0.093 (95% CI [0.523, 0.711]) | 0.589 ± 0.078 (95% CI [0.51, 0.667]) | 0.7287 |
algo-23 | openchat:7b-v3.5-0106-fp16 | Recursive | MultiQuery (hybrid (bm25 + dense)) | 0.785 ± 0.088 (95% CI [0.696, 0.873]) | 0.857 ± 0.076 (95% CI [0.78, 0.934]) | 0.685 ± 0.093 (95% CI [0.591, 0.778]) | 0.622 ± 0.075 (95% CI [0.546, 0.697]) | 0.7377 |
Algo. Name | LLM | Text Split Methods | Retrieval Methods | Answer Relevancy | Faithfulness | Context Recall | Context Precision | Average |
---|---|---|---|---|---|---|---|---|
algo13 | openchat:7b-v3.5-q6_K | Recursive | MultiQuery (dense) | 0.859 ± 0.067 (95% CI [0.791, 0.927]) | 0.817 ± 0.075 (95% CI [0.741, 0.892]) | 0.78 ± 0.094 (95% CI [0.686, 0.874]) | 0.498 ± 0.079 (95% CI [0.419, 0.578]) | 0.7388 |
algo14 | openchat:7b-v3.5-0106-fp16 | Recursive | MultiQuery hybrid (dense + bm25) | 0.849 ± 0.071 (95% CI [0.777, 0.921]) | 0.797 ± 0.083 (95% CI [0.713, 0.881]) | 0.814 ± 0.086 (95% CI [0.728, 0.901]) | 0.501 ± 0.075 (95% CI [0.426, 0.577]) | 0.7408 |
Algo. Name | Embedding Time (s) | Retrieval Time (s) | Generation Time (s) | Total Latency (s) |
---|---|---|---|---|
algo-13 | 0.0257 | 3.0356 | 8.732 | 11.7933 |
algo-14 | 0.0184 | 6.1295 | 9.1957 | 15.3435 |
Algo. Name | Embedding Time (s) | Retrieval Time (s) | Generation Time (s) | Total Latency (s) |
---|---|---|---|---|
algo-13 | 0.0185 | 0.0045 | 1.3331 | 1.356 |
algo-14 | 0.02071 | 0.0077 | 1.58391 | 1.61232 |
algo-15 | 0.0177 | 0.5505 | 2.3262 | 2.8944 |
algo-16 | 0.0144 | 0.5663 | 1.4342 | 2.0149 |
algo-17 | 0.0189 | 0.0063 | 2.3132 | 2.3384 |
algo-18 | 0.0339 | 0.0244 | 1.0605 | 1.1188 |
algo-19 | 0.0377 | 0.0258 | 1.4223 | 1.4858 |
algo-20 | 0.0334 | 0.6657 | 1.4803 | 2.1794 |
algo-21 | 0.0337 | 0.7902 | 1.8809 | 2.7048 |
algo-22 | 0.022 | 2.7563 | 5.4991 | 8.2774 |
algo-23 | 0.0115 | 6.1369 | 11.2502 | 17.3986 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Khasanova Zafar kizi, M.; Suh, Y. Design and Performance Evaluation of LLM-Based RAG Pipelines for Chatbot Services in International Student Admissions. Electronics 2025, 14, 3095. https://doi.org/10.3390/electronics14153095