Article

A Comparative Performance Analysis of Locally Deployed Large Language Models Through a Retrieval-Augmented Generation Educational Assistant Application for Textual Data Extraction

by
Amitabh Mishra
* and
Nagaraju Brahmanapally
Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA
*
Author to whom correspondence should be addressed.
AI 2025, 6(6), 119; https://doi.org/10.3390/ai6060119
Submission received: 16 May 2025 / Revised: 2 June 2025 / Accepted: 4 June 2025 / Published: 6 June 2025

Abstract

Background: Rapid advancements in large language models (LLMs) have significantly enhanced Retrieval-Augmented Generation (RAG) techniques, leading to more accurate and context-aware information retrieval systems. Methods: This article presents the creation of a RAG-based chatbot tailored for university course catalogs, aimed at answering queries related to course details and other essential academic information, and investigates its performance across several locally deployed large language models. By leveraging multiple LLM architectures, we evaluate the performance of the models under test in terms of context length, embedding size, computational efficiency, and relevance of responses. Results: The experimental analysis, which builds on recent comparative studies, reveals that while larger models achieve higher relevance scores, they incur greater response times than smaller, more efficient models. Conclusions: The findings underscore the importance of balancing accuracy and efficiency for real-time educational applications. Overall, this work contributes to the field by offering insights into optimal RAG configurations and practical guidelines for deploying AI-powered educational assistants.

1. Introduction

The landscape of higher education across the world is characterized by a complex array of institutions. Ranging from large universities with substantial enrollments to smaller colleges that often struggle to meet their enrollment targets, these educational entities are shaped by their unique histories, cultural contexts, and evolving needs. Over time, these factors have led to the emergence of shared characteristics among institutions while also fostering significant diversity. This variety of institutional types exemplifies best practices in many respects, yet it can also complicate the implementation of nationwide reforms when they are required.
To achieve the objectives set by educational institutions, it is essential to integrate academic advising into the academic framework of the institution. Effective academic advising plays a vital role in aligning students’ academic opportunities with their future aspirations. It offers a supportive structure that assists students in navigating challenges and fosters personal development, which is essential for both academic achievement and career readiness.

1.1. Importance of Academic Advising in Student Success and Institute Metrics

Academic advising has been found to be essential to student success. It includes objectives such as assisting students with degree planning, facilitating experiential learning opportunities, and providing techniques for time and stress management, along with various other facets of college life. This process not only enhances students’ academic success but also fosters a supportive and inclusive atmosphere within the department. Advisors must recognize each student as a distinct individual with specific goals, interests, and challenges.
In higher education, excellent academic advising has also been connected to student loyalty and satisfaction, which improves student retention [1]. The caliber of advising has a significant influence on a student’s academic achievement [2]. It is important that academic advisors maintain positive rapport and interaction with advisees so that students’ academic progress is ensured. Because the demand for advisors is high and their expectations and responsibilities are substantial, advising is a time-intensive task that must be handled sensitively.

1.2. Challenges Faced in Academic Advising

Higher education students encounter many obstacles when seeking academic advising, such as advisors’ limited availability, scheduling conflicts, and inconsistent advice accuracy, which often results in advisors providing inadequate or erroneous information [3,4,5]. From the academic advisors’ perspective, the reported issues involve understanding the technological aspects of electronic student reports, advising as a task often performed in solitude, unclear demands on faculty advisors, and workload imbalance [6].
Using faculty advisers in higher education institutions has not proved effective, as they may not always have the time or attention to devote to general or individualized academic advising. With 20 teaching hours a week, research activities, and other administrative responsibilities, the advisor’s schedule leaves little time for one-on-one interactions with 25 to 30 advisees [3,7,8,9].
Current research examining the workplace health of higher education professionals indicates that academic advisors are likely also experiencing high levels of burnout involving emotional, mental, or physical fatigue [10,11]. Academic advisor burnout may also be caused by administrative demands, emotional labor, compassion fatigue, a heavy workload, and a lack of institutional resources and support [12]. The number of students an advisor advises each academic year, the advisor’s tenure in the role, the type or types of students they advise most frequently, and the advisor’s location are all factors associated with advising outcomes. Institutions may suffer because of advisor burnout and attrition [13], and there are strong correlations between burnout and employee attrition among academic advisors [14]. The factors mentioned above have long been used in research related to academic advisor burnout or linked to the effectiveness of advising [15,16,17,18]. According to the study in [19], 39% of academic advisors indicated they were inclined to look for another role, and this staff group had the highest turnover intention percentage.

1.3. Ways of Addressing the Issues in Academic Advising

Advisors were more likely to feel appreciated and stay in their role if they perceived the leadership and organizational culture as encouraging, open, and empowering. Fostering a positive atmosphere can lower advisor burnout and eventual turnover by implementing transformational leadership practices that emphasize personalized support and employee motivation. Advisors’ dedication and sense of belonging can be enhanced by a supportive workplace culture [20]. The present work can be an important step in supporting academic advisors by helping them quickly arrive at appropriate suggestions and directions for students, thus helping address the issues discussed above.

1.4. AI for Academic Advising Support

Recent advances in artificial intelligence (AI) involving language models can help advisors immensely with their quick information requirements, reducing the turnaround and wait times experienced by students.
This growing reliance on AI for academic assistance has prompted increasing interest in generative AI applications in education. Recent studies have shown the potential of large language models (LLMs) integrated with Retrieval-Augmented Generation (RAG) to enhance access to academic support.
For instance, Golla [21] demonstrated how RAG-based chatbots could enhance student engagement by adapting responses to learner profiles, while Wijaya and Purwarianti [22] implemented a RAG-supported intelligent tutoring system tailored for programming education. More targeted toward academic advising, QuIM-RAG [23] and SYLLABUSQA [24] introduced domain-specific QA datasets based on real course materials but stopped short of deploying advisor-supportive chatbots in practical university settings.
These studies illustrate the foundational potential of generative AI in educational support; however, a critical research gap remains in applying fully open-source, locally deployed models to real-time course advising tasks. This work addresses that gap by designing and benchmarking a student-facing and advisor-supportive chatbot based on real course catalog data.
With their unparalleled ability to comprehend and produce human language, large language models (LLMs) have emerged as crucial tools in a wide range of applications, from chatbots and virtual assistants to content production and translation services, and they have the potential to revolutionize how we engage with technology.
The publication of a study introducing “attention” as a concept in the field of AI marked a paradigm shift in natural language processing [25]. This seminal article presented the transformer model, an architectural breakthrough that offered a previously unseen approach to sequential language problems such as translation.
Previous AI models that processed sequences serially were fundamentally different from the transformer model, as it analyzed many input sequence segments concurrently, assessing each segment’s significance according to the job [26,27,28]. By addressing the intricacy of long-range connections in sequences, this novel processing allowed the model to extract the essential semantic information required for a job [29,30].
Almost all the most advanced generative LLMs use some variation of the transformer’s original architecture because it was such a significant breakthrough [31]. Autoregressive transformers provide remarkable efficiency in language-related tasks by modeling intricate sequential dependencies via parallelizable self-attention [32,33]. When GPT-3 (Generative Pretrained Transformer-3) surpassed classical and modern techniques in various language-related challenges and revealed an extraordinary grasp of human language, it brought to light the promise of these models [34]. Pretrained models, such as Claude or GPT-4, have shown remarkable human-like text production and generalization across natural language tasks [35,36]. The Llama 2 model from Meta AI, released in 2023, was part of a suite of models that prioritized accessibility and efficiency, enabling high-performance language modeling at a lower resource cost. The goal of this model family was to enable the AI community to conduct a wider variety of research and application development.
Notwithstanding concerns related to possible abuse and ethical challenges, transformers have established themselves as the top contenders in multimodal generation and applications involving language modeling. Traditional natural language processing (NLP) and natural language generation (NLG) have been boosted by LLMs such as OpenAI’s GPT series [35,37]. These models are exceptional in producing language content that is relevant, logical, and strikingly human-like.
Automated study tools that can support individualized learning experiences and automatically produce content that caters to different learning styles are available through generative AI-enhanced educational platforms today. Although generative and discriminative models both have unique advantages and disadvantages, determining when to use which paradigm necessitates weighing several important considerations.

2. Literature Survey

Recent advancements in RAG have enabled the development of intelligent systems capable of delivering context-aware responses, particularly in specialized domains such as education and industry-specific documentation. Multiple studies have explored the application of RAG pipelines in chatbot systems, each with a unique focus on domain-specific customization, performance optimization, and model evaluation. However, only a select few have focused on academia.
Taha in [37] used an XML-user-based Collaborative Filtering (CF) system for course recommendations. The article in [38] suggested the use of AI to solve campus problems. Al-Hunaiyyan et al. [39] performed a study using a conceptual model that can provide intelligent academic advice using adaptive, knowledge-based feedback. Lucien [40] presented an artificial intelligence-enabled chatbot for student advising with the AI technology available then. Akiba and Fraboni [41] suggested a system in which the questions were typed into the free version of ChatGPT. None of these earlier attempts at technologically supported academic advising focused on developing a locally deployed application along the same or similar lines as the approach presented here, because the required technologies either were not available or were in a nascent stage at the time.
Among these, some studies have targeted educational use cases for RAG. Golla [21] proposed a RAG-based chatbot that adapts responses to enhance learner engagement, while Wijaya and Purwarianti [22] integrated history-aware retrievers into a tutoring system to support programming instruction. Although promising, these systems focus on generalized or content-based tutoring rather than structured academic logistics.
Closer to goals like ours, QuIM-RAG [23] and SYLLABUSQA [24] built QA datasets from course catalogs and syllabi but emphasized dataset curation and retrieval quality over deployment or advisor support. Our work complements these efforts by delivering a functional, scalable chatbot powered by open-source models, specifically tuned for course registration queries, advising support, and institutional integration.
In the automotive domain, Ref. [42] optimized PDF (Portable Document Format) parsing and retrieval strategies tailored for industry documents, proposing self-RAG agents and context compression techniques. Their work emphasizes backend RAG architecture rather than front-end user interaction. As mentioned previously, Ref. [21] leveraged models such as Gemma2, Mistral, and Llama 3.2 to deliver adaptive educational content based on learner profiles. However, their scope extended beyond course-specific queries to personalized learning pathways.
Several studies have benchmarked open-source and proprietary LLMs for educational Q&A (Question and Answer) tasks. For instance, Ref. [43] conducted a comparative evaluation of GPT-3.5 turbo, Gemini Pro, and Llama 3 using the RAGAS framework, showing the superiority of GPT-3.5 turbo in generating accurate responses. As pointed out previously, Ref. [23] introduced an interactive RAG-based tutoring system in the programming domain, integrating history-aware retrievers and evaluating user interaction through qualitative scoring. Despite notable progress, these studies either target broad educational content or focus on limited domains like programming or high school curricula [44,45].
More aligned with our objective are works like the previously mentioned systems in QuIM-RAG [23] and SYLLABUSQA [24], which constructed domain-specific QA datasets from course catalogs and syllabi, respectively. QuIM-RAG emphasized enhanced retrieval mechanisms using inverted question matching, while SYLLABUSQA curated a diverse set of real-world QA (Question and Answer) pairs, fine-tuned Llama models, and introduced Fact-QA for answer factuality assessment. However, both studies prioritized dataset development and evaluation over system deployment and real-time usability.
Furthermore, institutional case studies like BARKPLUG V.2 [46] and Unimib Assistant [47] demonstrate the growing interest in using RAG-based systems for campus-specific information retrieval. Although these systems provide a student-focused interface, challenges remain, such as broken links, limited scalability, and reliance on proprietary models like GPT-4, which limit generalizability and cost-effectiveness.
Despite this extensive body of work, a notable research gap exists in developing a scalable, student-friendly RAG-based chatbot tailored for course registration queries using fully open-source LLMs and real university catalog data. Our proposed system distinguishes itself in the following key ways:
  • The authors constructed the knowledge base from actual course catalog data spanning 13 subjects from the University of West Florida, formatted into structured CSV files for accurate context retrieval.
  • The authors integrated open-source LLMs such as Phi-4 (14.7B), Llama 3 (8B), Llama 3.1 (8B), and Llama 3.2 (3.21B) into a RAG pipeline, ensuring cost-effective and locally deployable alternatives to commercial APIs.
  • The authors’ chatbot directly addresses common, yet often overlooked, “low-stakes” student queries, which many students feel uncomfortable asking academic advisors, thus improving student confidence and experience.
  • The system additionally acts as a support tool for academic advisors, allowing them to focus on strategic advising by reducing time spent on repetitive course-related questions.
While performance parameters such as lexical similarity, mean cosine similarity, and CLIP score have previously been evaluated for some of these models, and benchmarks such as adversarial NLI and stereotype measurements have been developed [47,48], the work presented in this article is novel in its own right. To the best of the authors’ knowledge, this is the first work that combines domain-specific university course data, open-source LLMs, and RAG pipelines to build a practical and deployable academic chatbot assistant. The contributions of the present work are not limited to improving model performance but also extend to real-world usability, system scalability, and institutional relevance, addressing both student-centric and advisor-support perspectives.

3. Transformer Networks

While transformer networks and retrieval-augmented generation (RAG) frameworks are well-established in the field of natural language processing, we provide a focused overview in this section to highlight how these architectures directly address the specific challenges of educational advising.
Academic advising involves the frequent retrieval of factual, structured information (e.g., course schedules, prerequisites, and instructor names) from domain-specific sources. Unlike creative writing or open-ended chat tasks, these advising tasks require high factual precision and context retention, making the technical underpinnings of RAG critical to application success. Often, the turnaround time is of prime importance, as advising requests are frequently made in bulk and require decisions before close deadlines, overwhelming advisors. A support system such as ours goes a long way in enabling quick searches and responses for advising work. This section lays the groundwork for understanding how the proposed system leverages transformer-based LLMs in conjunction with retrieval mechanisms to meet the demands of real-time, semantically grounded advising support.
Numerous contemporary language-processing systems are built on the foundation of the original transformer design, which was initially introduced by Vaswani et al. in 2017 [25]. Since it forms the basis of the GPT series of models and numerous other cutting-edge generative techniques, the transformer might be regarded as the most significant development in the field of GenAI.
The self-attention mechanism, a novel method that captures intricate interactions between various pieces inside an ordered data stream, is at the heart of transformer design. Depending on the level of granularity selected for tokenization, these components—known as tokens—represent words in a sentence or characters in a word.
The attention concept in this architecture allows a model to concentrate on the important characteristics of the incoming data while de-emphasizing less important ones. This technique enhances the model’s comprehension of sentence context and the relative importance of each word.
The encoder and the decoder, the transformer’s two primary sections, each contain several layers of self-attention. The encoder determines relationships between the various positions in the input sequence, while the decoder attends to the encoder’s output and uses a form of self-attention known as masked self-attention to avoid considering outputs it has not yet produced.
The degree of concentration on various input components depends mostly on the attention weights, which are calculated using the scaled dot product of the query and key vectors. Furthermore, multi-head attention allows the model to focus on several positions in the data at once.
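For reference, this scaled dot-product attention takes the standard form introduced in [25], where Q, K, and V denote the query, key, and value matrices and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```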
Finally, the model uses a technique called positional encoding to preserve the data’s sequential order. This technique ensures that the model maintains the original order of data throughout its processing, which is essential for jobs requiring a comprehension of sequence or temporal dynamics.
To address generative tasks, transformer mechanisms can be used in conjunction with other cutting-edge methods. Different methods for managing text and image production have resulted from this progression.

3.1. Retrieval-Augmented Generation Frameworks

The advent of LLMs has revolutionized the development of query-driven retrieval systems by significantly improving their accuracy and responsiveness [42]. In recent years, retrieval-augmented generation (RAG) frameworks have emerged as a powerful approach to integrating extensive knowledge bases with generative capabilities, enabling applications ranging from automotive document processing [42] to educational chatbots [21,45,46,49].

3.1.1. Choice of RAG

A major issue that businesses encounter when implementing LLMs in the workplace is getting the models to comprehend their proprietary enterprise data. The most popular method for adding enterprise data to LLMs is RAG. For instance, businesses employ RAG to provide LLMs with domain-specific knowledge gleaned from user manuals or support articles to guarantee that chatbots driven by LLMs provide accurate, pertinent responses.
RAG frameworks use LLMs to create more accurate, data-driven textual and grounded responses by combining information retrieval and generation. Applications that require access to certain knowledge bases and Q&A systems find this combination especially helpful, as it was in our case. RAG frameworks are adaptable to a range of applications since they can be customized to fit certain domains or internal knowledge bases of businesses.
LLMs on their own often suffer from knowledge cutoffs and hallucinations; through RAG they can access and use external data, which helps them produce more thorough and context-aware responses. By accessing pertinent information from outside sources, RAG systems lower the possibility of such “hallucinations”, which can produce inaccurate and often misleading information, and thereby deliver more accurate and trustworthy responses. This is particularly vital in academic advising contexts, where misinformation (e.g., stating the wrong prerequisite or course format) can have serious consequences, such as loss of time, money, and other resources, in addition to delayed graduation, affecting graduation metrics and other state performance indicators. RAG reduces this risk by anchoring responses in verifiable institutional data.
This adaptability to specific domains and internal knowledge bases was another reason behind the choice of the RAG framework for our chat application.

3.1.2. Overview of RAG

Recent advancements in LLMs have significantly improved their ability to generate human-like responses. RAG combines retrieval-based and generative approaches to enhance response accuracy by dynamically incorporating external knowledge sources.
RAG’s main idea is to improve LLM outcomes by adding pertinent context from outside data sources. These resources ought to offer precise and validated data to support model results. Furthermore, by collecting few-shot samples at inference time to direct generation, RAG can potentially take advantage of the few-shot technique. This method only obtains pertinent examples when required, eliminating the need to store examples in the prompt chain. It offers grounding, “few-shot” learning, chaining, and structure. In the context of university course catalogs, this few-shot capability and retrieval grounding allow the model to dynamically respond to diverse student queries without relying on brittle, rule-based systems or maintaining extensive manual FAQs, which might still not be comprehensive or applicable to every single student. The RAG approach is essentially a combination of various prompt engineering strategies. The RAG framework consists of two main components:
  • Retrieval Module: This component searches for a knowledge base (e.g., vector database, document store) to fetch relevant information based on the user query.
  • Generation Module: The retrieved context is then provided to a large language model, which uses it to generate more precise and contextually relevant responses.
By integrating retrieval into the generative process, RAG enables models to provide factual, up-to-date, and contextually accurate answers without requiring direct memorization of all possible knowledge.

3.1.3. The RAG Pipeline

RAG offers several significant advantages. First, the indexed external data overcomes the statelessness of the LLM by functioning as a type of memory, like the chaining technique. Second, because instances are filtered and only supplied when requested, this memory can quickly expand, exceeding the constraints of the model context window. Also, RAG makes it possible to generate accurate and dependable content in ways that would otherwise be impossible. The workflow of a RAG system is illustrated in Figure 1 and consists of the following key stages:
  • Document Ingestion and Preprocessing: Raw text data (from CSVs, PDFs, databases, or web pages) is collected and cleaned. Text is divided into chunks of manageable size to enhance retrieval efficiency.
  • Embedding and Indexing: Each text chunk is converted into a vector representation using an embedding model (e.g., Ollama’s mxbai-embed-large:334M parameters, OpenAI’s text-embedding-ada-002, BERT, or SBERT). These embeddings are stored in a vector database (e.g., ChromaDB, FAISS, Pinecone) for efficient similarity searches.
  • Query Processing and Retrieval: When a user submits a query, it is converted into a vector using the same embedding model. A similarity search is performed in the vector database to retrieve the most relevant text chunks.
  • Prompt Augmentation: The retrieved text chunks are combined with the user’s query to form an enriched prompt for the LLM.
  • Response Generation: The LLM generates a response based on the retrieved information, ensuring greater factual accuracy and contextual relevance.
Figure 1 illustrates the process of retrieval-augmented generation, highlighting the interaction between document storage, retrieval, and the language model’s response generation.
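As a concrete illustration, the stages above can be sketched in a few lines of Python. The snippet below is a minimal, simplified outline rather than the authors’ production code: it assumes the ollama and chromadb Python packages, a locally running Ollama server with the mxbai-embed-large and Llama 3.1 models pulled, and pre-chunked catalog text; the collection name, paths, and prompt wording are illustrative.

```python
import ollama       # client for the local LLM and embedding server
import chromadb     # local vector database

# Persistent vector store configured for cosine-distance similarity search.
client = chromadb.PersistentClient(path="./catalog_db")
collection = client.get_or_create_collection(
    name="courses", metadata={"hnsw:space": "cosine"})

def embed(text: str) -> list[float]:
    # Stage 2: convert a text chunk (or a query) into a dense vector.
    return ollama.embeddings(model="mxbai-embed-large", prompt=text)["embedding"]

def ingest(chunks: list[str]) -> None:
    # Stages 1-2: store pre-chunked catalog text together with its embeddings.
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        embeddings=[embed(c) for c in chunks],
        documents=chunks,
    )

def answer(query: str, k: int = 5) -> str:
    # Stage 3: embed the query and retrieve the k most similar chunks.
    hits = collection.query(query_embeddings=[embed(query)], n_results=k)
    context = "\n".join(hits["documents"][0])
    # Stage 4: augment the prompt with the retrieved context.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # Stage 5: generate the grounded response with a local model.
    return ollama.generate(model="llama3.1:8b", prompt=prompt)["response"]
```

A production pipeline additionally handles document cleaning, chunk sizing, prompt templating, and fallback behavior, as described in Section 5.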

4. The Test Case and Its Choice

This article explores the application of RAG-based systems in the context of a university course catalog, aiming to develop a chatbot capable of addressing student queries regarding course details, schedules, professor names, and other vital academic information. University course catalogs offer a wealth of information that is both rich and semi-structured, presenting challenges in data extraction and real-time retrieval. Most of the time, this information is required by current or prospective students and is frequently accessed by student advisors, who rely on their prior knowledge and experience to retrieve it when advising students. Extracting pertinent information from official documents or web portals, which follow no single standard design or well-laid-out rules, is often very challenging and time consuming, and such information is needed so frequently that student advisors are always very busy. Whether students seek this information themselves or advisors need it for advising, considerable time is spent, and often lost, before correct and crucial coursework-related decisions can be made, and accurate information is essential for these time-critical decisions. The latest advances in AI could help this use case immensely by providing a freely available application for such information gathering.
While the problem of accessing academic information may appear to be primarily one of productivity, the real-world implications of misinformation can be significant. For instance, if students receive incorrect or incomplete responses about prerequisites, schedule conflicts, or course formats (e.g., online vs. in-person), they may enroll in inappropriate courses, delay their graduation timeline, or miss time-sensitive opportunities. Similarly, academic advisors—already constrained by high advising loads—may inadvertently propagate errors if their tools lack accuracy or contextual awareness.
Thus, even marginal differences in model accuracy become meaningful in practice. For example, a model that misstates the existence of a prerequisite only 5% of the time could still mislead dozens of students per semester. Consequently, our model comparison is driven not merely by academic curiosity but by the need to identify configurations that balance semantic relevance, response time, and resource efficiency—criteria that directly impact user trust and system usability.
These practical concerns underscore the necessity of benchmarking model variants across both accuracy and latency metrics to support responsible deployment in high-stakes educational contexts.
In response to these challenges, the comparative performance analysis presented in this article evaluates multiple LLMs—specifically, models with varying parameters, context lengths, and embedding sizes—to determine the optimal configuration for enhancing the performance of course catalog chatbots [22,43]. For instance, while larger models such as Phi-4:14.7B demonstrate superior relevance in response generation [42,50], they also incur a higher computational overhead [46]. Conversely, smaller models like Llama 3.2:3.21B tend to provide faster responses with acceptable relevance scores [24]. While our application focuses on a specific use case, the results could be generalized and applied across many similar use cases.

4.1. Choice of Models Used in the Performance Analysis

LLMs commonly contain a very large number of parameters: numerical values that describe and weight the connections between nodes and layers in the neural network architecture. Adjusting the parameters changes the weights of different values, which alters the model’s interpretation of different data points, words, and relationships, as well as what it prioritizes in the prompt and data.
LLMs are remarkably fast at generating text or images in response to a user prompt. They use their parameters to predict the next word in a sequence. This means that they guess the word most likely to follow the prompt, then the word most likely to follow that first predicted word, and so on, until the model considers the most likely pattern complete. In a similar manner, image-generating models create visuals by predicting an image that corresponds to the user’s prompt description.
OpenAI-developed LLMs, such as GPT-3 and GPT-4, can process and generate text at remarkable speed. GenAI interactions seem human-like because of their quick response times, seemingly sophisticated comprehension, and fluent use of natural language.
The selection criteria for the models used in the presented work were grounded in several key factors:
  • Context Length: Since university course descriptions and schedules contain extensive structured information, models with longer context windows were prioritized to ensure effective retrieval-augmented generation.
  • Embedding Dimensions: Higher embedding dimensions facilitate richer vector representations, enhancing the model’s ability to understand nuanced course details and improve response quality.
  • Computational Efficiency: Given the constraints of real-time query processing in chatbot applications, models with an optimal balance between performance and memory footprint were considered [46].
  • Integration with Frameworks: Since the chatbot implementation leveraged LangChain for retrieval and prompt augmentation, the selected models needed to be seamlessly integrated into this framework [21,22].
For several applications, small language models would be sufficient, depending on the amount of data and the computation requirements. The recent literature provides a robust foundation for our approach. Prior studies have demonstrated the effective application of RAG techniques across various domains, including automotive industry document processing [42], personalized educational content delivery [21], and real-time question-answering systems [24,43]. Moreover, comparative analyses and surveys [22,23] have highlighted the trade-offs between model size, relevance accuracy, and response time—an aspect that is central to our research.
Table 1 presents the models chosen for experimentation, highlighting their key architectural features.

4.2. Rationale for Model Selection

The capabilities of different LLMs vary depending on the task at hand. Some models excel at text-based applications, while others are better suited to image, video, or sound-based applications. Certain LLMs can be very creative at text generation, while others excel at understanding complex information and summarizing it. The overall quality of the response and the latency encountered while obtaining replies to user queries can also be important factors in the choice of LLMs for text-based applications.
The details of the models tested and compared by the authors are as follows:

4.2.1. Phi-4 (14.7B Parameters, 16K Context)

With a size of 14.7B parameters, Phi-4 is Microsoft’s small language model that specializes in complex reasoning and can be used in fields beyond traditional language processing, such as mathematics. Because Phi-4 uses synthetic datasets, curated organic data, and post-training improvements, it is well suited to reasoning about mathematical challenges.
For both classic machine learning and generative AI applications, Phi-4 offers customers a wide range of tools to assist enterprises in measuring, mitigating, and managing AI risks throughout the AI development lifecycle. Using both bespoke and built-in metrics, developers may iteratively evaluate the safety and quality of models and applications to guide mitigations.
Additionally, Phi users have access to content safety features like groundedness detection, protected material detection, and prompt shields. With a single API, developers can effortlessly incorporate these features into their applications and use them as content filters with several language models. With real-time notifications, developers can monitor their application for data integrity, safety, and quality, as well as adversarial prompt attacks, and take rapid corrective action.
The model was selected for its higher parameter count, which enhances response coherence and factual accuracy. Furthermore, the model offers a moderate context window (16K tokens), suitable for processing multi-turn queries on course prerequisites, schedules, and faculty details.

4.2.2. Llama Models

Meta created the open-source GenAI models known as Llama models, families of large language models ranging from a few billion to tens of billions of parameters. Fine-tuned variants of these models are best suited for conversational applications. Llama models can offer adaptable solutions for a range of uses. They make GenAI a flexible tool for both developers and non-developers by enabling users to deploy its features without requiring a deep understanding of code. The Llama versions belong to a family of models that prioritize accessibility and efficiency, enabling powerful language modeling at a lower resource cost. The goal of the Llama models is to enable a wider variety of AI research and application development. The authors used the Llama models below to develop their application.
  • Llama 3 (8B Parameters, 8K Context)
This model was chosen for its balance between performance and resource efficiency. Despite an 8K token context window, it provides strong retrieval performance when coupled with LangChain.
  • Llama 3.1 (8B Parameters, 131K Context)
The authors included this model due to its extensive context length (131K tokens), making it ideal for handling long-form retrieval tasks where multiple courses or full curricula need to be compared. The model maintains efficient computation despite a large context window, ensuring practical deployment.
  • Llama 3.2 (3.21B Parameters, 131K Context)
The model was selected as a lighter alternative with an extremely long context window, enabling retrieval over large course catalogs while minimizing the computational overhead.
By evaluating these models, the comparative performance analysis aimed to determine which LLM configuration provides the most accurate and context-aware responses for university course-related queries while also trying to develop a benchmarking mechanism for the choice of an LLM model for similar use cases. The comparative analysis of model performance is discussed in subsequent sections.

4.2.3. Infrastructure Configuration and Environment Setup

The evaluations were performed on a virtual machine running Ubuntu 24.04.1 LTS as its operating system. The system had a total of 62 GB (gigabytes) of memory, of which 61 GB was available, and 28 virtual CPUs, but no dedicated GPU. All model inferences were executed entirely on the CPU. The total available storage was 98 GB, with 56 GB free during testing. The latency results reported in this study therefore reflect CPU-only performance. The software environment for creating the application used Python version 3.12.3 and Ollama version 0.3.13. Given this configuration, larger models like Phi-4:14.7B exhibited greater response times (averaging ~38 s). In GPU-accelerated environments, we expect latency to be significantly lower, and adding GPUs to the system would clearly improve its performance metrics. Our setup represents a realistic scenario for institutions deploying LLM-based tools on standard CPU infrastructure without access to GPUs.

5. Methodology of Evaluation

This comparative performance analysis presents a RAG framework for an intelligent chatbot that facilitates university course inquiries by integrating structured data retrieval with generative language modeling tested on multiple models. Unlike conventional chatbots that rely solely on pretrained language models, the proposed system leverages a hybrid approach, where a retrieval mechanism first extracts relevant course information from a structured knowledge base and a generative model subsequently formulates a precise response. This methodology ensures that the system provides accurate, up-to-date, and contextually relevant information, thereby mitigating the risk of hallucinated or incorrect responses. Implementation follows a multi-stage process comprising data collection and extraction, embedding and storage, retrieval and augmentation, and response generation. The user query is transformed into embeddings, which are then used to search for relevant information in the vector database. The retrieved information, along with the original user query, is incorporated into the prompt to generate a response. This process is depicted in Figure 2.

5.1. Data Collection and Extraction

The first stage of the pipeline involves the acquisition and preprocessing of course-related information from the university’s course catalog. The course catalog data, consisting of key attributes, such as course title, subject, course number, section number, credit hours, course reference number (CRN), term, instructor, meeting times, type, building, room, start date, end date, campus, seats remaining, schedule type, and additional course attributes, is critical for the research. However, the university did not provide a publicly available Application Programming Interface (API) to directly access this information. The course data displayed on the university’s course search website is sourced from a combination of multiple databases and tables, which have restricted access. Consequently, data extraction from this source required the development of a custom web scraping script.
To gather the necessary data, a manual search was conducted for each subject (e.g., COP for Computer Programming) to list all the courses available for a given semester (e.g., Spring 2025). The Hyper Text Markup Language (HTML) content of the <tbody> tag, which contains the course details in individual rows (<td>), was then extracted manually and saved into .html files.
Subsequently, a Python script was developed to automate the extraction process from the saved HTML files. This script utilizes BeautifulSoup, a powerful library for parsing HTML content, to extract the same course attributes listed above. The processed data is then saved into CSV files for further analysis.
The entire data collection process is illustrated in Figure 3. This approach ensures a systematic and efficient method of data extraction, eliminating manual repetition and enabling scalability for future semesters.
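The extraction step can be illustrated with a short script of the kind described above. This is a simplified sketch rather than the authors’ exact code: it assumes the saved .html files contain course rows as <tr>/<td> cells inside a <tbody>, and the directory names and column list are hypothetical placeholders.

```python
import csv
from pathlib import Path
from bs4 import BeautifulSoup  # HTML parsing library used for extraction

# Illustrative subset of the catalog attributes described in the text.
FIELDS = ["Course Title", "Subject", "Course Number", "Section", "Credit Hours",
          "CRN", "Instructor", "Meeting Times", "Building", "Room", "Campus"]

def parse_catalog_html(html_path: Path) -> list[list[str]]:
    """Return one list of cell texts per course row found in the saved page."""
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    rows = []
    for tr in soup.select("tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:                              # skip empty spacer rows
            rows.append(cells[:len(FIELDS)])
    return rows

def write_csv(rows: list[list[str]], out_path: Path) -> None:
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(rows)

if __name__ == "__main__":
    Path("csv_out").mkdir(exist_ok=True)
    for html_file in Path("saved_pages").glob("*.html"):
        write_csv(parse_catalog_html(html_file), Path("csv_out") / f"{html_file.stem}.csv")
```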

5.2. Data Embedding and Storage

Once the structured course data was extracted and preprocessed, the next stage involved transforming the text-based information into a high-dimensional numerical representation through an embedding model. This comparative performance analysis utilized mxbai-embed-large, a Bidirectional Encoder Representations from Transformers-based (BERT-based) encoder optimized for semantic representation learning, to generate dense vector embeddings for each course entry. The embedding process captures contextual relationships between course descriptions and related attributes, facilitating efficient similarity-based retrieval during query resolution.
This model was selected primarily for its compatibility with Ollama, the local hosting environment used for all LLMs in this study. Its seamless integration allowed us to maintain a consistent, fully offline retrieval-augmented generation (RAG) pipeline with no reliance on external APIs or cloud infrastructure.
While other well-known embedding models, such as SBERT or OpenAI’s text-embedding-ada-002, are widely benchmarked, they were not evaluated here. SBERT was not considered due to a lack of native support in our deployment stack, and OpenAI embeddings were deliberately excluded to preserve the cost-free, private, and reproducible nature of the system. The use of mxbai-embed-large ensured a smooth local deployment pipeline with effective semantic retrieval performance in our domain-specific context.
The embeddings were subsequently stored in ChromaDB, a high-performance vector database specifically designed for approximate nearest neighbor (ANN) searches. ChromaDB provides efficient indexing mechanisms that enable low-latency retrieval of semantically relevant course details based on user queries. The decision to employ a vector database was driven by its scalability, query efficiency, and ability to support real-time information retrieval, thereby ensuring that the chatbot can dynamically fetch the most pertinent course details without compromising performance.
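A minimal sketch of this embedding-and-storage step is shown below, assuming the langchain_community integrations for Ollama and Chroma; the CSV path and the way each catalog row is flattened into a single text chunk are illustrative choices, not necessarily the authors’ exact configuration.

```python
import csv
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

def load_course_chunks(csv_path: str) -> list[str]:
    # Flatten each catalog row into one "field: value" text chunk for embedding.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [", ".join(f"{k}: {v}" for k, v in row.items())
                for row in csv.DictReader(f)]

embeddings = OllamaEmbeddings(model="mxbai-embed-large")   # local BERT-based encoder
chunks = load_course_chunks("csv_out/COP_spring2025.csv")  # illustrative file name

# Embed every chunk and persist the vectors in a local ChromaDB collection.
vector_store = Chroma.from_texts(
    texts=chunks,
    embedding=embeddings,
    persist_directory="./chroma_courses",
)
```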

5.3. Retrieval and Augmentation

In the retrieval phase, the system employs a semantic search mechanism to fetch the most contextually relevant course details corresponding to the user’s query. When a query is received, the chatbot first converts the user’s input into an embedding vector using the same mxbai-embed-large model. This vector representation is then utilized to perform a similarity search within the ChromaDB vector store using the similarity_search_with_score method, which applies a k-nearest neighbors (k-NN) algorithm (k = 5, selected empirically) based on cosine similarity. The value of k was chosen after preliminary testing, where it offered a strong balance between retrieval depth and prompt length limitations. No further hyperparameter tuning was required, as the default ChromaDB behavior with cosine distance yielded consistent and semantically appropriate results across the test set. This semantic search approach improves retrieval robustness, allowing the system to return relevant context even when student queries are paraphrased, loosely structured, or missing exact field terms—a common limitation in traditional keyword-based methods.
Following retrieval, the extracted course details are formatted and incorporated into a predefined prompt template before being passed to the generative language model. This augmentation step is crucial in ensuring that the language model operates strictly within the boundaries of the retrieved knowledge, thereby preventing hallucination or speculative responses. The prompt explicitly instructs the model to base its response exclusively on the retrieved context, with a built-in safeguard that prompts the model to return a fallback response if insufficient information is available. The structured prompt is formulated as follows:
“You are a university course expert with knowledge of course details such as course name, description, schedule, professor names, and mode of instruction. Answer questions based only on the context provided and make no assumptions beyond the available information. Provide a direct response without extraneous details. If the context is insufficient, state: ‘Sorry, I do not have enough context to answer your question’.”
This prompt template ensures that the responses generated remain factual, precise, and aligned with the retrieved data, minimizing the risk of erroneous or misleading outputs.
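To make the retrieval-and-augmentation step concrete, the sketch below wires the persisted vector store to a local model through the langchain_community wrappers. The system prompt is the one quoted above; the store path, model tag, and exact prompt assembly are illustrative assumptions rather than the authors’ exact implementation.

```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama

SYSTEM_PROMPT = (
    "You are a university course expert with knowledge of course details such as "
    "course name, description, schedule, professor names, and mode of instruction. "
    "Answer questions based only on the context provided and make no assumptions "
    "beyond the available information. Provide a direct response without extraneous "
    "details. If the context is insufficient, state: 'Sorry, I do not have enough "
    "context to answer your question'."
)

# Reopen the persisted ChromaDB collection with the same embedding model.
vector_store = Chroma(
    persist_directory="./chroma_courses",
    embedding_function=OllamaEmbeddings(model="mxbai-embed-large"),
)
llm = Ollama(model="llama3.1:8b")  # any of the locally hosted models under test

def answer_query(query: str, k: int = 5) -> str:
    # k = 5 nearest neighbors by cosine similarity, as chosen empirically above.
    hits = vector_store.similarity_search_with_score(query, k=k)
    context = "\n".join(doc.page_content for doc, _score in hits)
    prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt)
```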

5.4. Response Generation

Once the relevant course details are retrieved and incorporated into the structured prompt, the LLM processes the input and formulates a response that adheres strictly to the provided context. This approach ensures that the chatbot operates within a controlled knowledge environment, where responses are directly correlated with the retrieved course records rather than inferred from a generalized language model. The system’s zero-assumption policy further reinforces the integrity of the responses by explicitly rejecting queries for which sufficient context is unavailable.
For instance, if a student queries, “who teaches Systems Design II course?”, the retrieval module first fetches the relevant course entry from the database. If an instructor is listed in the dataset, the chatbot generates a response such as “John Doe (Primary) teaches the Systems Design II course.” Conversely, if no matching record is found, the chatbot returns, “Sorry, I do not have enough context to answer your question.” This systematic approach prevents the dissemination of incomplete or speculative information, thereby enhancing the chatbot’s credibility and reliability in academic advising contexts.

5.5. Types of Questions Tested on the Chatbot

The RAG element receives the user queries and uses them to execute a semantic search. It leverages embeddings built beforehand from domain-specific data sources that are semantically rich and well defined. RAG uses the initial prompt as a search query: the retrieval algorithm takes the query as input and uses vector matching to identify the most pertinent snippets from the indexed data, ranking the results according to semantic significance using a similarity metric. Finally, data from the retrieved contexts is added to the initial prompt, which is then sent to the LLM to produce a response grounded in the external information. The following are the types of questions that the authors used in their queries.

5.5.1. Instructor-Related Queries

Instructor-related queries are an important aspect of the university course catalog, as students often seek information about who will be teaching a specific course. These queries typically include questions regarding the name of the instructor and the courses they are teaching in a particular semester. For instance, queries such as “What are the courses taught by Dr. Doe?” or “Which professor is teaching ‘Programming Languages’ (COP 4556) in Spring 2025?” were used to assess the model’s ability to retrieve accurate instructor information. These queries were tested to ensure that the chatbot could correctly identify instructor names and correlate them with their respective courses in the catalog.

5.5.2. Course Availability and Format

These queries address students’ needs for information about the availability of seats on specific courses as well as whether the course is offered online, in-person, or in a hybrid format. Examples include questions like “Is the ‘Database Systems’ course offered online?”, “Is there a hybrid option for any courses?”, “List 5 courses that are taught online?”, and “Is ‘Intermediate Python Programming’ (COP 3456) available via Distance Learning?” The chatbot was tested on its ability to pull this information from the course catalog and provide accurate responses about course capacity and format.

5.5.3. Course Schedule and Location

Course schedule and location queries are crucial for students when planning their courses. These queries seek information such as time slots, building locations, and room numbers for a particular course. For example, questions like “Is ‘Data Structures I’ (COP 3002) available in the morning or in the afternoon?”, “Where will the ‘Secure Software Development’ course be taught?”, “What are the meeting times for the ‘Algorithm and Program Design’ course?”, and “What is the building location for the course ‘Intermediate Computer Programming’ taught by Dr. John Doe?” were used to assess how well the chatbot could pull this time-sensitive information.

5.5.4. Course Identification and Details

Course identification and detail queries are concerned with the fundamental attributes of courses, such as course title, description, course number, and credit hours. Students often seek information to help them decide whether a course aligns with their academic plans. Queries such as “What is the description of the course COP 6416?” and “Are there any courses that offer 1.5 credits?” were tested to evaluate the chatbot’s ability to extract and present course details. These questions are central to the functionality of the RAG chatbot, as they ensure the retrieval of basic course-related information from the catalog.
The diversity and ambiguity of student queries observed during initial prototype testing further motivated the use of RAG-based architecture. Many students do not phrase their questions using exact course titles or catalog terminology. Instead, they ask context-dependent, paraphrased, or multi-intent questions such as “What advanced courses do I need before taking AI?” or “Is Python taught online this semester, and who teaches it?”
Traditional keyword-based or rule-based systems struggle to accommodate this variability, often returning no results or irrelevant entries if the query phrasing fails to match exact field names. In contrast, RAG pipelines leverage semantic embeddings and generative modeling to return contextually relevant responses even for rephrased, indirect, or compound queries.
While our current dataset includes attributes such as the course title, schedule, instructor, and delivery mode, we acknowledge that prerequisite information—although publicly available—is not included in this version of the chatbot. Extracting this data requires a JavaScript-based pop-up to be dynamically triggered for each course and its content parsed, which created practical limitations during the scraping phase. We recognize the importance of prerequisite data in academic planning and aim to incorporate it in future releases through improved scraping pipelines or API access (if made available).

5.5.5. Course Recommendations and Campus-Based Queries

Course recommendation queries typically involve asking for advice on course selection based on prerequisites, academic goals, or available slots. Questions such as “What courses are recommended for a Computer Science major?”, “I am interested in algorithms, are there any related courses?”, or “What are the courses taught in the Pensacola campus?” were designed to test the chatbot’s ability to suggest courses based on available data. Additionally, campus-based queries that provide insights into campus-specific details (e.g., campus location) were included. These queries assess the chatbot’s ability to assist students in navigating the university’s course offerings based on their specific academic needs or logistical considerations.

6. Results and Discussion

This is an era that is becoming more and more AI-centered, but we should not ignore the intricate difficulties that lie ahead, including the potential misuse of AI tools, unpredictable implications, and the deep moral questions that underlie AI adoption. The development of generative AI offers possibilities as well as obstacles.

6.1. Analysis of Model Performance for Course Catalog Query Processing

The evaluation of Llama 3:8B, Llama 3.1:8B, Llama 3.2:3.21B, and Phi-4:14.7B (Table 2) reveals a clear trade-off between response relevance and latency in processing course catalog queries.
To evaluate the relevance and factual accuracy of model responses, a human assessment was conducted by two reviewers—the co-authors of this study, one of whom is a student researcher and the other a faculty member and advisor. Each evaluator independently rated the responses of all the models across a set of 25 representative queries.
Given that the chatbot’s outputs are intended to match structured, publicly available course catalog data (e.g., instructor names, schedules, course formats), our evaluation involved direct comparison against known ground truth. Relevance was scored on a 6-point scale (0–5), where higher scores reflected greater alignment with the catalog information.
In cases where the evaluators disagreed by more than one point, the final score was determined by joint review and discussion. For one-point differences, an average score was recorded. Although we did not compute formal inter-rater reliability statistics due to the small reviewer pool, we observed high consistency in ratings. Future work may incorporate a larger group of evaluators and agreement metrics to further strengthen reproducibility. The evaluators rated the relevance of the responses on a scale from 0 to 5, where
  • 5 = Completely relevant (matches ground truth exactly)
  • 4 = Mostly relevant (minor errors but captures intent)
  • 3 = Somewhat relevant (contains useful information but lacks clarity or depth)
  • 2 = Partially relevant (only a small portion is useful)
  • 1 = Barely relevant (mostly incorrect)
  • 0 = Not relevant at all (wrong answer)
Fallback responses such as “Sorry, I do not have enough context to answer your question” were handled consistently across all models. These were typically assigned a score of 2 or below, depending on whether the fallback was appropriate based on the retrieved context. If the context truly lacked the required information, a fallback was considered an acceptable response and scored 2 (partially relevant). If relevant information was available in the context but the model still defaulted to the fallback, it was penalized more heavily (score of 1 or 0) for failing to utilize the retrieved data.
This consistent scoring rule was applied across all models to ensure fair evaluation, and fallback behavior was not penalized arbitrarily but based on the accuracy and completeness of the retrieved content.
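For concreteness, the reconciliation rule described above can be expressed as a short function. The following Python sketch is illustrative only; the function name, the manual-review flag, and the sample ratings are assumptions, not code or data from the study.

```python
def reconcile_scores(score_a: int, score_b: int):
    """Combine two reviewers' relevance ratings on the 0-5 scale.

    - Identical ratings are kept as-is.
    - A one-point difference is recorded as the average.
    - A larger difference is flagged for joint review and discussion
      (returned as None here, so it is resolved manually).
    """
    diff = abs(score_a - score_b)
    if diff == 0:
        return float(score_a)
    if diff == 1:
        return (score_a + score_b) / 2
    return None  # disagreement of more than one point: joint review required


# Illustrative usage with hypothetical ratings for a single query
print(reconcile_scores(4, 4))  # 4.0
print(reconcile_scores(4, 3))  # 3.5 (one-point difference -> average)
print(reconcile_scores(5, 2))  # None (flagged for joint review)
```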
The average relevance score for each model was calculated across 25 queries. To determine whether the observed differences in the relevance scores across models were statistically significant, we conducted a one-way analysis of variance (ANOVA). The test compared the relevance scores for all four models: Phi-4:14.7B, Llama 3:8B, Llama 3.1:8B, and Llama 3.2:3.21B. The ANOVA results showed no statistically significant difference in the mean relevance scores (F(3, 96) = 1.864, p = 0.1408). While Phi-4:14.7B demonstrated a numerically higher average, the variance among models was not sufficient to reach statistical significance at the 0.05 threshold. These findings suggest that although differences exist, they should be interpreted with caution, and practical factors like latency and infrastructure compatibility remain critical in model selection.
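The ANOVA itself can be reproduced with standard statistical tooling. The sketch below uses SciPy's f_oneway as one possible way to run the test; the score lists are placeholders rather than the study's ratings (25 values per model in the actual evaluation), so the output will not match the reported F(3, 96) = 1.864.

```python
# One-way ANOVA across the four models' per-query relevance scores.
# The score lists are placeholder values, NOT the study's data.
from scipy import stats

scores = {
    "Phi-4:14.7B":     [5, 4, 5, 5, 4],
    "Llama 3:8B":      [4, 4, 5, 3, 4],
    "Llama 3.1:8B":    [4, 5, 4, 4, 4],
    "Llama 3.2:3.21B": [3, 4, 4, 3, 5],
}

f_stat, p_value = stats.f_oneway(*scores.values())
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A p-value above 0.05, as reported in the study, indicates that the
# difference in mean relevance scores is not statistically significant.
```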
Phi-4:14.7B, the largest model with 14.7 billion parameters (Table 1), achieved the highest average relevance score of 4.68, indicating superior semantic comprehension and contextual retention. This performance is attributed to its extended context length (16,384 tokens) and large embedding size (5120), enabling it to handle complex academic queries with greater precision. However, this comes at the cost of a significantly higher latency, averaging 37.94 s per query, which may be impractical for real-time retrieval scenarios.
In contrast, Llama 3.2:3.21B, the smallest model evaluated, offered the fastest response time of 7.17 s but had the lowest average relevance score (3.88). These results reflect its limitations in comprehending detailed course catalog data, likely due to its smaller embedding dimension (3072) and reduced parameter count of 3.21B (Table 1).
The Llama 3:8B and Llama 3.1:8B models strike a better balance between accuracy and efficiency. Both scored an average relevance of 4.2 while offering moderate response times of 14.20 and 16.43 s, respectively. Llama 3.1:8B stands out with its extended context length of 131,072 tokens, significantly higher than Llama 3:8B’s 8192 tokens, allowing it to better process longer and more structured course descriptions and schedules. The increased context length improves relevance but also introduces a slight latency overhead. Here, latency is the time elapsed between the submission of a query and the appearance of the model’s response, measured programmatically with system timing calls.
To evaluate latency, we used system time measurements to record the time from the moment a prompt was issued until the model returned a response. Given the requirements for academic information retrieval, where both accuracy and efficiency are critical, Llama 3.1:8B emerges as the most promising candidate for structured retrieval-based applications, offering a favorable trade-off between response quality and processing latency.
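As an illustration of this timing approach, the following sketch wraps a request to a locally running Ollama server (default REST endpoint assumed) with a system timer. The model name and prompt are examples, not the exact queries used in the evaluation.

```python
# Measuring per-query latency with a system timer, assuming a local Ollama
# server on its default port (11434). Model name and prompt are illustrative.
import time
import requests

def timed_query(model: str, prompt: str):
    start = time.perf_counter()                      # timestamp when the prompt is issued
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    elapsed = time.perf_counter() - start            # elapsed wall-clock time in seconds
    return resp.json().get("response", ""), elapsed

answer, latency = timed_query("llama3.1:8b", "Which courses are offered online this fall?")
print(f"Response time: {latency:.2f} s")
```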

6.2. Analysis of Model Response Time Variability

The response time for each model varied notably across different queries, highlighting how architectural differences and model sizes impact runtime efficiency. This variability is especially important when considering the practical deployment of these models in real-time academic information retrieval systems. The response time analysis (Table 3) highlights significant variations in processing efficiency among the evaluated models. Phi-4:14.7B, despite achieving the highest relevance score in previous evaluations, exhibits the greatest response time variability, with a maximum latency of 82.51 s and an average response time of 37.94 s. This indicates a substantial computational overhead, likely due to its larger parameter count (14.7B) and extended context length (16,384 tokens, as per Table 1), which enhance its semantic understanding but introduce processing delays. On the other hand, Llama 3.2:3.21B, the smallest model, demonstrates the fastest response times, with a minimum of 5.89 s and an average of 7.17 s, making it the most efficient in terms of query resolution speed. However, its limited parameter size (3.21B) and smaller embedding length (3072) may contribute to lower response relevance, as observed in previous evaluations.
Among the mid-sized models, Llama 3:8B and Llama 3.1:8B strike a better balance between response speed and quality. Llama 3:8B shows a relatively stable response time range (11.56 to 20.51 s, averaging 14.20 s), making it a consistent performer with minimal latency fluctuations. Llama 3.1:8B, while offering a slightly higher average response time of 16.43 s, experiences a wider range of response times (13.33 to 26.71 s), potentially due to its longer context length (131,072 tokens, Table 1), which enhances multi-turn query processing but introduces additional computational complexity. These findings suggest that for applications requiring both response relevance and efficiency, Llama 3:8B and Llama 3.1:8B offer a balanced trade-off, while Phi-4:14.7B is best suited for high-accuracy tasks where response latency is less critical.

6.3. Analysis of Relevance Score Distribution Across Models

The task at hand, the amount and quality of available data, the intended output, and the required level of performance are among the variables that influence the selection of an appropriate generative model. Because their training objective is to capture complex hidden relationships between inputs and expected outputs, generative models can require substantial processing power, and large amounts of computation and data are needed to model these relationships adequately.
Table 4 provides a detailed breakdown of how often each model produced responses within three score ranges: low relevance (0–1), moderate relevance (2–3), and high relevance (4–5). The distribution reveals that Phi-4:14.7B consistently generates the most highly relevant responses, with 23 out of 25 responses (92%) scoring in the 4–5 range and none in the lowest (0–1) range. This aligns with its higher parameter count (14.7B) and extended context length (16,384 tokens, as seen in Table 1), which likely enable it to capture complex relationships within course catalog queries more effectively. Conversely, Llama 3.2:3.21B, despite being the fastest model in terms of response time (Table 3), exhibits a slightly lower proportion of high-relevance responses (68%) and a relatively higher occurrence of moderate scores (24%). This suggests that while it excels in efficiency, it may struggle to maintain the same level of contextual depth and accuracy as larger models.
Among the mid-sized models, Llama 3:8B and Llama 3.1:8B demonstrate similar relevance distributions, with Llama 3.1:8B achieving a slightly higher proportion of highly relevant responses (80%) than Llama 3:8B (72%). However, Llama 3.1:8B has a slightly wider spread, with two low-relevance responses, indicating occasional inconsistencies. These results suggest that while Phi-4:14.7B is the strongest in terms of response quality, Llama 3.1:8B presents a competitive alternative with a good balance of speed and accuracy. Meanwhile, Llama 3.2:3.21B emerges as the most efficient but at the cost of slightly lower relevance scores. This trade-off between response speed and accuracy is a crucial consideration when selecting an appropriate model for real-world applications, particularly in developing retrieval-augmented generation-based university course catalog systems, where both timely responses and high factual accuracy are critical.
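A distribution of this kind can be tallied with a simple binning step. The sketch below, using pandas with placeholder scores, shows one way to group per-query ratings into the three bands of Table 4; it is not the script used in the study.

```python
# Tallying one model's per-query ratings into the three relevance bands of Table 4.
# The scores below are hypothetical placeholders.
import pandas as pd

scores = pd.Series([5, 4, 2, 5, 3, 4, 4, 1, 5, 4])
bands = pd.cut(scores, bins=[-0.5, 1.5, 3.5, 5.5], labels=["0-1", "2-3", "4-5"])
print(bands.value_counts().sort_index())   # counts per band, in band order
```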

6.4. Analysis of Average Response Time vs. Relevance Score

The data in Table 5 and the average response time vs. relevance score graph (Figure 4) illustrate the relationship between response latency and the quality of the generated outputs across different models. As observed, Phi-4:14.7B consistently exhibits the highest response times, with a notable increase beyond a relevance score of 3, reaching its peak around relevance score 4. This trend aligns with Phi-4’s larger parameter count (14.7B) and extended context length (16,384 tokens, Table 1), which likely contribute to its increased computational complexity and processing time. The model’s ability to generate highly relevant responses, as previously shown in Table 4, comes at the expense of response efficiency, which is a crucial trade-off when selecting a model for real-time applications.
Conversely, Llama 3.2:3.21B emerges as the most efficient model across all the relevance scores, maintaining a low response time even for highly relevant responses (scores 4 and 5). Its smaller parameter size (3.21B) and optimized architecture allow for faster inference times, making it a viable option for applications requiring low-latency responses. However, the previous relevance distribution analysis (Table 4) suggests that while it delivers faster responses, its ability to produce highly relevant answers is slightly inferior to that of Phi-4 and other Llama-3 variants. This makes Llama 3.2:3.21B a strong candidate for situations where speed is prioritized over absolute response quality.
Among the Llama-3 models, Llama 3.1:8B consistently exhibits higher response times than Llama 3:8B across all relevance scores. Notably, for a relevance score of 4, Llama 3.1:8B reaches a peak of 17.63 s, compared to 13.67 s for Llama 3:8B, indicating that its expanded context length (131,072 tokens) comes at the cost of increased latency. This suggests that while Llama 3.1:8B may be better suited to handling long-context queries, its gain in relevance does not fully compensate for the additional processing time. The results highlight the trade-offs between model size, response latency, and output quality, which should be carefully evaluated when deploying models for domain-specific applications such as course catalog retrieval in university systems. Figure 4 visualizes these trade-offs between response time and relevance score.
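A Table 5-style summary can be produced by grouping per-query measurements by relevance score. The following pandas sketch uses placeholder rows to show the aggregation; it is not the script that generated the reported numbers.

```python
# Building a Table 5-style summary: mean response time per relevance score,
# per model, plus a grand mean. The DataFrame contents are placeholders.
import pandas as pd

df = pd.DataFrame({
    "model":     ["Llama3:8B", "Llama3:8B", "Phi4:14.7B", "Phi4:14.7B"],
    "relevance": [4, 5, 4, 5],
    "time_s":    [13.7, 14.6, 44.3, 37.8],
})

table = df.pivot_table(index="relevance", columns="model", values="time_s", aggfunc="mean")
table["Grand Total"] = df.groupby("relevance")["time_s"].mean()   # mean over all responses at each score
print(table)
```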

6.5. Context-Driven Recommendations and Trade-Offs

Based on the comparative evaluation of relevance and latency across models, we provide the following recommendations tailored to typical academic advising scenarios:
  • For time-sensitive, low-complexity queries (e.g., “When is this course offered?”, “Is it taught online?”), Llama 3.2:3.21B is the most suitable model. Its high responsiveness (average ~7.2 s) and acceptable accuracy make it ideal for real-time interfaces or high-throughput systems where speed is a priority.
  • For moderately complex, multi-attribute queries (e.g., “Which courses does Dr. Doe teach that are available online?”), Llama 3.1:8B offers a strong balance between context depth and latency. Its extended context window (131K tokens) supports richer multi-turn queries while maintaining sub-20 s response times.
  • For high-stakes, accuracy-critical tasks (e.g., cross-checking instructor details, advanced course planning), Phi-4:14.7B delivers the best semantic alignment and factual precision. However, due to its high latency (~38 s average), it is better suited for asynchronous systems or backend batch processing rather than live chats.
These recommendations reflect practical trade-offs between accuracy, responsiveness, and deployment feasibility. Institutions may select models based on the available hardware, the expected query volume, and the criticality of the advising task. Additionally, hybrid approaches that route queries to different models based on complexity may offer the best balance for real-world deployment.
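One possible form of such a hybrid approach is a lightweight query router. The sketch below uses an assumed keyword-count heuristic and the locally deployed model tags as an illustration; the thresholds are assumptions and were not tuned or validated in this study.

```python
# A hypothetical complexity-based router for choosing which local model
# handles a query. Keywords, thresholds, and model tags are illustrative.
def route_query(query: str) -> str:
    attribute_keywords = {"instructor", "online", "prerequisite", "schedule", "campus", "credit"}
    hits = sum(1 for kw in attribute_keywords if kw in query.lower())
    if hits >= 3:                               # accuracy-critical, multi-attribute question
        return "phi4:14.7b"
    if hits == 2 or len(query.split()) > 15:    # moderately complex query
        return "llama3.1:8b"
    return "llama3.2:3.21b"                     # simple, time-sensitive lookup

print(route_query("Is this course taught online?"))   # routed to llama3.2:3.21b
print(route_query("Which online courses with no prerequisite does the instructor teach on the Pensacola campus?"))  # routed to phi4:14.7b
```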

6.6. Error Analysis and Observed Limitations

An error analysis was conducted by reviewing low-scoring responses (relevance score ≤ 3) across models. Several common failure patterns were observed:
  • Misinterpretation of ambiguous terms (Phi-4:14.7B): For the query “Is there a hybrid option for any courses?”, the model returned a fallback response: “Sorry, I don’t have enough context.” This indicates that although course modality data (e.g., online, classroom) was available, the model did not associate the term “hybrid” with that field, highlighting a semantic gap in interpreting user phrasing.
  • Lack of structured contact information (Phi-4:14.7B): When asked “How can I contact Professor Jane Doe?”, the model again defaulted to a fallback. While the professor was identified as an instructor in previous queries, the chatbot could not return the contact details (which were not part of the dataset), revealing a boundary between domain coverage and user expectations.
  • Incomplete retrieval comprehension (Llama 3:8B): In response to “What is the schedule for courses taught by Dr. Doe?”, the model returned raw time data but lacked context, such as the course name. This suggests that retrieved chunks may include relevant pieces, but the model sometimes fails to synthesize them cohesively.
  • Oversimplified responses (Llama 3:8B, Llama 3.1:8B): When asked about algorithm-related courses or time-of-day specifics (e.g., “Are there any recommended algorithm courses?” or “Is this course in the morning or afternoon?”), both models returned vague or one-word responses like “Yes” or “Afternoon” without the supporting course titles or instructor names. These are cases of under-informative, low-effort generation, often caused by weak context matching or minimal prompting pressure.
These examples illustrate practical limitations such as vocabulary mismatches, data scope boundaries, context synthesis issues, and insufficient response richness. Addressing these will require improvements in the following:
  • Retrieval filtering (e.g., relevance thresholds; see the sketch after this list);
  • Prompt design to enforce richer output;
  • Dataset expansion to include more fields (e.g., instructor contact details, modality flags).
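As an example of the first item, retrieval filtering can be as simple as discarding low-similarity chunks before they reach the prompt. The sketch below assumes the retriever returns (chunk, score) pairs with higher scores meaning greater similarity; the threshold value and sample chunks are hypothetical.

```python
# A minimal sketch of retrieval filtering with a similarity threshold.
# The threshold and sample chunks are illustrative assumptions.
def filter_chunks(retrieved, threshold: float = 0.75):
    """Keep only retrieved chunks whose similarity score meets the threshold."""
    return [chunk for chunk, score in retrieved if score >= threshold]

retrieved = [
    ("COP 4534 Algorithm and Data Structures, MWF 9:00-9:50", 0.88),  # hypothetical chunk
    ("Campus parking regulations", 0.41),                             # off-topic, low score
]
print(filter_chunks(retrieved))   # only the course chunk passes the filter
```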

6.7. Ethical Considerations in Academic Deployment

While generative AI offers significant promise in educational settings, its use in academic advising introduces several ethical considerations that must be addressed to ensure responsible deployment.
  • Accuracy and Student Trust: Students rely on academic advising systems to make important decisions about course planning, graduation timelines, and prerequisites. Any hallucination or misinformation—even if infrequent—can result in academic setbacks or administrative errors. It is therefore essential to clearly communicate the limitations of AI-generated responses and avoid over-reliance on automated outputs.
  • Human Oversight and Transparency: RAG-based systems should be deployed as advisory tools, not standalone decision-makers. Institutions must ensure that students know when they are interacting with an AI system and encourage verification with human advisors, especially for critical academic decisions.
  • Privacy and Data Use: While our implementation does not process sensitive student information, future extensions must carefully consider FERPA and institutional data privacy policies. Systems must avoid unintended data retention or exposure, and all training or fine-tuning should use properly anonymized and authorized datasets.
Ethical deployment of LLMs in academic environments requires not only robust technical performance but also careful governance, user education, and the continuous monitoring of system behavior.
Further analysis, as shown in Table 5 (average response time for each relevance score), reveals a clear trend: response times increase with higher relevance scores across all models. This highlights the computational trade-offs between accuracy and efficiency, particularly for models with higher parameter counts. For developers and researchers working on RAG-based systems for university course catalogs, these findings emphasize the importance of balancing model size, relevance accuracy, and response time when selecting a model for real-world deployment. Smaller models like Llama 3.2:3.21B offer a good trade-off for applications requiring faster response times, while larger models like Phi-4:14.7B are better suited to cases where higher relevance is critical, albeit with a higher computational overhead.

7. Conclusions

Generative AI carries distinct risks and advantages that must be weighed before deployment. Beyond introducing new concerns, generative approaches can amplify many of the issues associated with classical machine learning. It is therefore crucial to understand these risks and to establish responsible governance standards that mitigate them before generative AI is deployed in the real world at scale.
In conclusion, the analysis of the four language models (Llama 3:8B, Llama 3.1:8B, Llama 3.2:3.21B, and Phi-4:14.7B) reveals notable differences in both relevance and response time across the range of relevance scores. As observed in Table 2 (average response time and relevance score), the smallest model, Llama 3.2:3.21B, performs efficiently in terms of response time while maintaining relatively consistent relevance scores. On the other hand, the largest model, Phi-4:14.7B, while achieving higher relevance scores, particularly in the 4–5 range, suffers from a significant increase in response times compared to the other models. This suggests that Phi-4’s larger parameter count and greater context length contribute to its higher computational cost, making it less suitable for time-sensitive applications.
Meeting the challenges presented by generative AI will require further advances in science and technology, data-specific laws and guidelines, ethical standards, and human-centered management practices. These are essential to creating an AI-powered future that is inclusive, safe, and equitable.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai6060119/s1, Data spreadsheet: model-response comparison.xlsx.

Author Contributions

Conceptualization, N.B.; methodology, N.B.; software, N.B.; validation, A.M. and N.B.; formal analysis, A.M. and N.B.; investigation, N.B.; resources, A.M. and N.B.; data curation, N.B.; writing—original draft preparation, A.M. and N.B.; writing—review and editing, A.M. and N.B.; visualization, A.M. and N.B.; supervision, A.M.; project administration, N.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data collected are provided in the “model-responses-comparison.xlsx” file under the Supplementary Materials.

Acknowledgments

The authors acknowledge the technical help and support provided by the HMCSE technological support team at the University of West Florida. During the preparation of this manuscript/study, the author(s) used Grammarly for the purposes of syntax check and synonyms. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
ANN: Approximate Nearest Neighbor
API: Application Programming Interface
BERT: Bidirectional Encoder Representations from Transformers
CRN: Course Reference Number
CSV: Comma-Separated Values
GB: Gigabyte
GenAI: Generative Artificial Intelligence
GPT: Generative Pre-trained Transformer
HTML: HyperText Markup Language
k-NN: k-Nearest Neighbors
LLM: Large Language Model
NLG: Natural Language Generation
NLI: Natural Language Inference
NLP: Natural Language Processing
PDF: Portable Document Format
QA/Q&A: Question and Answer
RAG: Retrieval-Augmented Generation

Figure 1. RAG pipeline diagram.
Figure 2. Diagram showing the detailed methodology used.
Figure 3. The data collection process followed.
Figure 4. Average response time vs. relevance score.
Table 1. Model specifications and architectural details.
Model | Parameters (Billions) | Context Length | Embedding Length | Size (GB)
Llama3:8B | 8 | 8192 | 4096 | 4.7
Llama3.1:8B | 8 | 131072 | 4096 | 4.9
Llama3.2:3.21B | 3.21 | 131072 | 3072 | 2
Phi4:14.7B | 14.7 | 16384 | 5120 | 9.1
Table 2. Model comparison by relevance score and response time.
Model | Avg Relevance Score | Avg Response Time (s) | Number of Questions
Llama3:8B | 4.2 | 14.198 | 25
Llama3.1:8B | 4.2 | 16.434 | 25
Llama3.2:3.21B | 3.88 | 7.172 | 25
Phi4:14.7B | 4.68 | 37.936 | 25
Table 3. Response time statistics of the evaluated models.
Model | Max Response Time (s) | Min Response Time (s) | Avg Response Time (s)
Llama3:8B | 20.51 | 11.52 | 14.198
Llama3.1:8B | 26.71 | 13.33 | 16.434
Llama3.2:3.21B | 12.92 | 5.89 | 7.172
Phi4:14.7B | 82.51 | 17.69 | 37.936
Table 4. Relevance score distribution by model.
Model | 0–1 Relevance | 2–3 Relevance | 4–5 Relevance
Llama3:8B | 1 | 6 | 18
Llama3.1:8B | 2 | 3 | 20
Llama3.2:3.21B | 2 | 6 | 17
Phi4:14.7B | 0 | 2 | 23
Table 5. Average response time (s) for each relevance score by model.
Relevance Score (0–5) | Llama3:8B | Llama3.1:8B | Llama3.2:3.21B | Phi4:14.7B | Grand Total
0 | – | – | 7.22 | – | 7.22
1 | 12.43 | 14.605 | – | – | 13.88
2 | 12.64 | 15.565 | 6.43 | 29.52 | 15.944
3 | 13.88 | 14.68 | 8.604 | 30.16 | 13.105
4 | 13.67 | 17.633 | 6.736 | 44.323 | 18.546
5 | 14.631 | 16.582 | 6.811 | 37.788 | 20.988
Some models do not have values for some scores.
