Article

Feasibility Evaluation of Secure Offline Large Language Models with Retrieval-Augmented Generation for CPU-Only Inference

1 Graduate School of Engineering and Management, Air Force Institute of Technology, Wright-Patterson AFB, OH 45433, USA
2 Data Analytics Certificate Program, Graduate School of Engineering and Management, Air Force Institute of Technology, Wright-Patterson AFB, OH 45433, USA
* Author to whom correspondence should be addressed.
Information 2025, 16(9), 744; https://doi.org/10.3390/info16090744
Submission received: 20 June 2025 / Revised: 1 August 2025 / Accepted: 26 August 2025 / Published: 28 August 2025

Abstract

Recent advances in large language models and retrieval-augmented generation, a method that enhances language models by integrating retrieved external documents, have created opportunities to deploy AI in secure, offline environments. This study explores the feasibility of using locally hosted, open-weight large language models with integrated retrieval-augmented generation capabilities on CPU-only hardware for tasks such as question answering and summarization. The evaluation reflects typical constraints in environments like government offices, where internet access and GPU acceleration may be restricted. Four models were tested using LocalGPT, a privacy-focused retrieval-augmented generation framework, on two consumer-grade systems: a laptop and a workstation. A technical project management textbook served as the source material. Performance was assessed using BERTScore and METEOR metrics, along with latency and response timing. All models demonstrated strong performance in direct question answering, providing accurate responses despite limited computational resources. However, summarization tasks showed greater variability, with models sometimes producing vague or incomplete outputs. The analysis also showed that quantization and hardware differences affected response time more than output quality; this is a tradeoff that should be considered in potential use cases. This study does not aim to rank models but instead highlights practical considerations in deploying large language models locally. The findings suggest that secure, CPU-only deployments are viable for structured tasks like factual retrieval, although limitations remain for more generative applications such as summarization. This feasibility-focused evaluation provides guidance for organizations seeking to use local large language models under privacy and resource constraints and lays the groundwork for future research in secure, offline AI systems.

1. Introduction

Artificial intelligence (AI) has become a pivotal topic across numerous fields, involving the development of systems that replicate human intelligence or perform tasks traditionally requiring human creativity and cognitive skills. These systems are trained on extensive and diverse datasets, enabling them to improve their accuracy and performance over time by learning from data patterns and outcomes. As a result, AI systems can make progressively more accurate decisions as they process more information during training and fine-tuning [1]. One tool driving AI advancements is the development of language models. Language models are computational systems designed to produce human-like language and developed using mathematical frameworks that include parameters to predict word sequences accurately [2]. While many are generative and designed to output sequences of text, others perform tasks like classification or retrieval using similar probabilistic frameworks. Generative models manage the relationships between words and phrases through transformers, a type of neural network architecture that encodes text into numerical representations by applying attention mechanisms to assign weighted values based on context [3]. These models generate text based on patterns learned during training. However, their responses may occasionally deviate from natural human speech due to inherent limitations in the underlying mathematical structure and training data. Continuous training and fine-tuning help improve the coherence and accuracy of generated text, making it more closely resemble human communication. Language models can also be specialized for specific domains and applications and are increasingly employed to answer a wide range of user queries and assist with diverse tasks.
Despite the rapid advancement of large language models, their use in secure, resource-constrained environments such as government offices remains limited. Many available solutions depend on cloud infrastructure or GPU-accelerated systems, which are often incompatible with government computing environments due to security, cost, or procurement restrictions. This presents a critical gap in understanding how language models can be deployed effectively on CPU-only systems that operate offline. The central problem addressed in this research is whether open-source language models, when combined with retrieval-augmented generation techniques, can support essential tasks such as question answering and summarization on commonly available hardware without compromising performance, accuracy, or data privacy.

Objectives

This study addresses the challenge by evaluating the feasibility of using various large language models on consumer-grade hardware, with a particular emphasis on models optimized for CPU-based performance rather than GPU-based systems. To achieve this goal, the following specific objectives were established:
  • Identify and apply appropriate model evaluation metrics.
  • Assess a range of models using the selected metrics.
  • Develop a methodology to test text summarization and question-answering models.
  • Develop code to test and evaluate models.
  • Recommend the most effective text summarization and question-answering models based on performance.
One way to categorize current large language model (LLM) offerings is by distinguishing between cloud-based and locally deployed models, each with unique advantages and limitations. These models have a wide range of possible applications, and their performance can be enhanced through various techniques, such as domain-specific fine-tuning and retrieval-augmented generation (RAG). Domain-specific fine-tuning involves adjusting models to perform optimally in specialized areas of knowledge, while RAG integrates external data sources during the generation process to improve the accuracy and relevance of responses. Although both methods offer significant advantages, this study focuses on RAG because it dynamically incorporates relevant external information at inference time, which can improve accuracy and reduce hallucinations without retraining or modifying the model itself.

2. Materials and Methods

The language models were selected based on their open-source availability, ability to function securely offline, and compatibility with consumer-grade hardware. The hardware used in this study includes a high-end workstation and a standard consumer laptop, selected both for their availability and for their relevance to real-world deployment scenarios. These CPU-only configurations were intentionally chosen to reflect the limitations commonly found in government office environments, where computer systems are typically not equipped with dedicated GPUs due to cost, security, or procurement constraints. Additionally, the study details the evaluation metrics, including accuracy, efficiency, and summarization capabilities, providing a comprehensive overview of how these models perform in diverse computational environments. This approach enables a practical assessment of deploying advanced language models efficiently, even in settings with limited computational resources.

2.1. Framework Selection

Five tools for hosting LLMs locally were evaluated: LocalGPT [4], PrivateGPT [5], GPT4All [6], llama.cpp [7], and Ollama [8]. Each tool was assessed on several factors: compatibility with typical government-approved operating systems (Windows and Ubuntu), the ability to function offline after initial setup, native support for RAG, and ease of use without requiring extensive programming expertise.
It is important to note that this study did not aim to build a custom retrieval-augmented generation system by separately configuring inference backends and RAG frameworks. Instead, the goal was to evaluate complete, open-source tools that could be deployed with minimal setup. While modular separation between inference engines and RAG layers is common in large-scale production environments, this study prioritized integrated solutions in order to focus on the real-world usability and performance of established open-source packages that do not require significant development effort. This approach reflects the practical needs of organizations operating in secure or resource-constrained environments, where technical staff may not have the capacity to implement or maintain custom pipeline architectures.
While llama.cpp and Ollama were both considered, ultimately, neither was selected for this study. llama.cpp is a low-level C++ implementation originally designed for running LLaMA models on consumer-grade hardware and has since expanded to support a broader range of models [7]. Ollama builds on llama.cpp and simplifies model loading and inference with additional optimizations [8]. However, neither tool provides native support for retrieval-augmented generation (RAG) pipelines. Implementing custom RAG functionality would have required extensive integration work, such as binding inference engines with separate retrieval and chunking frameworks. Because this study aimed to evaluate out-of-the-box open-source tools with minimal development overhead, llama.cpp and Ollama were excluded.
LocalGPT, developed by PromptEngineer, offers an integrated RAG setup, aligning with the project’s emphasis on practical deployment rather than system-level toolchain development. It provides efficient inference performance and native RAG capabilities, and it is compatible with both Windows and Ubuntu operating systems [4]. It also allows the integration of various language models and supports ingestion from various document types, including DOCX, PDFs, spreadsheets, and Python code. Marketed as a RAG pipeline, LocalGPT operates entirely within the local environment, ensuring that data never leaves the user’s system while maintaining performance [4]. LocalGPT supports multiple hardware configurations, including NVIDIA GPUs, Apple Silicon CPUs, and traditional x86 CPUs, making it accessible to users with limited resources. The software includes features like the ability to save chat conversations, utilize context history, and display cited sources, which are essential functions for ensuring effective RAG operation and data retrieval accuracy [4].
LocalGPT builds upon PrivateGPT by improving performance across platforms while maintaining the same focus on privacy and offline operation. It also enhances document processing capabilities with more efficient handling of larger datasets and improved RAG functionalities. In contrast, PrivateGPT lacked some of the more advanced RAG features necessary for the project’s requirements, and GPT4All showed limitations in handling document types effectively, leading to its exclusion. As a result, LocalGPT was selected because it met all the project criteria including compatibility, performance, and ease of integration into the project environment.

2.2. Hardware Selection

Although both devices used to test the language models are accessible to typical consumers, their capacities and specifications differ significantly. The first device, an HP EliteBook 840 G5, is a laptop with 16 GB of RAM and an Intel Core i7-8650U processor, commonly found in mainstream consumer laptops. This CPU operates at a base frequency of 1.90 GHz and includes four cores [9]. The selection of this device simulates the experience of a typical consumer, such as a student or researcher, who may wish to use a lightweight RAG system for private document queries. The second device, an HP Z820 Workstation, is a desktop computer equipped with 128 GB of RAM and powered by two Intel Xeon E5-2667v2 CPUs typically used in server environments. Each CPU operates at a base frequency of 3.30 GHz and features eight cores [10]. This configuration represents the higher end of consumer hardware, offering substantial processing power and speed, ideal for machine learning tasks requiring intensive computation.
Neither device employed a GPU, mirroring the hardware constraints found in many public sector or government office environments. This choice allowed for a more realistic assessment of LLM performance in secure, offline deployments where data privacy is paramount and high-end graphical processing units are often unavailable or unauthorized. It is worth mentioning that language model performance on a GPU can be up to 100 times faster than on a CPU, assuming the GPU has sufficient VRAM [11]. A minor consideration in this experiment is the operating system used on each device. One computer ran Windows 11, while the other ran Ubuntu 22.04. Nevertheless, both systems deployed a virtual environment provided by Anaconda, ensuring that any differences in performance attributable to the operating system should be negligible.

2.3. Model Selection

The selection of language models was a critical aspect of optimizing the RAG system for question-answering and summarization tasks. Several models were considered for testing. However, each selected model had to meet a few essential requirements. First, the models needed to be available for local use with accessible pre-trained weights and licensing terms that allowed evaluation and redistribution. Although the term “open source” is often used broadly, it is important to recognize that openness in AI involves multiple factors, including access to training data, fine-tuning code, documentation, and permissive licensing for commercial use. For the purposes of this study, models were classified as open source if they provided downloadable weights, community-maintained support, and licenses that permitted local deployment for research or non-commercial applications. This working definition ensured that the selected models could be integrated into an offline, CPU-only environment with transparency and reproducibility.
Second, the models needed to function securely offline. This requirement was vital for replicating secure deployment scenarios commonly encountered in defense or healthcare applications. In many of these environments, especially within government offices, internet access may be restricted or monitored, and cloud-based tools are often prohibited. Therefore, the selected models had to support isolated operation on local CPUs without sacrificing reliability or performance. Third, the models had to be lightweight enough to run efficiently on consumer-grade hardware, particularly on modest CPUs. This aspect was critical to assess the feasibility of deploying language models on commonly accessible consumer hardware. Finally, the chosen models needed to strike a balance between having sufficient parameters to ensure accurate responses and not exceeding computational resources or user response-time expectations.
LocalGPT leverages Hugging Face, a widely used open-source repository, where organizations such as Meta and Microsoft distribute their models for community use. All models tested in this research used the GPT-Generated Unified Format (GGUF), a cross-compatible file format for LLMs designed to optimize model performance on CPUs [12]. Table 1 provides an overview of the four models evaluated in this study, including their size, memory requirements, creators, and quantization details.
The first open-source model tested was Phi-3-Mini-4k-Instruct, developed by Microsoft [13]. With 3.8 billion parameters, this model is designed as a lightweight, open-source model trained on a mixture of high-quality datasets, including synthetic data and internet-sourced information, with a focus on solid reasoning capabilities [13]. Its primary use case is in memory-constrained environments where latency is a factor, making it a suitable choice for this resource-limited research [13]. The non-quantized version of the model, with a size of 7.2 GB, was selected for testing.
Another model selected was part of Meta AI’s widely used Large Language Model Meta AI (LLaMA) collection, which ranges from 1 billion to 405 billion parameters [14]. LLaMA models are designed to deliver high performance across different resource constraints, and these models achieve competitive results even when compared to much larger models like GPT-3 [15]. Given that LocalGPT is optimized for LLaMA models, testing this model was a logical choice due to its efficiency and widespread adoption. One version tested, Llama 2 Chat 7B [16], was quantized to 4 bits to reduce memory usage and computational requirements [17]. The GGUF was used for seamless integration and was distributed by a Hugging Face user known as “The Bloke” [18]. The Llama 2 Chat 13B model [19], with 13 billion parameters, was also tested to evaluate the impact of model size on accuracy. The 4-bit quantized version of this model was used as recommended.
The final model tested was Mistral 7B, developed by Mistral AI [20]. Known for its high performance and efficiency, Mistral 7B is engineered to accelerate inference speed and reduce memory requirements, making it an excellent candidate for this research [21]. This model was also quantized to 4 bits to optimize its performance in resource-constrained environments.

2.4. Resource Selection

The selection of the source material for testing the language models was based on several important considerations. First, the source material needed to be in a widely accessible format, such as PDF, to ensure compatibility with the RAG pipeline’s document ingestion process. Additionally, the source had to contain clear, well-defined information that the selected language models could search for and retrieve. The material also needed to be sufficiently large to support multiple queries based on the content. Finally, the source had to primarily consist of text, as text-heavy content would allow for more efficient testing of smaller language models, which might struggle with complex visual data like figures or charts.
Given these criteria, a college textbook was deemed the most suitable choice, as textbooks typically provide clear explanations and contain extensive amounts of textual information. A textbook with minimal reliance on figures was also prioritized to avoid confusing smaller models. The selected textbook was Project Management for Engineering, Business, and Technology (5th Edition) by John M. Nicholas and Herman Steyn [22]. This 1389-page textbook offers comprehensive coverage of business and management concepts, particularly for individuals starting careers as managers and leaders in STEM fields [22]. Its clear and structured content made it an ideal resource for testing the selected LLMs, allowing them to work efficiently across each chapter. The textbook was chosen not only for its clarity and extensive textual content, but also because the research team had prior experience with the material, and it was already available in a format compatible with the pipeline. In addition, its content could be referenced in a reproducible manner without legal or ethical ambiguity, making it a practical and academically appropriate choice for this study. Importantly, while this textbook was selected specifically for its structural applicability to this research, any textbook or document meeting similar criteria could be used for replication, provided that proper considerations for publishing and copyright compliance are observed.

2.5. Question Selection

Based on the RAG-accessible textbook, queries were developed to evaluate the performance of each language model in question-answering and summarization tasks. The questions were authored by two members of the research team, based on their experiences with the textbook in coursework and assignments during their undergraduate studies. This approach ensured the inclusion of academically relevant questions reflective of real-world classroom use cases.
The questions were designed to simulate both direct and indirect queries based on the source material, reflecting the variety of inquiries a typical user might pose to a language model. Additionally, the questions were distributed across different sections of the book, ensuring that the models were tested on a broad range of topics and challenged with diverse content. The set included a mix of direct factual queries and more conceptual prompts to assess the models’ ability to retrieve specific information and synthesize broader themes using retrieval-augmented generation.
The first five questions are direct, requiring straightforward, factually correct answers. The final five questions consist of two parts: the first part asks for the chapter’s name, serving as a more challenging test of the model’s accuracy. The second part evaluates the model’s summarization capabilities by requiring a concise summary of a specified chapter’s content. Overall, the questions were crafted to assess both the conventional capabilities of the selected LLMs to retrieve precise answers and their effectiveness in condensing and summarizing information. The questions and their corresponding references are detailed in Table 2.

2.6. Metric Selection

Metrics are critical in evaluating the effectiveness of language models, particularly when comparing their performance. This research focuses on measuring the capabilities of a RAG application to provide accurate answers and summaries while also promoting efficient, concise responses. Additionally, speed metrics play an important role, offering insights into the performance of LLMs across different types of hardware.
Latency, an essential aspect of LLM inference performance, helps determine how effectively a model operates in real-world scenarios by measuring how quickly it can deliver responses [23]. As this research is not focused on throughput (the number of requests a model can handle simultaneously), the speed evaluation considers latency alone. Latency was measured in three parts: (1) the time taken to process a prompt; (2) the time taken to generate a response; and (3) per-token latency, calculated as total time divided by the number of tokens produced. These metrics offer insights into how factors like the number of parameters and level of quantization impact model efficiency. LocalGPT recorded all latency data, facilitating tracking of results.
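As a minimal sketch of how these three components relate, the function below derives them from wall-clock timestamps around the two inference phases. The function and callback names are illustrative assumptions, not LocalGPT's own instrumentation, which records these timings internally.

```python
import time

def measure_latency(process_prompt, generate_response, prompt):
    """Time the two inference phases and derive per-token latency.

    `process_prompt` and `generate_response` stand in for the
    framework's prompt-processing and token-generation steps.
    """
    t0 = time.perf_counter()
    process_prompt(prompt)              # (1) prompt-processing time
    t1 = time.perf_counter()
    tokens = generate_response(prompt)  # (2) response-generation time
    t2 = time.perf_counter()

    prompt_time = t1 - t0
    generation_time = t2 - t1
    # (3) per-token latency: total elapsed time / tokens produced
    per_token = (t2 - t0) / max(len(tokens), 1)
    return prompt_time, generation_time, per_token
```

In practice the two phases scale differently: prompt processing grows with the length of the retrieved context, while generation time grows with the length of the answer.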
To assess the quality of model outputs, both statistical and model-based evaluation metrics were used. Statistical methods compare a reference passage to the generated text using metrics such as precision and recall. Common tools like Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [24], Bilingual Evaluation Understudy (BLEU) [25], and F1 score are widely adopted but often fail to capture nuanced semantic differences [26]. To address these limitations, this study selected Metric for Evaluation of Translation with Explicit Ordering (METEOR) and BERTScore for primary use.
METEOR improves upon traditional statistical methods by normalizing words to their base form and using a database of synonyms, reducing the penalties for minor word variations [27]. However, as a statistical method, it may still penalize models that do not perfectly match the reference text, even when the semantic meaning matches. BERTScore leverages pre-trained language models and computes the cosine similarity between the contextual embeddings of words in the reference and generated texts [28]. This approach allows for better recognition of contextual meaning, reducing penalties for word choice variations while incorporating classical statistical measures like recall and F1 score. Together, these two metrics provide a comprehensive evaluation, balancing the precision of statistical methods with the contextual understanding offered by model-based approaches.
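To make the BERTScore mechanics concrete, the sketch below performs greedy cosine-similarity matching between two sets of token embeddings, which is the core of BERTScore's recall computation: each reference token is matched to its most similar candidate token and the similarities are averaged. The real metric uses contextual embeddings from a pre-trained transformer; the two-dimensional toy vectors here are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_bert_recall(ref_embs, cand_embs):
    """BERTScore-style recall: average over reference tokens of the
    best cosine similarity to any candidate token."""
    return sum(
        max(cosine(r, c) for c in cand_embs) for r in ref_embs
    ) / len(ref_embs)

# Toy embeddings: 3-token reference, 2-token candidate
ref = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
cand = [[1.0, 0.0], [0.0, 1.0]]
score = greedy_bert_recall(ref, cand)
```

Precision is computed symmetrically (matching candidate tokens against the reference), and the reported F1 combines the two; this is why BERTScore tolerates paraphrase better than exact-match metrics.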
Table 3 illustrates how different sentence variations affect both METEOR and BERTScore. It should be noted that neither metric performs particularly well at identifying the negated statement, as seen in the table. Despite this limitation, the issue is unlikely to influence the results of the project, as the types of questions being evaluated do not produce this scenario.
These metrics were applied by measuring the similarity of each model response to a reference excerpt. The reference excerpts for the first five questions (Q1–5) are from specific pages of the textbook and are provided on the right side of Appendix A Table A1. The excerpts for the remaining five questions (Q6–10) were generated by a state-of-the-art cloud-based model (ChatGPT-4o) with access to the same textbook. These excerpts are shown on the right side of Appendix A Table A2. A faculty member with extensive project management experience reviewed all excerpts for accuracy and relevance.

2.7. Implementation

The implementation process began with the setup of all necessary software components. LocalGPT, an open-source RAG framework, was used to manage document ingestion, vectorization, and inference. The project repository was cloned from https://github.com/PromtEngineer/localGPT (accessed on 15 July 2024) and configured in a Python 3.10 Conda environment on both Windows 11 and Ubuntu 22.04 systems. Required dependencies were installed using the requirements.txt file provided in the repository. LocalGPT was run in CPU-only mode, with GGUF-format models placed in the designated model directory. No additional components were added or modified beyond what was required to ingest and query the study’s source material. The RAG system followed LocalGPT’s default configuration, which includes FAISS as the vector database, cosine similarity for retrieval, and a sliding window chunking strategy with a chunk size of 512 tokens and an overlap of 64 tokens. Once the environment was prepared, the source textbook was ingested and the evaluation questions were finalized. Formal testing was performed using the four models shown in Table 1, focused on performance and resource efficiency. Each model was given the questions in Table 2 ten times, metrics were calculated, and statistics were recorded.
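The sliding-window chunking strategy described above can be sketched as follows. This is a simplified token-count approximation using LocalGPT's default parameters (chunk size 512, overlap 64), not a verbatim excerpt of its splitter.

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into overlapping chunks, advancing by
    (chunk_size - overlap) tokens per step, so consecutive chunks
    share `overlap` tokens of context."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap ensures that a sentence falling on a chunk boundary still appears intact in at least one chunk, which matters for retrieval accuracy when a question targets text near a boundary.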
During testing, no other applications or background processes were run to preserve timing accuracy. Output was logged from the terminal and saved as .txt files. A Jupyter notebook was used to organize results and extract response and timing data using Python libraries including Pandas (version 2.2.2) and Matplotlib (version 3.9.1). Model outputs were scored using METEOR and BERTScore against the reference excerpts shown in Appendix A. All evaluations were performed using automated metrics; no manual grading was conducted.
All models were executed via the LocalGPT interface using llama-cpp-python as the inference backend. Inference was performed in CPU-only mode with default multi-threading enabled; no manual tuning of parallelism parameters (e.g., thread count or batching) was applied. Additionally, all models were evaluated using identical inference parameters: temperature = 0.2 and repetition_penalty = 1.15. Questions were submitted as individual prompts in a fixed order during a single session for each model, reflecting typical offline usage scenarios. This design ensured consistency and reduced variability, although fixed ordering and same-session prompting may introduce minor order effects. These considerations are addressed in the Section 4.
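For illustration, a repetition penalty such as the 1.15 used here is commonly applied in llama.cpp-style samplers by rescaling the logits of previously generated tokens before sampling. The sketch below follows the widely used CTRL-style formulation (divide positive logits by the penalty, multiply negative ones); this is an assumption about the backend's internals for explanatory purposes, not verified code from llama-cpp-python.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.15):
    """Discourage repetition by penalizing tokens already generated.

    With penalty > 1, a previously seen token's logit is pulled
    toward (or further below) zero, lowering its sampling odds.
    """
    out = list(logits)
    for t in set(seen_token_ids):
        if out[t] > 0:
            out[t] /= penalty   # shrink positive logits
        else:
            out[t] *= penalty   # push negative logits further down
    return out
```

A low temperature (0.2) then sharpens the resulting distribution toward the highest remaining logits, which suits factual retrieval tasks where deterministic, source-grounded answers are preferred over varied phrasing.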

3. Results

The results here reflect one complete evaluation cycle for each language model across two hardware platforms. All models were tested using a fixed set of ten questions, with responses scored for answer quality using the METEOR and BERTScore metrics, which capture lexical and contextual similarity, respectively. This dual-reference approach allowed for evaluating factual accuracy against source material and estimating summarization quality against a strong commercial benchmark.
Multiple trial runs were conducted to ensure the testing environment was stable and consistent. While the processing times for generating results may vary slightly even on the same device, the scoring remained consistent across all tests. Overall, the results indicate that specific models demonstrated a superior ability to answer the questions, and the workstation, equipped with higher-performance hardware, generated responses significantly faster than the laptop.
While Section 3.1 and Section 3.2 present observable differences in model behavior across tasks and hardware configurations, the intent of this study is not to rank models by performance. Instead, these results offer practical insight into how well various open-source models function under CPU-only, offline conditions using retrieval-augmented generation. The evaluation highlights strengths and limitations relevant to real-world deployment, supporting the central aim of this feasibility study: determining whether locally hosted language models can provide reliable answers and summaries in secure, resource-constrained environments.

3.1. Workstation Results

Figure 1 presents the BERTScore distributions for each model across the ten evaluation questions on the workstation. Ten boxplots are generated for each model, representing questions 1–10 with question 1 on the left side. A horizontal dashed line shows the mean of means for each model as an overall comparison metric. Model 1 (Phi-3-Mini) achieved noticeably lower overall semantic similarity, though within one standard deviation of the other models. Its summarization scores were more stable but remained lower in magnitude compared to other models. Models 2 and 4 (LLaMA 2 Chat 7B and 13B) each produced higher overall averages, centered around 0.89, and exhibited similar performance patterns consistent with their shared architecture. Model 3 (Mistral 7B) exhibited high variability in performance across questions, indicating inconsistent semantic alignment. Model 4 (LLaMA 2 Chat 13B), while maintaining a strong average, showed greater variability across nearly all prompts. These patterns suggest that Model 2 was the most reliable overall, particularly in direct question answering. In all models, Question 10 stood out as the strongest-performing item, indicating it was the most easily understood and accurately reproduced prompt in the dataset.
Figure 2 presents METEOR scores across all ten questions on the workstation, highlighting differences in model behavior between direct question answering (Q1–5) and summarization tasks (Q6–10). For all models, direct QA tasks generally yielded higher scores than summarization. Model 1 (Phi-3-Mini) showed highly consistent but lower-scoring performance, with direct QA scores consistently above its model average and summarization scores consistently below it. Models 2 and 4 (LLaMA 2 Chat 7B and 13B) followed a similar pattern, with higher average scores on direct questions and greater variability in those responses compared to their summarization outputs. Model 3 (Mistral 7B) also demonstrated this split, with four of the five direct QA scores above their respective model averages, while all summarization scores fell below. Across all models, Questions 3 and 4 emerged as the most successful direct-answer prompts, consistently yielding the highest METEOR scores. Among the summarization tasks, Question 10 received the highest scores for Models 1 and 3, while Question 7 was the top performer for Models 2 and 4.

3.2. Laptop Results

Figure 3 presents the BERTScore distributions for each model on the laptop across the ten evaluation questions, separated into direct question answering (Q1–5) and summarization tasks (Q6–10). As in the workstation results, Model 1 (Phi-3-Mini) had the lowest overall semantic similarity, while Models 2 and 4 (LLaMA 2 Chat 7B and 13B) achieved the highest averages. Model 3 (Mistral 7B) scored slightly below Models 2 and 4 but still performed markedly better than Model 1. In Model 1’s results, three of the five direct QA scores were above the model average, whereas most summarization scores were below it. Models 2 and 4 showed strong performance on three direct QA questions, although one was noticeably lower than their averages. Among the summarization tasks, four fell below the model averages, while one stood out positively. Model 3 exhibited similar trends: four of the direct QA scores were above the average, one was markedly lower, and four of the summarization scores were below average, with one outperforming. This model also showed high variability across questions. These findings are consistent with the workstation analysis and reinforce the observation that summarization tasks are more challenging across all models, with lower alignment to the reference outputs despite often more consistent scoring.
Figure 4 displays the METEOR scores for each model on the laptop, separated into direct question answering (Q1–5) and summarization tasks (Q6–10). Models 2 and 4 (LLaMA 2 Chat 7B and 13B) achieved the highest overall averages, while Models 1 and 3 scored lower. For Model 1 (Phi-3-Mini), all five direct QA scores were above the model average, and all five summarization scores were below it. The results were relatively consistent, with only one direct QA question and one summarization question showing noticeable variability. Models 2 and 4 followed a similar pattern, with all direct QA scores above average and all summarization scores below, though each showed slightly wider variation on one or two factual prompts. Model 3 (Mistral 7B) also demonstrated this trend: four of five direct QA scores were above the model average, one was close to the average, and all summarization scores were below it. Variability was more pronounced in the direct QA results, while summarization scores were generally lower and more consistent. These results are consistent with the workstation findings and further support the observation that METEOR favors structured, fact-based outputs and tends to under-represent the semantic quality of longer, generative summaries.

3.3. Latency Analysis

This section analyzes the time each model took to process a prompt and generate a response, measured both as total time and as per-token latency, along with the time required to load each model. Table 4 presents prompt processing time, generation time, and per-token latency for each model. The mean and standard deviation are calculated from the 10 iterations of the 10 questions for each model, providing insight into central tendency and stability. As expected, the workstation consistently outperformed the laptop across all metrics, reflecting its greater processing capability.
Model 1 exhibited a large per-token time despite having a lower parameter count; this is likely due to the absence of quantization. Model 1 runs at full precision, compared to 4-bit quantization for the other models, so it has roughly a factor-of-four larger footprint in RAM. Once this is taken into account, generation time appears to scale linearly with the model sizes listed in Table 1 (7.2 GB for Model 1 versus 7.9 GB for Model 4). Model 1 showed the highest generation times across both hardware platforms, exceeding the linear relationship observed for per-token time. The full-precision Model 1 also had slightly lower performance on the similarity metrics, which may result from the model’s smaller parameter count or from how it interpreted specific prompts, rather than from the absence of quantization. This comparison suggests that while quantization has a significant impact on inference speed in CPU-only settings, accuracy is shaped by a combination of factors including model architecture, scale, and prompt formulation.
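The per-token figures in Table 4 can be derived from raw timing logs with a simple aggregation. The sketch below uses invented timings and illustrative field names, not the actual LocalGPT log format:

```python
import statistics

def latency_stats(runs):
    """Summarize generation latency across repeated runs.

    Each run is a (generation_seconds, tokens_generated) pair, as might be
    parsed from an inference framework's timing output. Returns the mean and
    standard deviation of total generation time and of per-token latency.
    """
    totals = [t for t, _ in runs]
    per_token = [t / n for t, n in runs]
    return {
        "gen_mean_s": statistics.mean(totals),
        "gen_sd_s": statistics.stdev(totals),
        "per_token_mean_s": statistics.mean(per_token),
        "per_token_sd_s": statistics.stdev(per_token),
    }

# Invented timings: three runs of (seconds, tokens generated).
runs = [(120.0, 300), (150.0, 300), (135.0, 270)]
stats = latency_stats(runs)
print(stats)
```

Separating total time from per-token latency, as in Table 4, distinguishes slow generation (long outputs) from slow inference (expensive tokens), which is the distinction that exposed the quantization effect discussed above.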
In addition to performance metrics, model initialization times were also measured. These represent the duration required to load each model into memory using LocalGPT and help characterize how hardware and model size impact inference readiness. Table 5 provides a comparison of load times across both hardware configurations.
For the two mid-sized models, load times were relatively similar. As expected, the load time for the largest model, which has roughly double the parameter count, was approximately twice as long. On the laptop, shown on the right side of Table 5, there was no significant difference in load times among Models 1–3; the laptop’s limited hardware capacity, combined with CPU-only inference, likely explains this behavior.

4. Discussion

Several inferences can be drawn from the results of these simulations, and a notable aspect is the effectiveness of different evaluation metrics. As discussed earlier, the METEOR metric proved inconsistent in evaluating summarization accuracy, primarily due to its lack of contextual understanding and reliance on lexical overlap. This inconsistency was evident in the lower and more variable METEOR scores on Q6–10. While METEOR offered some insight into performance trends, it did not reliably distinguish among models on complex tasks. By contrast, BERTScore provided more stable and interpretable results, particularly for factual question answering (Q1–5). In the BERTScore analysis, Question 10 consistently yielded the highest scores across all models, further indicating that it was the most accessible summarization prompt.
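METEOR's sensitivity to lexical overlap can be illustrated with a stripped-down, unigram-only version of its F-mean. This sketch deliberately omits METEOR's stemming, synonym matching, and fragmentation penalty, and the sentences are invented for illustration; it shows only why a faithful paraphrase scores poorly under overlap-based matching:

```python
def unigram_fscore(candidate, reference, alpha=0.9):
    """Recall-weighted harmonic mean of unigram precision and recall,
    echoing METEOR's Fmean but without stemming, synonymy, or the
    fragmentation penalty. Illustrative only."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    matches = 0
    pool = list(ref)  # each reference token may be matched at most once
    for tok in cand:
        if tok in pool:
            pool.remove(tok)
            matches += 1
    if matches == 0:
        return 0.0
    p = matches / len(cand)
    r = matches / len(ref)
    return p * r / (alpha * p + (1 - alpha) * r)

reference = "the project life cycle spans conception definition and execution"
verbatim = "the project life cycle spans conception definition and execution"
paraphrase = "a project moves through three stages before the system becomes operational"

print(unigram_fscore(verbatim, reference))    # identical wording scores highly
print(unigram_fscore(paraphrase, reference))  # a faithful paraphrase scores far lower
```

A generative summary behaves like the paraphrase case: it can be semantically accurate while sharing few surface tokens with the reference, which is why BERTScore's embedding-based comparison proved more stable on Q6–10.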
Due to the integration of RAG, which allowed each model to extract relevant information directly from the textbook, all models demonstrated strong performance on direct question answering tasks (Q1–5). This was reflected in high BERTScore and METEOR values when responses were compared to direct textbook excerpts as reference answers. The LLaMA models in particular showed consistent semantic alignment across all factual questions, with Model 2 achieving the highest average BERTScore and METEOR performance on both hardware platforms. These results indicate that, when supported by reliable document retrieval, even smaller language models running on consumer-grade CPUs can produce accurate, consistent, and contextually aligned answers to direct questions.
However, summarization posed a greater challenge. While the models could locate relevant information, they frequently failed to generate coherent or complete chapter summaries. Several responses were vague, omitted required elements such as chapter titles, or declined to provide summaries altogether. In many of these cases, the output logs showed that the retrieval process returned chunks that did not match the query or failed to return any chunks at all. These issues suggest that limited context windows and memory constraints reduced the models’ ability to handle long, generative tasks effectively. While RAG improved factual retrieval, it did not compensate for broader summarization limitations.
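The retrieval failures seen in the logs can be reasoned about with a minimal top-k similarity search. This sketch uses toy 3-dimensional vectors rather than real sentence embeddings, and the chunk names, k, and threshold are assumptions for illustration, not LocalGPT's actual configuration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, chunks, k=2, min_sim=0.5):
    """Rank chunk embeddings by similarity to the query and keep the
    top k that clear a similarity floor. A real pipeline would embed
    text with a sentence-embedding model and query a vector store."""
    scored = sorted(
        ((cosine(query_vec, vec), name) for name, vec in chunks),
        reverse=True,
    )
    return [(round(s, 3), name) for s, name in scored[:k] if s >= min_sim]

chunks = [
    ("ch2_systems_approach", [0.9, 0.1, 0.0]),
    ("ch10_risk",            [0.1, 0.9, 0.1]),
    ("ch19_international",   [0.0, 0.2, 0.9]),
]

# A narrow factual query aligns strongly with one chunk and retrieves it...
print(retrieve([0.8, 0.2, 0.1], chunks))
# ...while a broad summarization-style query may clear the floor for nothing,
# mirroring the empty-retrieval cases observed in the output logs.
print(retrieve([-0.5, -0.5, -0.5], chunks))
```

This is one plausible mechanism for the split observed above: narrow factual questions map cleanly onto individual chunks, whereas a whole-chapter summary request has no single chunk that matches it well.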
Another finding relates to the relationship between speed and accuracy on the workstation and the laptop: as expected, the difference in hardware did not affect the quality of the language models’ responses. However, the available computational resources influenced the time required to load the models, process the prompts, and generate responses. For example, the average prompt processing time on the laptop was approximately four times longer than on the workstation (88 s versus 23 s). Across all models, the laptop’s evaluation times were 3 to 6 times slower, an important consideration when developing applications for different use cases. While results from the workstation and laptop are presented separately, the purpose is to illustrate performance variability across two representative CPU-only environments, not to generalize across all possible hardware configurations.
Two limitations in the evaluation design should be noted. First, all questions were submitted in a fixed order during a single session for each model, which may introduce minor order effects or session-level context influence. While this approach reflects common usage patterns for offline deployments, it does not eliminate the possibility of sequencing bias. Second, a project management professional evaluated all prompts, reference excerpts, and one sample exam and judged them to be reasonable; however, they did not review all 100 responses from each of the four models, which were instead scored with text similarity metrics. As a result, interpretability may be limited in edge cases where nuanced human judgment would typically be required.
The primary goal of this study was to evaluate the feasibility of running open-source LLMs with RAG on CPU-only systems in secure, offline environments, not to establish statistically significant differences between models. The observations throughout Section 4 contribute to this aim by identifying which implementation conditions, hardware configurations, and model characteristics influence practical usability and performance. While some findings reflect narrow differences, these details inform deployment decisions where trade-offs between speed, accuracy, and resource usage are critical. For instance, the variability in summarization quality and timing behavior across models reveals potential limitations for certain applications, while the consistent success in direct question answering reinforces confidence in using these systems for structured query tasks. Collectively, these results offer guidance for selecting models under real-world constraints, even if broader theoretical generalizations are beyond the scope of this feasibility-focused evaluation.

5. Conclusions

This study assessed the feasibility and performance of locally hosted large language models using retrieval-augmented generation on consumer-grade hardware for question-answering and summarization tasks. By employing LocalGPT as an interface, four open-source models were evaluated: Phi-3-Mini-4k-Instruct, Llama 2 Chat 7B, Mistral 7B Instruct v0.2, and Llama 2 Chat 13B, on both a high-powered workstation and a standard consumer-grade laptop. The findings demonstrate that modestly sized, locally hosted LLMs can provide accurate and semantically aligned answers to direct factual questions when integrated with RAG, even on devices with limited computational resources. While models differed in timing efficiency, output variability, and alignment with reference answers, all exhibited baseline competency in factual retrieval and question answering under CPU-only constraints. With a focus on identifying practical constraints, benefits, and trade-offs relevant to real-world deployment scenarios, these findings underscore the practical viability of deploying LLMs offline for structured query tasks in resource-limited environments.
Despite their strengths in direct factual answering, the models showed limitations in summarization tasks, particularly when required to condense entire chapters from the source material. The summarization capabilities were inconsistent, and the models often failed to produce accurate or comprehensive summaries. This suggests that while RAG enhances the models’ ability to retrieve specific information, processing and summarizing large text volumes remains challenging for smaller LLMs in resource-constrained environments.
The scope of this research was constrained by several practical factors. Only CPU-based inference was explored, which, while aligned with typical government and resource-restricted environments, does not capture the full performance potential of these models with GPU acceleration. The models selected were intentionally smaller and older, prioritizing known stability, community support, and compatibility with offline environments. This decision limited the evaluation of newer, potentially more capable models. Furthermore, the study relied on a single source document and a fixed set of 10 questions, which may not generalize to all use cases. Restricted internet access, while necessary for simulating secure deployments, also prevented integration with more dynamic RAG configurations that rely on live document retrieval.
Based on the results, the deployment of small to mid-sized open-source LLMs is feasible for secure, offline environments where direct question answering is the primary need. For organizations with limited computing infrastructure, especially those relying on CPU-only systems, quantized models and efficient RAG pipelines like LocalGPT present practical utility. However, caution is advised when using these models for complex summarization tasks unless fine-tuning or supplementary processing is applied. Future research may benefit from evaluating broader document sets, more diverse query types, and newer model architectures. Incorporating GPU-based testing would also provide a more complete performance profile for organizations with greater hardware flexibility. To the authors’ knowledge, this is one of the first empirical studies to evaluate open-source RAG pipelines for local LLM inference in offline or privacy-sensitive deployments.
This study offers a practical roadmap for organizations, particularly in the public sector, that seek to adopt language model technologies without compromising data security. It demonstrates that CPU-only systems are capable of supporting useful LLM applications without the need for expensive hardware upgrades. Additionally, it supports the use of locally hosted models as a way to reduce privacy risks associated with cloud-based AI platforms, providing a secure alternative for processing sensitive or proprietary information. These findings may help guide procurement, deployment, and policy decisions in environments where secure, autonomous AI systems are needed.

Author Contributions

Conceptualization, C.G., A.S., T.W. and B.L.; Formal analysis, C.G. and A.S.; Investigation, C.G. and A.S.; Methodology, C.G., A.S. and E.T.; Project administration, E.T., T.W. and B.L.; Writing—original draft, C.G., A.S. and E.T.; Writing—review & editing, E.T., T.W. and B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The textbook used in this work is publicly available [24].

Conflicts of Interest

The authors declare no conflicts of interest. The views expressed are those of the authors and do not reflect the official guidance or position of the United States Government, the Department of Defense, the United States Air Force, the United States Space Force or any agency thereof. Reference to specific commercial products does not constitute or imply its endorsement, recommendation, or favoring by the U.S. Government. The authors declare this is a work of the U.S. Government and is not subject to copyright protections in the United States. This article has been cleared with case numbers MSC/PA-2024-0263 and 88ABW-2024-0902.

Appendix A

Table A1 and Table A2 contain the data for one of the exams taken by Llama 2 Chat 7B (Model 2), including each question, the model’s response, and the reference text used as the comparison when calculating the metrics in the right-hand column: a textbook excerpt for Questions 1–5 and a ChatGPT 4o response for Questions 6–10.
Table A1. Questions 1–5: A sample graded exam taken by Llama 2 Chat 7B, with questions, model response, answers and metrics.
Question | Llama 2 Chat 7B (Model 2) Response | Textbook Excerpt for Comparison [22] | Metrics (BERTScore/METEOR)
1. What are the five functions of managers?
Based on the provided context, the five functions of managers are:
  • Planning: This function involves setting a purpose or goal and establishing the means for achieving it consistent with higher-level organizational goals, resources, and constraints in the environment.
  • Organizing: This function involves hiring, training, and gathering people into a team with specified authority, responsibility, and accountability relationships; acquiring and allocating materials, capital, and other resources; and creating an organization structure with policies, procedures, reporting patterns, and communication channels.
  • Directing and Motivating: This function involves leading and guiding people to attain the goal, including communicating expectations, providing feedback, and building team morale.
  • Controlling: This function involves monitoring work performance with respect to the goal and taking necessary action whenever work deviates from the goal.
  • Assessing: This function involves evaluating the effectiveness of the first four functions and determining whether the goals or the functions need to be changed.
It’s important to note that these functions are not mutually exclusive, and managers may perform them in different combinations depending on the situation. Additionally, the specific responsibilities of managers can vary greatly based on their functional area and managerial level within an organization.
(page 70) The activities of managers, including project managers, can be classified into the five functions. First is deciding what has to be done and how it will be done. This is the planning function, which involves setting a purpose or goal and establishing the means for achieving it consistent with higher-level organizational goals, resources, and constraints in the environment. Second and related to planning is arranging for the work to be done; this is the organizing function. This involves (1) hiring, training, and gathering people into a team with specified authority, responsibility, and accountability relationships; (2) acquiring and allocating materials, capital, and other resources; and (3) creating an organization structure with policies, procedures, reporting patterns, and communication channels. Third is directing and motivating people to attain the goal. This is the leadership function. Fourth is monitoring work performance with respect to the goal and taking necessary action whenever work deviates from the goal; this is the control function. The four functions are aimed at the goal, which implies a fifth function: assessing the four functions to determine how well each of the functions is doing and whether the functions or the goals need to be changed.
0.9053/0.5509
2. What are the four phases of the systems development lifecycle?
Based on the provided context, the four phases of the systems development lifecycle are:
  • Conception phase (Phase A)—This phase involves formulating the problem, defining needs and user requirements, evaluating alternative solutions, and preparing a proposal to conduct the project.
  • Definition phase (Phase B)—In this phase, the contractor begins a detailed analysis of the system concept, defines requirements the system must fulfill, and determines the major subsystems, components, and support systems of the proposed system.
  • Execution phase (Phase C)—This phase involves the design, build, and test phases of the project. The major tasks in this phase are to design the system, assemble its components, and conduct acceptance tests to ensure that the system meets the requirements defined in Phase B.
  • Operation phase (Phase D)—In this phase, the system is put into operation, and maintenance and support activities are conducted to ensure the system operates effectively and efficiently. These four phases constitute the project life cycle, with Phases A, B, and C making up the project life cycle, and Phase D representing the operational phase of the system.
(page 170) The project life cycle is part of a larger life cycle called the systems development cycle (SDC). Virtually all human-made systems follow the four phases of this cycle:
  • Conception phase (Phase A)
  • Definition phase (Phase B)
  • Execution phase (Phase C)
  • Operation phase (Phase D)
The project life cycle typically spans Phases A, B, and C—conception, definition, and execution. When Phase C ends, so does the project. At that point the system enters Phase D, operation; the system transits from being the end-item of a project to an operational entity.
0.8783/0.4179
3. What are the four costs of quality?
According to the given context, the four costs of quality are:
  • Prevention: costs of training, design reviews, and activities aimed at preventing errors; includes cost of quality planning.
  • Appraisal and control: costs of evaluating products and processes, including product reviews, audits, tests, and inspections.
  • Internal failure: costs associated with nonconformities discovered by the producer; includes costs for scrap, rework, and retest.
  • External failure: costs incurred from product failures after delivery to the customer; including costs for replacements, warranty repairs, liability, lost sales, and damaged reputation.
(page 650) The costs of quality are classified as prevention, appraisal and control (costs of conformance), internal failure, and external failure (costs of nonconformance):
  • Prevention: costs of training, design reviews, and activities aimed at preventing errors; includes cost of quality planning.
  • Appraisal and control: costs of evaluating products and processes, including product reviews, audits, tests, and inspections.
  • Internal failure: costs associated with nonconformities discovered by the producer; includes costs for scrap, rework, and retest.
  • External failure: costs incurred as a result of product failures after delivery to the customer; includes costs for replacements, warranty repairs, liability, lost sales, and damaged reputation.
0.9704/0.8134
4. What are the five responsibilities of the project engineer?
Based on the provided context, the five responsibilities of the project engineer are:
  • Oversees product or system design and development.
  • Translates performance requirements into design requirements.
  • Coordinates and directs the work of the functional areas and subcontractors.
  • Plans, monitors, evaluates, and documents progress in the design and testing of subsystems and the overall system.
  • Oversees configuration management and the change control system.
(page 1047) The project engineer (also known as systems engineer or systems designer) shoulders responsibility for coordinating technological areas and assures integrated design of the end-item. When several functional areas or subcontractors are involved, the project engineer:
  • Oversees product or system design and development.
  • Translates performance requirements into design requirements.
  • Coordinates and directs the work of the functional areas and subcontractors.
  • Plans, monitors, evaluates, and documents progress in the design and testing of subsystems and the overall system.
  • Oversees configuration management and the change control system.
0.9564/0.7493
5. What are the three common types of programs?
Based on the provided context, the three common types of programs are:
  • Goal-Oriented Programs: These programs are designed to implement an organizational strategy or change, or develop and implement a new application or technology. They coordinate projects and other activities to achieve overarching benefits tied to business strategies and broad organizational goals. Examples include the green vehicle program and the Cosmic Mercury Exploration program in Case 17.4.
  • Improvement Programs: These programs provide regular enhancements to existing systems, processes, or infrastructure through advances provided by individual projects. They serve as a framework for dealing with requests from throughout the organization for added functionality, capacity, or performance—even maintenance. The construction company mentioned in the context is an example of such a program.
  • Portfolio Programs: These programs are classified based on how they are initiated. Some programs are created around a clear strategy, while others “emerge” when someone recognizes that pre-existing projects could be better managed if they were organized and coordinated. If the projects are largely independent, they might be grouped into a portfolio; if they are related and contribute to a greater purpose, they might be grouped into an improvement-related program. In summary, these three types of programs prioritize and coordinate sets of projects and other activities to meet program goals and attain benefits not achievable from any single project.
(page 1161) Among common types of programs are goal-oriented, improvement, and portfolio.
  • A goal-oriented program is a group of projects and other activities that, combined, implement an organizational strategy or change, or develop and implement a new application or technology…
  • An improvement program provides regular enhancements to existing systems, processes, or infrastructure through advances provided by individual projects…
  • A portfolio program is a group of projects that are otherwise independent but share something such as resources or technology. The purpose of the program is to coordinate the projects vis-a-vis each other, allocate shared resources, or consolidate procedures so as to improve performance of the overall set of projects…
0.8853/0.4704
Table A2. Questions 6–10: A sample graded exam taken by Llama 2 Chat 7B, with questions, model response, answers and metrics.
Question | Llama 2 Chat 7B (Model 2) Response | Excerpt for Comparison (ChatGPT 4o) | Metrics (BERTScore/METEOR)
6. What is the name of Chapter 2 of the project management book and make a 100-word summary of the chapter?
Chapter 2 provides an overview of project management principles, systems methodologies, and the systems approach. It covers the origins and concepts of project management, situations where it is needed, and examples of applications. The chapter also introduces the idea that project management is a philosophy that underlies all human-made systems, from simple tools to complex organizations.
Chapter 2 of ‘Project Management for Engineering, Business, and Technology’ discusses the systems approach, viewing projects as systems comprising people, equipment, materials, and facilities organized to achieve goals. This chapter introduces the systems approach and its methodologies, including systems analysis and systems engineering, crucial for managing technical projects. It explains a system’s thinking, which involves understanding a system’s parts, their interactions, and their contribution to the system’s purpose. Key concepts include defining system objectives, resources, and constraints, and using models to understand and manage systems effectively. The chapter emphasizes a holistic view, ensuring project elements are integrated to achieve overall objectives.
0.8770/0.1556
7. What is the name of Chapter 4 of the project management book and make a 100-word summary of the chapter?
Chapter 4 focuses on planning and control methods, procedures, and systems used in project management. The chapter covers various techniques for creating a detailed project plan, including Wideman’s’ Project Management Method and the Delphi Technique. It also discusses different types of project scheduling, such as Gantt charts and critical path method, and explains how to estimate project costs and create a project budget. Additionally, the chapter covers important control procedures, including progress reporting, performance measurement, and conflict resolution.
Chapter 4 of ‘Project Management for Engineering, Business, and Technology’ covers Project and System Definition. It focuses on Phase B, which involves defining the project and system to ensure they meet user requirements and stakeholder needs. This phase includes detailed planning, specifying work tasks, resources, schedules, and budgets. It emphasizes the importance of thorough definition to avoid problems during execution. The chapter also discusses phased or rolling wave project planning, allowing for iterative refinement of plans. A successful project requires a well-defined system and project plan, integrating user and system requirements comprehensively.
0.8729/0.2887
8. What is the name of Chapter 10 of the project management book and make a 100-word summary of the chapter?
Chapter 10 focuses on the importance of organizational structure, leadership, and human behavior in project management. The chapter covers topics such as organizational design, team building, communication, motivation, and conflict resolution. It also discusses the role of stakeholders and their impact on project success. The chapter highlights the need for effective management to ensure that projects are completed on time, within budget, and with desired outcomes.
Chapter 10 of ‘Project Management for Engineering, Business, and Technology’ focuses on Project Risk Management. It explains the concepts of risk, emphasizing that risk is inherent in all projects and can lead to outcomes deviating from the plan. Key components include risk identification, assessment, response planning, and monitoring. Risk identification involves predicting potential problems, while assessment evaluates their likelihood and impact. Response planning outlines strategies to mitigate, transfer, or accept risks, and monitoring involves tracking risks throughout the project lifecycle. The chapter underscores the importance of integrating risk management into overall project management to ensure successful project outcomes.
0.8821/0.2774
9. What is the name of Chapter 16 of the project management book and make a 100-word summary of the chapter?
Chapter 16 focuses on organizational structure and culture, leadership styles, motivation, communication, and teamwork. The chapter covers topics such as organizational design, project management office (PMO) roles and responsibilities, leadership skills, and employee motivation. It also discusses the impact of cultural differences on project management and provides guidelines for managing multicultural teams.
Chapter 16 of “Project Management for Engineering, Business, and Technology” emphasizes the importance of managing participation, teamwork, and conflict within project teams. Effective leadership in project management involves a balance of task-oriented and relations-oriented approaches, with an emphasis on participative management to foster commitment and reduce stress. Teamwork is crucial for project success, and methods for building and improving teams are discussed, including techniques for virtual teams. Conflict, inevitable in projects, can be beneficial if managed properly through confrontation and problem-solving. The chapter also addresses managing emotional stress, highlighting the need for social support and participative management to mitigate stress.
0.8749/0.1911
10. What is the name of Chapter 19 of the project management book and make a 100-word summary of the chapter?
Chapter 19 focuses on the corporate context and international project management. It highlights the importance of understanding the cultural, political, and economic differences between countries when managing projects internationally. The chapter covers topics such as cross-cultural communication, team diversity, and global project management standards. It also provides an overview of the various challenges and opportunities associated with international project management.
Chapter 19 of “Project Management for Engineering, Business, and Technology” addresses International Project Management, emphasizing the complexities and challenges inherent in managing projects across borders. It discusses issues like cultural differences, local institutions, geo-national issues, and the importance of understanding local stakeholders. The chapter highlights the role of the project manager and the necessity of having a local representative to navigate these challenges effectively. It underscores the significance of communication, risk management, and building strong team relationships. Additionally, it provides strategies for handling the unique aspects of international projects to ensure their successful execution and completion.
0.9102/0.2482

References

  1. Tian, K.; Mitchell, E.; Yao, H.; Manning, C.D.; Finn, C. Fine-tuning Language Models for Factuality. arXiv 2023, arXiv:2311.08401. [Google Scholar] [CrossRef]
  2. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2024, arXiv:2307.06435. [Google Scholar] [CrossRef]
  3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.U.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Available online: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 7 April 2025).
  4. PromtEngineer. LocalGPT. Python. Available online: https://github.com/PromtEngineer/localGPT (accessed on 7 April 2025).
  5. Zylon by PrivateGPT. PrivateGPT. Python. Available online: https://github.com/zylon-ai/private-gpt (accessed on 7 April 2025).
  6. Nomic AI. gpt4all. C++. Available online: https://github.com/nomic-ai/gpt4all (accessed on 7 April 2025).
  7. ggml-org. llama.cpp. C++. Available online: https://github.com/ggml-org/llama.cpp (accessed on 7 April 2025).
  8. Ollama. ollama. Go. Available online: https://github.com/ollama/ollama (accessed on 7 April 2025).
  9. Intel® Core™ i7-8650U Processor. Available online: https://www.intel.com/content/www/us/en/products/sku/124968/intel-core-i78650u-processor-8m-cache-up-to-4-20-ghz/specifications.html (accessed on 7 April 2025).
  10. Intel® Xeon® Processor E5-2667 v2. Available online: https://www.intel.com/content/www/us/en/products/sku/75273/intel-xeon-processor-e52667-v2-25m-cache-3-30-ghz/specifications.html (accessed on 7 April 2025).
  11. Tuggener, L.; Sager, P.; Taoudi-Benchekroun, Y.; Grewe, B.F.; Stadelmann, T. So you want your private LLM at home? A survey and benchmark of methods for efficient GPTs. In Proceedings of the 2024 11th IEEE Swiss Conference on Data Science (SDS), Zurich, Switzerland, 30–31 May 2024; pp. 205–212. [Google Scholar] [CrossRef]
  12. Tripathi, O.M. GGUF Models and Quantization. SSRN 2024. SSRN:5044207. [Google Scholar] [CrossRef]
  13. microsoft/Phi-3-mini-4k-instruct · Hugging Face. Available online: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct (accessed on 7 April 2025).
  14. meta-llama (Meta Llama). Available online: https://huggingface.co/meta-llama (accessed on 7 April 2025).
  15. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  16. meta-llama/Llama-2-7b-chat · Hugging Face. Available online: https://huggingface.co/meta-llama/Llama-2-7b-chat (accessed on 7 April 2025).
  17. Lang, J.; Guo, Z.; Huang, S. A Comprehensive Study on Quantization Techniques for Large Language Models. arXiv 2024, arXiv:2411.02530. [Google Scholar] [CrossRef]
  18. TheBloke (Tom Jobbins). Available online: https://huggingface.co/TheBloke (accessed on 7 April 2025).
  19. meta-llama/Llama-2-13b-chat · Hugging Face. Available online: https://huggingface.co/meta-llama/Llama-2-13b-chat (accessed on 7 April 2025).
  20. mistralai/Mistral-7B-Instruct-v0.2 · Hugging Face. Available online: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 (accessed on 7 April 2025).
  21. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  22. Nicholas, J.M.; Steyn, H. Project Management for Engineering, Business and Technology, 5th ed.; Routledge: New York, NY, USA, 2017; Available online: https://routledgetextbooks.com/textbooks/9781138937345/ (accessed on 7 April 2025).
  23. Na, S.; Jeong, G.; Ahn, B.H.; Young, J.; Krishna, T.; Kim, H. Understanding Performance Implications of LLM Inference on CPUs. In Proceedings of the 2024 IEEE International Symposium on Workload Characterization (IISWC), Vancouver, BC, Canada, 15–17 September 2024; pp. 169–180. [Google Scholar] [CrossRef]
  24. Ganesan, K. ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks. arXiv 2018, arXiv:1803.01937. [Google Scholar] [CrossRef]
  25. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics—ACL ’02, Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; p. 311. [Google Scholar] [CrossRef]
  26. Hu, T.; Zhou, X.H. Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions. arXiv 2024, arXiv:2404.09135. [Google Scholar] [CrossRef]
  27. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Goldstein, J., Lavie, A., Lin, C.Y., Voss, C., Eds.; Association for Computational Linguistics: Ann Arbor, MI, USA, 2005; pp. 65–72. Available online: https://aclanthology.org/W05-0909/ (accessed on 7 April 2025).
  28. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2020, arXiv:1904.09675. [Google Scholar] [CrossRef]
Figure 1. Statistical comparison of workstation BERTScore metrics. For each model ten boxplots are generated, representing questions 1–10 with question 1 on the left side. Each horizontal dashed line shows the mean of means for each model across questions 1–10.
Figure 2. Statistical comparison of workstation METEOR metrics.
Figure 3. Statistical comparison of laptop BERTScore metrics.
Figure 4. Statistical comparison of laptop METEOR metrics.
Table 1. Description and parameters of the four models used in this work.

Model | Model Name | Parameters | Memory | Creator | Quantization
1 | Phi-3-Mini-4k-Instruct | 3.8 billion | 7.2 GB | Microsoft | None
2 | Llama 2 Chat 7B | 7 billion | 3.3 GB | Meta AI/TheBloke | 4-bit Medium
3 | Mistral 7B Instruct v0.2 | 7 billion | 4.37 GB | Mistral/TheBloke | 4-bit Medium
4 | Llama 2 Chat 13B | 13 billion | 7.87 GB | Meta AI/TheBloke | 4-bit Medium
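The memory figures in Table 1 can be sanity-checked with a back-of-the-envelope calculation: a quantized GGUF file is roughly the parameter count times the effective bits stored per weight, plus some fixed overhead. The sketch below is illustrative only; the 4.5 effective bits per weight (accounting for group scales in 4-bit medium quantizations) and the 0.4 GB overhead are assumed values, not measurements from this study.

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float,
                         overhead_gb: float = 0.4) -> float:
    """Rough model file-size estimate: parameters x effective bits per
    weight, plus an assumed fixed overhead for embeddings and metadata."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9 + overhead_gb

# Assumed ~4.5 effective bits/weight for a 4-bit "medium" quantization.
print(round(approx_model_size_gb(7, 4.5), 2))    # ~4.34 GB, near Mistral 7B's 4.37 GB
print(round(approx_model_size_gb(13, 4.5), 2))   # ~7.71 GB, near Llama 2 13B's 7.87 GB
```

These estimates land close to the 4.37 GB and 7.87 GB entries in Table 1, which is why 4-bit quantization is what makes 7B–13B models practical on consumer-grade RAM.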
Table 2. Questions asked to the models and the sources for the answers to each question.

Number | Question | Source
1 | What are the five functions of managers? | Chapter 1.1
2 | What are the four phases of the systems development lifecycle? | Chapter 3.2
3 | What are the four costs of quality? | Chapter 9.2
4 | What are the five responsibilities of the project engineer? | Chapter 15.5
5 | What are the three common types of programs? | Chapter 17.5
6 | What is the name of Chapter 2 of the project management book and make a 100-word summary of the chapter? | Chapter 2
7 | What is the name of Chapter 4 of the project management book and make a 100-word summary of the chapter? | Chapter 4
8 | What is the name of Chapter 10 of the project management book and make a 100-word summary of the chapter? | Chapter 10
9 | What is the name of Chapter 16 of the project management book and make a 100-word summary of the chapter? | Chapter 16
10 | What is the name of Chapter 19 of the project management book and make a 100-word summary of the chapter? | Chapter 19
Table 3. Example METEOR and BERTScore metrics with different variations.

Variation | Prediction | METEOR | BERTScore
Reference statement | Artificial intelligence is an increasingly prominent topic across multiple domains. | 1.0 | 1.0
Minor modification | Artificial intelligence is a progressively popular topic across various domains. | 0.672 | 0.959
Negative | Artificial intelligence is not an increasingly prominent topic across multiple domains. | 0.986 | 0.975
Unrelated statement | Language models generate text based on patterns they learned during training. | 0.0 | 0.859
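To make the METEOR behavior in Table 3 concrete, the following is a minimal exact-match sketch of the metric: unigram precision and recall combined with METEOR's recall-weighted harmonic mean and fragmentation penalty. It omits the stemming and synonym matching of full METEOR, so it scores paraphrases lower than the values reported in Table 3 (which were produced with a complete implementation).

```python
def simple_meteor(reference: str, prediction: str) -> float:
    """Exact-match-only METEOR sketch. Full METEOR also matches
    stems and WordNet synonyms, so real scores run higher."""
    ref = reference.lower().split()
    pred = prediction.lower().split()
    # Greedy left-to-right alignment of exact unigram matches.
    used = [False] * len(ref)
    align = []  # (prediction index, reference index) for each match
    for i, tok in enumerate(pred):
        for j, rtok in enumerate(ref):
            if not used[j] and tok == rtok:
                used[j] = True
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    precision = m / len(pred)
    recall = m / len(ref)
    # Harmonic mean weighted 9:1 toward recall, per the METEOR paper.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Chunks: maximal runs of matches contiguous in both strings.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(align, align[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)

ref = "Artificial intelligence is an increasingly prominent topic across multiple domains."
pred = "Artificial intelligence is a progressively popular topic across various domains."
print(f"{simple_meteor(ref, pred):.2f}")  # exact-match sketch gives 0.56; full METEOR reports 0.672
```

The gap between the sketch's 0.56 and Table 3's 0.672 for the same sentence pair illustrates how much of METEOR's credit for paraphrase comes from its stem and synonym modules.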
Table 4. Inference timing metrics for each model on the laptop and workstation.

Platform | Model | Prompt Time (s) | Generation Time (s) | Per-Token Time (ms)
Laptop | 1: Phi-3-Mini-4k 3.8B | 62.9 ± 31.8 | 172 ± 66 | 533 ± 14
Laptop | 2: Llama 2 Chat 7B | 70.1 ± 37.1 | 55.4 ± 21.7 | 316 ± 7
Laptop | 3: Mistral 7B Instruct | 71.9 ± 37.8 | 40.5 ± 18.8 | 311 ± 6
Laptop | 4: Llama 2 Chat 13B | 149 ± 74 | 120 ± 35 | 683 ± 6
Workstation | 1: Phi-3-Mini-4k 3.8B | 10.7 ± 5.4 | 61.3 ± 26.9 | 184 ± 3
Workstation | 2: Llama 2 Chat 7B | 21.2 ± 11.0 | 18.6 ± 6.9 | 106 ± 2
Workstation | 3: Mistral 7B Instruct | 21.9 ± 11.3 | 14.2 ± 7.5 | 103 ± 1
Workstation | 4: Llama 2 Chat 13B | 40.7 ± 21.1 | 32.9 ± 10.3 | 187 ± 2
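The timing entries in Table 4 can be reproduced from raw per-run measurements as a mean ± standard deviation pair, and the per-token latencies convert directly to generation throughput. The sketch below shows both computations; the four raw prompt-time samples are hypothetical values chosen for illustration, not data from the study.

```python
import statistics

def summarize(samples_s: list[float]) -> str:
    """Report a set of timing samples as mean ± sample standard
    deviation, matching the format used in Table 4."""
    mean = statistics.mean(samples_s)
    stdev = statistics.stdev(samples_s)
    return f"{mean:.1f} ± {stdev:.1f}"

def tokens_per_second(per_token_ms: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / per_token_ms

# Hypothetical raw prompt-time samples (seconds) for one model.
print(summarize([10.1, 15.8, 5.9, 11.0]))
# Workstation Mistral 7B at 103 ms/token generates roughly 9.7 tokens/s.
print(round(tokens_per_second(103), 1))
```

Viewed as throughput, the workstation's 103–187 ms/token corresponds to roughly 5–10 tokens per second, while the laptop's 311–683 ms/token yields only 1.5–3, which is the practical difference a user perceives between the two CPU-only configurations.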
Table 5. Model loading times for workstation and laptop configurations.

Model | Model Name | Parameters | Workstation Load Time (s) | Laptop Load Time (s)
1 | Phi-3-Mini-4k-Instruct | 3.8 billion | 11.5 | 58.8
2 | Llama 2 Chat 7B | 7 billion | 19.6 | 60.1
3 | Mistral 7B Instruct v0.2 | 7 billion | 20.8 | 64.4
4 | Llama 2 Chat 13B | 13 billion | 37.0 | 135.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tyndall, E.; Wagner, T.; Gayheart, C.; Some, A.; Langhals, B. Feasibility Evaluation of Secure Offline Large Language Models with Retrieval-Augmented Generation for CPU-Only Inference. Information 2025, 16, 744. https://doi.org/10.3390/info16090744
