1. Introduction
Modern manufacturing enterprises operate across complex, layered systems that range from equipment-level automation to enterprise-wide planning and control. While innovations in robotics, Programmable Logic Controllers (PLCs), Supervisory Control and Data Acquisition (SCADA), Manufacturing Execution Systems (MESs), and Enterprise Resource Planning (ERP) have enabled greater throughput and coordination [
1], inventory search and management remain a persistent operational challenge, especially in legacy environments.
In particular, inventory search systems within industrial settings struggle to keep pace with the evolving complexity and scale of manufacturing data. Over time, organizations accumulate vast amounts of data across changing software, naming conventions, and reference structures. This evolution often renders historical data unstructured, inconsistent, and difficult to query. Inventory systems suffer especially from fragmentation, making it hard to interpret part names, categories, and usage history without extensive programming or insider knowledge.
Traditional systems rely on keyword-based queries and static filters, forcing users to iteratively adjust search terms to locate parts or components. These methods fall short in handling semantic variability, complex queries, or incomplete input. To address this critical issue, we present an effective, secure, and intelligent inventory management and recommendation system that has been developed and rigorously tested in a real-world manufacturing setting.
This paper introduces a novel framework that leverages Large Language Models (LLMs) to enable intelligent, context-aware inventory search and recommendations. By integrating state-of-the-art techniques, such as Retrieval-Augmented Generation (RAG), which incorporates concepts like data vectorization and vector search, the proposed system significantly enhances both the relevance and efficiency of inventory access. Furthermore, in response to growing concerns over data confidentiality in industrial AI applications, our solution includes robust privacy-preserving mechanisms to secure sensitive information throughout the search process.
LLMs have shown transformative potential across various industries by enhancing communication, streamlining workflows, and supporting data-driven decision-making. In manufacturing, LLMs can interpret ERP and internal database queries more intelligently, addressing persistent communication challenges such as order inconsistencies and feedback loops between customers and producers [
2]. They also support predictive maintenance by classifying work orders, estimating duration, and identifying key failure factors, thereby reducing downtime and improving resource allocation [
3]. Additionally, LLMs help address challenges in data preparation, a significant barrier to the adoption of ML in manufacturing. The labor-intensive nature of data wrangling often limits the scalability of ML solutions. LLMs, however, can automate aspects of this process, enabling non-experts to engage with data science workflows and improving interdisciplinary collaboration [
4]. Their ability to parse complex data and derive actionable insights supports the broader vision of intelligent manufacturing and data-centric operations [
5].
In summary, our contribution is an LLM-enhanced, privacy-aware inventory search and recommendation system specifically designed to address the challenges of legacy industrial data. The solution is validated in a manufacturing context and is extensible to other domains requiring intelligent inventory or product search.
The rest of this article is organized as follows:
Section 2 reviews related work relevant to the methods presented in this study.
Section 3 details the methodology.
Section 4 provides the performance evaluation metrics, while
Section 5 provides the implementation of the proposed methods and the results. Finally,
Section 6 presents the conclusions.
2. Related Work
Emerging inventory management trends address the limitations of traditional search systems by integrating ML and AI to automate and improve search processes. These technologies enable dynamic, context-aware searches, thereby reducing the need for manual query refinement [
6,
7]. Despite this, traditional systems remain prevalent in manufacturing, underscoring the need for continued development.
One significant development is the integration of AI with legacy inventory systems. This approach seeks to overcome the challenges posed by traditional systems, which often struggle with data silos and lack of real-time insights. Singh et al. [
2] discuss innovative strategies that leverage AI to enhance the functionality of existing inventory systems, drawing on case studies that demonstrate successful integrations across various sectors. By utilizing AI, manufacturers can automate data processing, improve demand forecasting, and optimize inventory levels, thereby reducing the need for manual adjustments and enhancing overall operational efficiency.
Furthermore, the application of AI in Just-In-Time (JIT) inventory management is gaining traction. Pal et al. [
8] highlight how these technologies are utilized to elevate demand forecasting accuracy, which is crucial for aligning inventory levels with fluctuating market demands. This integration not only streamlines inventory management but also minimizes waste and reduces holding costs, addressing some of the inefficiencies associated with traditional inventory systems. Another promising trend is the development of AI-driven real-time monitoring systems. Okuyelu et al. [
9] emphasize the importance of real-time quality monitoring and process optimization in manufacturing. By implementing AI systems that continuously analyze inventory data and production processes, manufacturers can make informed decisions quickly, thereby reducing their reliance on manual input and improving responsiveness to inventory changes.
With the advancements in the inference of Deep Learning (DL) and ML models, recommendation systems have also garnered significant attention in manufacturing environments. Although such systems are better suited to platforms such as entertainment streaming, where quality depends on vast amounts of collected data, they can be incorporated into manufacturing with considerable success. The authors of [10] examined several different architectures of recommendation systems that have been the focus of various modern developments. They discussed recommendation techniques, the data used in recommendation systems, deep learning, potential applications of recommendation systems, ML algorithms, evaluation metrics, and open challenges. The primary concern addressed was preventing a recommendation system from overwhelming the user with excessive information. Clustering is a standard ML technique in these systems, and the most common accuracy metrics are Mean Absolute Error, Precision, Recall, and F-measure. Scalability and latency are significant issues that still need to be addressed, though privacy and security were also discussed.
Marcuzzo et al. [
11] focus their work on introducing current trends in recommendation systems, updating the taxonomy, and outlining the different trends in research, as well as the problems that have yet to be addressed. The authors define and discuss item recommendations, learning objectives, ranking, sampling, and taxonomies and provide an overview of the methods, experimental factors, accuracy metrics, recent advancements, and challenges. The relevant factors affecting model design, such as available data and chosen evaluation metrics, are introduced and compared to provide the foundation of knowledge that several recommendation systems reference. The authors emphasize the need for clearly defined testing protocols and benchmarks to create more universal systematic evaluation procedures to indicate the differences in each model’s performance.
The work of He et al. [
12] focuses on tackling problems in collaborative filtering based on implicit feedback. Key topics discussed include learning from implicit data, matrix factorization, neural collaborative filtering, a fusion of generalized matrix factorization and multi-layer perceptron, and the performance of the proposed solution. The authors indicate that this framework is simple and generic, serving as a guideline for developing new DL models and opening a new avenue for future work, especially in extending models to incorporate auxiliary information and building multimedia recommender systems.
Additionally, the use of human-centered design principles in technology implementation is becoming increasingly important. Berretta et al. [
13] argue that incorporating human factors into the design of AI systems can enhance user experience and improve the effectiveness of inventory management tools. This shift towards a more user-centric approach ensures that technology complements human decision-making rather than complicates it, thereby addressing some of the frustrations associated with traditional inventory management methods. Moreover, integrating AI-powered analytics into supply chain management transforms how manufacturers approach inventory optimization. Adegbola [
14] discusses the potential of advanced financial modeling techniques and AI-driven analytics to reduce inventory costs and enhance overall competitiveness. By leveraging these technologies, manufacturers can gain deeper insights into their inventory dynamics, enabling more informed decision-making and enhanced operational performance.
Most existing inventory management solutions leverage ML for search, predictive maintenance, and real-time monitoring while typically remaining confined to text-based queries, manual adjustments, and fixed search parameters. Our approach departs from these conventions by integrating LLMs with RAG to provide context-aware semantic recommendations. Additionally, our privacy-preserving mechanisms address critical data security concerns in industrial settings. Extensive real-world testing further demonstrated the scalability and adaptability of our framework, making it a robust solution for next-generation inventory management. An overall, high-level implementation flow of our application is depicted in
Figure 1. The contributions of this paper are summarized as follows:
Integration of LLMs with RAG, vector embeddings, and Approximate Nearest Neighbor (ANN) search for dynamic, context-aware inventory recommendations.
Incorporation of robust privacy-preserving mechanisms suitable for industrial applications.
Demonstration of scalability and effectiveness through real-world industrial testing.
3. Methodology
Since LLMs are a relatively recent development, the technologies surrounding them are still evolving and continually improving. This rapid pace of advancement means that new techniques and methodologies are frequently introduced, making it a dynamic field. However, despite this ongoing evolution, several foundational concepts are consistently utilized in many LLM-based applications to achieve desired outcomes. This section discusses, in detail, the concepts that are used in this study to develop a robust framework—including techniques like RAG, data orchestration, fine-tuning, context-aware generation, and leveraging large-scale pre-training—that forms the backbone of how LLMs are applied across various domains. Understanding and effectively implementing these concepts is crucial for maximizing the potential of LLMs in real-world applications.
3.1. Large Language Models
LLMs have significantly advanced Natural Language Processing by learning from large, diverse corpora, enabling them to understand and generate human-like text beyond the capabilities of rule-based systems [
15]. Their effectiveness stems from transformer-based architectures (
Figure 2), which process input text using embeddings, positional encoding, self-attention layers, and decoders that predict word sequences through probabilistic outputs. This design supports parallel data processing and captures complex linguistic patterns using deep neural networks.
Models like GPT-4 demonstrate high performance in generating coherent, context-aware responses, making them valuable for tasks such as content generation and dialogue systems [
17]. Beyond text generation, LLMs are being adopted in various fields, including education, government, and recommendation systems. In academia, they support personalized learning and administrative efficiency [
18], while in digital governance they enhance service delivery and citizen interaction via conversational interfaces [
19,
20].
In the work presented in this paper, we have employed OpenAI’s GPT-4o model. GPT-4o (“o” for “omni”) is OpenAI’s flagship multimodal model, supporting text and image inputs with text-based outputs, including structured formats. It features a 128,000-token context window, supports up to 16,384 output tokens, and has a training data cutoff of 30 September 2023. GPT-4o is optimized for most tasks, offering strong performance across modalities, though audio input is not supported. The model supports key features such as streaming, function calling, structured outputs, fine-tuning, and tool integration (e.g., web search, image generation, code interpreter). It is accessible via multiple endpoints, including the chat, batch, and assistants APIs.
3.2. Retrieval-Augmented Generation
While LLMs face numerous challenges, particularly in terms of ethical considerations and generating biased content, hallucination is one of the most prominent issues in applications such as industrial automation. Hallucinations occur when LLMs generate false but plausible-sounding information due to gaps in their knowledge or when they are given too many tokens in a prompt. To address this issue, one of the most robust methods used is Retrieval-Augmented Generation (RAG).
RAG addresses the limitations of LLMs by incorporating an information retrieval component into the text generation process. This integration enables LLMs to access current and domain-specific knowledge from external sources, thereby enhancing the accuracy and relevance of their outputs. By relying less on static training data, RAG helps reduce hallucinations and enhances the reliability of LLMs in important use cases. Instead of supplying all the data, RAG provides the ability to extract only the necessary information relevant to a user’s prompt, enabling more accurate answers.
Several key components are necessary for a successful RAG pipeline. The following subsections discuss each major component used in this paper’s RAG implementation.
3.2.1. Vector Embeddings
Vector embeddings are a fundamental concept in RAG, representing objects such as control panel components as vectors in a continuous vector space. This method captures functional relationships and similarities between parts, supporting intelligent applications like automated part classification, predictive maintenance, and inventory optimization. The embedding process transforms discrete part information into numerical vector representations that reflect both semantic and functional characteristics.
Embedding models play a crucial role in this process by converting words or terms into vectors based on their meanings and usage within a specific context. Trained on large collections of text, these models learn to position semantically related terms closer together in a high-dimensional space [
21]. In the context of control panel manufacturing, for example, embedding models can identify that ‘relay’ and ‘contactor’ are functionally similar and frequently used together, mapping them to nearby points in the vector space. This numerical encoding preserves meaningful relationships, enabling systems to perform tasks such as clustering, classification, and analogy detection more effectively.
Figure 3 and
Figure 4 illustrate how this works. Terms like ‘relay’ and ‘contactor’ appear close to each other due to their similar roles, while components like ‘timer’ are positioned further apart, reflecting their distinct functions. The diagrams also highlight analogical patterns such as the relationship between ‘switch:block’ and ‘button:light’, demonstrating how embeddings capture structure and meaning within technical vocabularies.
We utilize an open-source embedding model, BAAI/bge-small-en-v1.5 [
22]. The bge-small-en-v1.5 model, developed by the Beijing Academy of Artificial Intelligence (BAAI) as part of the FlagEmbedding project, is a compact English text embedding model designed for efficient performance in resource-constrained environments. As a smaller variant of the larger bge-base and bge-large models, it utilizes 384-dimensional embeddings.
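As a brief, hedged illustration, part descriptions can be embedded with this model through the sentence-transformers library as follows; the part strings are invented examples rather than entries from the sponsor’s dataset.

```python
# Sketch: 384-dimensional embeddings for part descriptions using
# BAAI/bge-small-en-v1.5 via sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

parts = [
    "Relay, 24 VDC coil, 2 NO contacts",
    "Contactor, 120 VAC coil, 3-pole",
    "Timer, on-delay, 0.1-10 s",
]
# encode() returns one 384-dimensional vector per description.
embeddings = model.encode(parts, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)
```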
3.2.2. Vector Search
Vector search is a technique used to find items that are the most similar to a given query by comparing their vector representations in a high-dimensional space. Unlike traditional keyword-based search, which relies on exact or partial word matches, vector search uses numerical embeddings that capture the semantic meaning of text, images, or other data types. This enables more flexible and accurate retrieval, particularly in instances where relevant content may not share the same vocabulary as the query. By measuring the distance or similarity between vectors using metrics like cosine similarity or Euclidean distance, vector search enables systems to return results that are conceptually related, even if they differ in wording or structure.
In this work, we utilized vector search to find matching vectors based on the user’s query. The query was itself converted into a vector and then used to retrieve similar vectors from the stored embeddings.
The Approximate Nearest Neighbor (ANN) search is a method for efficiently finding points in high-dimensional space that are close to a query point without guaranteeing exact matches. It is beneficial for large datasets where an exact search is too slow or costly. By allowing for slight inaccuracies, ANN significantly speeds up the search, making it practical for applications such as recommendation systems, image recognition, and Natural Language Processing.
Common ANN algorithms include methods like Locality-Sensitive Hashing (LSH), Product Quantization (PQ), Hierarchical Navigable Small World (HNSW) graphs, and tree-based approaches such as KD-Trees and Ball Trees. These techniques reduce search time and memory usage by organizing data in a way that allows for the quick approximation of nearest neighbors. ANN typically begins with dimensionality reduction to simplify computations, and it operates within metric spaces using distance measures, such as Euclidean or cosine similarity, to evaluate the closeness of data points to one another.
BM25 (Best Matching 25) is a widely used ranking function in information retrieval that estimates the relevance of documents to a given query. It is based on the probabilistic retrieval framework and incorporates key factors, including term frequency, inverse document frequency, and document length normalization. The BM25 scoring function rewards documents that contain frequent and rare query terms while penalizing excessively long documents to prevent length bias. The relevance score of a document D with respect to a query Q is given by

\[
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left( 1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}} \right)}
\]

Here, $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the length of the document, and $\mathrm{avgdl}$ is the average document length in the corpus. The parameters $k_1$ and $b$ are typically set to values such as $k_1 = 1.2$ and $b = 0.75$, respectively. $\mathrm{IDF}(q_i)$ represents the inverse document frequency of term $q_i$, which gives more weight to informative terms.
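As a hedged illustration of this scoring function, the sketch below uses the rank_bm25 package rather than our production retrieval stack; the corpus and query are invented part descriptions.

```python
# Sketch: BM25 relevance scores over a toy corpus of part descriptions.
# rank_bm25's BM25Okapi defaults to k1 = 1.5 and b = 0.75.
from rank_bm25 import BM25Okapi

corpus = [
    "fuse time delay 250 vac 15 a",
    "fuse fast acting 600 vac 30 a",
    "circuit breaker 2 pole 20 a",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "time delay fuse 15 a".split()
print(bm25.get_scores(query))  # one relevance score per document
```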
In this work, we employed ANN search alongside BM25 to perform a technique called hybrid vector search. While the method’s performance depends heavily on how the two scoring functions are tuned and combined, it tends to perform better on large datasets, such as parts inventories.
On the ANN aspect, the cosine similarity search was utilized. Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them. Unlike other metrics, it is independent of vector magnitude, focusing solely on the orientation of vectors within the vector space.
Mathematically, cosine similarity is defined as the dot product of two vectors divided by the product of their magnitudes:

\[
\mathrm{sim}(\mathbf{A}, \mathbf{B}) = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}
\]
This property makes cosine similarity especially useful in fields like text analysis and NLP, where documents are often represented as high-dimensional vectors based on term frequency–inverse document frequency (TF–IDF) or word embeddings [
23,
24]. It is frequently preferred in vector similarity searches, due to its ability to handle sparse data efficiently. In information retrieval systems, for example, cosine similarity enables the ranking of documents by relevance to a query, facilitating the retrieval of the most pertinent results [
25,
26]. Additionally, cosine similarity performs well in high-dimensional spaces, where traditional metrics like Euclidean distance may struggle due to the curse of dimensionality [
27].
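To make the computation concrete, the following short NumPy sketch evaluates this formula on toy vectors; real part embeddings would be 384-dimensional.

```python
# Cosine similarity: dot product divided by the product of magnitudes.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

relay = np.array([0.8, 0.1, 0.3])       # toy embedding for 'relay'
contactor = np.array([0.7, 0.2, 0.4])   # toy embedding for 'contactor'
print(cosine_similarity(relay, contactor))  # near 1.0 for similar parts
```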
3.3. Vector Databases
Vector databases differ from traditional databases in both structure and function, offering advantages for applications that rely on similarity search rather than exact matching. Traditional databases store structured data and use rule-based queries. In contrast, vector databases manage high-dimensional embeddings, numerical representations of unstructured data such as text, images, or video, to enable semantic search using similarity metrics, including cosine similarity or Euclidean distance [
28]. Most vector database systems include built-in support for embedding generation, vector computation, and optimization. These databases are crucial in AI-driven use cases, such as recommendation systems, image retrieval, and personalized search, where relevance depends on meaning rather than keywords. They also provide efficient indexing and storage at scale, support real-time querying, reduce latency in machine learning workflows, and lower the cost and complexity of building custom retrieval solutions [
29,
30].
In this work, we utilized a self-hosted instance of Qdrant, an open-source and commercially available database solution, running within a Docker container. Qdrant is a high-performance vector similarity search engine designed for managing and querying high-dimensional vectors with optional metadata, known as payloads. It is well-suited for applications such as semantic search and recommendation systems, where traditional databases often fall short. Qdrant supports distance metrics such as cosine similarity, dot product, and Euclidean distance, as well as BM25 hybrid search, and it utilizes efficient indexing methods like HNSW for fast Approximate Nearest Neighbor search. Data is organized into collections of points, each consisting of a vector, an ID, and optional payloads for filtering and enriched search results. With flexible storage options, a simple API, and support for various deployment environments, Qdrant offers an efficient and scalable solution for vector-based retrieval tasks.
The database instance was optimized to utilize an ANN + BM25 hybrid search for improved vector search. The data was accessed through the REST-API functionality provided by the Qdrant database.
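A minimal sketch of this setup, assuming a local Docker instance on the default port and an illustrative collection schema rather than the sponsor’s actual payload fields, might look as follows:

```python
# Sketch: creating a 384-dim cosine collection in Qdrant, upserting one
# part embedding, and running a thresholded search via the Python client.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # Dockerized instance

client.create_collection(
    collection_name="parts",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

client.upsert(
    collection_name="parts",
    points=[
        PointStruct(
            id=1,
            vector=[0.1] * 384,  # placeholder; use a real part embedding
            payload={"part_number": "FLNR015", "type": "fuse"},
        ),
    ],
)

hits = client.search(
    collection_name="parts",
    query_vector=[0.1] * 384,  # placeholder query embedding
    limit=5,
    score_threshold=0.7,
)
```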
3.4. Data Privacy
Data privacy concerns are increasingly relevant in the use of LLMs, primarily due to the risk of unintentionally retaining or exposing sensitive data from training datasets. While LLMs do not store information in a conventional memory structure, the AI community has concerns that LLM providers may collect user prompts and responses for model refinement. This practice can integrate sensitive information into the model’s knowledge, making it vulnerable to exposure in later interactions.
The primary interface between a user or application and an LLM is the prompt, making any prompt that contains sensitive information a potential privacy risk. Traditional privacy-preserving techniques, such as Differential Privacy (DP), often fall short in the context of LLMs. Shi et al. [
31] highlight that standard DP methods treat all data points uniformly, which can degrade model performance. As an alternative, prompt obfuscation methods have emerged in research as a simple yet effective approach to enhance privacy without significantly impacting utility.
Prompt obfuscation involves transforming the original text or a part of it to obscure its meaning, significantly reducing its readability and recognizability while preserving the ability to recover the original content accurately. This balance ensures both privacy protection and data integrity. Several methods can be used for obfuscation, each with varying levels of complexity and effectiveness. Base64 encoding converts text into an ASCII representation of binary data, making it less readable to humans. ROT13 applies a simple letter substitution by rotating each character 13 positions in the alphabet. Hex encoding represents each character as a two-digit hexadecimal number, while URL encoding replaces special characters with percent-encoded equivalents. Finally, reversing the string provides a basic yet sometimes effective obfuscation by simply inverting the order of characters. The choice of algorithm depends on the desired trade-off between simplicity, obfuscation strength, and ease of reversibility.
This application involves processing customer and component information, some of which is considered sensitive. Applying heavy obfuscation to the entire prompt would degrade the performance of the language model due to the added complexity. To balance privacy and model efficiency, this work utilizes ROT13 to obfuscate only the sensitive words identified prior to prompt construction. ROT13 is a simple Caesar cipher variant that shifts each letter by 13 positions (e.g., ‘A’ becomes ‘N,’ ‘Z’ becomes ‘M’). Although not designed for strong encryption, ROT13 is effective for lightweight text scrambling, making it well-suited for scenarios where obfuscation, not security, is the primary goal [
32]. An example of ROT13 obfuscation is shown in
Figure 5.
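A minimal sketch of this selective obfuscation is shown below; the term list is a hypothetical stand-in for the application’s predefined sensitive fields.

```python
# Sketch: ROT13-obfuscate only predefined sensitive terms before the
# prompt is constructed. The term set is an invented example.
import codecs

SENSITIVE_TERMS = {"Acme Corp", "John Doe"}  # hypothetical sensitive fields

def obfuscate(prompt: str) -> str:
    for term in SENSITIVE_TERMS:
        prompt = prompt.replace(term, codecs.encode(term, "rot13"))
    return prompt

print(obfuscate("Quote for Acme Corp: fuse FLNR015"))
# -> "Quote for Npzr Pbec: fuse FLNR015"
```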
5. Implementation and Results
In this work, we implemented an LLM-based search framework designed to enhance inventory search and recommendation processes within a control panel manufacturing facility. The framework utilized a dataset provided by our industry sponsor, which includes part numbers, availability, and usage information. Developed as an API in Python 3.11, the system integrates seamlessly with internal engineering and production tools, providing flexible access to inventory data. As illustrated in
Figure 6, the architecture is designed to support efficient searches across a large and dynamic inventory, thereby reducing dependence on tribal knowledge and minimizing inefficiencies associated with manual search refinement.
The core pipeline for search and recommendation is detailed in Algorithm 1. When a user submits a query, it is first converted into a 384-dimensional vector embedding, using a transformer-based model. This embedding is then used to search a vector database for similar items, applying a similarity threshold of 0.7 and returning up to five candidate parts. These candidates are subsequently evaluated by the LLM, which generates a context-aware recommendation that is returned to the user. The specific parameters, including embedding dimension, similarity threshold, and maximum result count, were selected based on system testing to balance performance with the computational cost of LLM inference.
A distinguishing feature of the proposed framework is its use of a vector database that dynamically expands as new queries are processed. This capability enables continuous learning, allowing the system to generate increasingly accurate and context-aware search results over time. By serving as an intelligent assistant, the system provides engineers with rapid access to relevant component and design information, thereby streamlining workflows and supporting data-driven decision-making.
Algorithm 1 Inventory Search and Recommendation Pipeline
1: q: user query
2: E: embedding model (dimension d = 384)
3: V: vector database (similarity threshold τ = 0.7)
4: N: max results (N = 5)
5: procedure SimpleSearchAndRecommend(q, E, V, N)
6:     e ← E(q)
7:     C ← VectorDB_search(e, V, N, τ)
8:     r ← LLM_recommend(q, C)
9:     return r
10: end procedure
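As a hedged illustration, a compact Python version of this pipeline might look as follows; the embedder and vector database objects follow the sketches in Section 3, and the prompt wording and OpenAI client usage are illustrative assumptions rather than the production code.

```python
# Sketch of Algorithm 1 with the stated parameters (384-dim embeddings,
# similarity threshold 0.7, N = 5). Function and prompt details are
# illustrative assumptions, not the deployed implementation.
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_and_recommend(query: str, embedder, vector_db) -> str:
    e = embedder.encode(query)  # q -> 384-dimensional embedding
    candidates = vector_db.search(
        collection_name="parts",
        query_vector=e.tolist(),
        limit=5,                # N = 5 candidate parts
        score_threshold=0.7,    # similarity threshold tau
    )
    context = "\n".join(str(hit.payload) for hit in candidates)
    response = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Recommend inventory parts using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuery: {query}"},
        ],
    )
    return response.choices[0].message.content  # r: recommendation text
```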
The framework utilized the
all-MiniLM-L6-v2 transformer model for embedding generation [
33]. To further enhance search and recommendation accuracy, the embedding model was optionally fine-tuned using domain-specific part lists and sample queries, as guided by the model provider [
34]. A high-level overview of this training process is depicted in
Figure 7.
The trained embedding model transformed the complete parts database into a vector database. Each part and its associated characteristics were represented as a single embedding and stored in the vector database. We used Qdrant to store these vector embeddings, both as dense and sparse vectors, utilizing Qdrant’s internal tools. The embedding model maps sentences and paragraphs into a 384-dimensional dense vector space suitable for clustering and semantic search [
33]. Algorithm 2 describes the basic steps of converting parts into embeddings.
The search process begins by taking a user query, converting it to a vector using the embedding model, and performing an initial vector search using Qdrant’s search tools. It then retrieves a specified number (N) of vectors based on both dense and sparse vector matching.
Algorithm 2 Convert text to vector embeddings
1: P: a list of parts
2: V: a list of vector embeddings
3: procedure GenerateEmbeddings(P, V)
4:     V ← ∅
5:     for each item i in P do
6:         e_i ← E(i)
7:         V ← V ∪ {e_i}
8:     end for
9:     return V
10: end procedure
These retrieved vectors undergo a secondary vector search to refine results based on specific characteristics such as voltage, amperage, and product availability. Due to LLM token limitations, the program may need to filter and select only a subset of these vectors; in this case, five vectors are chosen to be sent to the LLM. The second vector search uses Faiss, a library for efficient similarity search and clustering of dense vectors, with algorithms that search sets of vectors of any size [
35].
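A minimal sketch of this refinement step with Faiss is shown below; IndexFlatIP performs exact inner-product search, which coincides with cosine similarity on L2-normalized vectors, and the random candidates stand in for the first-stage Qdrant results.

```python
# Sketch of the secondary (Faiss) search over first-stage candidates.
# Vectors are random placeholders, not real part embeddings.
import faiss
import numpy as np

dim = 384
candidates = np.random.rand(50, dim).astype("float32")
faiss.normalize_L2(candidates)         # normalize so inner product == cosine

index = faiss.IndexFlatIP(dim)         # exact inner-product index
index.add(candidates)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)   # top-5 refined matches for the LLM
print(ids[0], scores[0])
```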
The steps of converting a user query and performing the multi-step vector search are depicted in
Figure 8. The figure uses the experimental example discussed later in this section. The selected results are then passed through an obfuscation module, which obfuscates any predefined sensitive information. ROT13 is used for this purpose, applied to predefined sensitive fields such as customer names, proprietary product names, and contact information. The sensitive information can vary from application to application, depending on how this main framework and other tools are utilized.
Figure 9 depicts using ROT13 with the selected parts and parts data.
We now present an experimental example to demonstrate the process of part retrieval within the proposed system. The vector search output, shown in
Table 2, reflects the system’s ability to identify similar parts available in inventory that align with the specified characteristics of a given part number and description, using Algorithm 3. For this example, the input requested parts that matched a specified fuse type,
FLNR015, with the characteristics ‘Fuse, Delay, 250 VAC, 15A, 200 kA’. The result table shows the first five results of the vector search with the highest similarity scores. These results can now be sent to the LLM for reasoning.
Algorithm 3 Two-Step Vector Search
1: q: query against which to match embeddings.
2: V: a list of selected embeddings.
3: E: embedding model.
4: Q: Qdrant vector database.
5: F: Faiss search.
6: procedure SearchEmbeddings(q, E, Q, F)
7:     e_q ← E(q)
8:     R ← Qdrant_search(e_q, Q)
9:     S: embeddings of each property in R.
10:    V: selected vector embeddings.
11:    for each property i in R do
12:        S ← S ∪ E(i)
13:    end for
14:    for each embedding j in S do
15:        s_j ← F(e_q, j)
16:        V ← 5 embeddings with highest s_j
17:    end for
18:    return V
19: end procedure
The system is configured to apply characteristic-specific matching rules, such as avoiding undersized fuses unless explicitly requested by the user. To achieve this functionality, the embedding model is trained and deployed alongside a powerful LLM, such as GPT-4o, providing a robust combination for effective part retrieval. Assessing the accuracy of an LLM-based software application is inherently challenging, due to the lack of universally perfect evaluation methods. However, actively analyzing the system’s input and output using a curated dataset can provide valuable insights into its effectiveness and accuracy. In this application, a dataset of Question-and-Answer pairs was generated using a process inspired by the principles of LLM distillation [
36].
Distillation is a widely used method in machine learning, particularly in the context of LLMs, where a larger, more powerful model (the “teacher”) is used to train a smaller, more efficient model (the “student”). The teacher model generates extensive training data, such as Question-and-Answer pairs or other forms of structured outputs, which capture its advanced reasoning, knowledge, and decision-making capabilities. This generated data serves as a simplified and targeted representation of the teacher model’s understanding, allowing the student model to learn from it. Distillation thus transfers knowledge from the teacher to the student and enables the creation of domain-specific models that are faster, more resource-efficient, and tailored to specific applications. For example, chain-of-thought distillation, an advanced variant of this approach, involves generating step-by-step reasoning Question-and-Answer pairs. This method helps train smaller models to mimic not just the conclusions of the teacher model but also its reasoning pathways, improving the interpretability and reliability of the distilled models.
In this work, while a traditional distillation process was not employed, since the application used a pre-trained high-end LLM (GPT-4o), the distillation principles were leveraged in creating a curated dataset. This dataset, generated using a similar chain-of-thought methodology, was used to evaluate the effectiveness and accuracy of the application rather than to train a new model.
Figure 10 illustrates an example of an advanced distillation process, highlighting the generation of chain-of-thought Question-and-Answer pairs. This evaluation approach ensures that the system’s outputs align closely with the expected results and demonstrates the utility of distillation techniques for assessing LLM-based applications.
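As a minimal sketch of this idea (the actual prompts and record format used in this work are not reproduced here), chain-of-thought Question-and-Answer pairs can be generated from inventory records roughly as follows:

```python
# Hedged sketch: using a teacher model (GPT-4o) to generate a
# chain-of-thought QA pair from one inventory record for evaluation.
# The prompt and record are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def generate_qa_pair(part_record: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "From the inventory record below, write one question an "
                "engineer might ask, step-by-step reasoning, and the final "
                "answer.\n\nRecord: " + part_record
            ),
        }],
    )
    return response.choices[0].message.content

print(generate_qa_pair("FLNR015: Fuse, Delay, 250 VAC, 15A, 200 kA"))
```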
The system was thoroughly evaluated using real-world manufacturing inventory data provided by the research sponsors. The dataset included a diverse range of part descriptions, both well-structured and poorly formatted, including entries with special characters or inconsistent terminology. This diversity was intentional to assess the robustness of the RAG-based system and the underlying language model when exposed to noisy, industry-specific input.
To benchmark performance, a total of 200 Question-and-Answer pairs were generated using the language model. A representative subset is shown in
Figure 11. The system responses were then evaluated using the DeepEval framework [
37], a modern, open-source toolkit for assessing LLM outputs based on key quality dimensions.
The evaluation focused on four core metrics:
Answer Relevance,
Faithfulness,
Context Recall, and
Context Precision. The results, summarized in
Figure 12, revealed strong performance across all dimensions: 88.4% for Answer Relevance, 92.1% for Faithfulness, 80.2% for Context Recall, and 83.1% for Context Precision. These findings indicate that the system not only delivers highly relevant and factually accurate responses but also retrieves meaningful context while minimizing irrelevant noise.
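For reproducibility, the following minimal sketch shows how a single response could be scored against these four DeepEval metrics; the test-case contents are invented, and each metric defaults to an LLM-as-judge backend that requires model access.

```python
# Hedged sketch of a DeepEval evaluation run over one QA pair; contents
# are illustrative, not drawn from the 200-pair benchmark dataset.
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
)

test_case = LLMTestCase(
    input="Find a 15 A time-delay fuse rated for 250 VAC.",
    actual_output="FLNR015 matches: time-delay fuse, 250 VAC, 15 A.",
    expected_output="FLNR015",
    retrieval_context=["FLNR015: Fuse, Delay, 250 VAC, 15A, 200 kA"],
)

evaluate(
    test_cases=[test_case],
    metrics=[
        AnswerRelevancyMetric(),
        FaithfulnessMetric(),
        ContextualRecallMetric(),
        ContextualPrecisionMetric(),
    ],
)
```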
Although a subset of generated answers diverged from the expected phrasing, they generally conveyed the correct information and aligned well with user intent. These results affirm the effectiveness of our RAG-based pipeline for part lookup and recommendation tasks, with promising adaptability to other domains involving structured data and context-dependent retrieval.
Table 3 summarizes and contrasts the core features of the traditional SQL search, generic machine learning-based methods, the recent RALLRec [
38] framework, and our proposed LLM + RAG pipeline. The comparison highlights the capabilities of each approach in handling semantic queries, supporting context-aware recommendations, integrating privacy mechanisms, and operating in real-world industrial environments. Our method uniquely combines semantic understanding, advanced retrieval, and privacy features that are not present in other approaches.