1. Introduction
The rapid advancement of digital transformation and industrial automation has made intelligent customer service chatbots essential for enterprises seeking to improve efficiency and reduce manual workload. However, many conventional chatbots still rely on rigid rules or keyword matches, and they often fail when faced with unstructured, heterogeneous enterprise data. This gap is especially pronounced in the import trade industry, where critical information is stored in disparate formats such as PDF reports, Excel sheets, meeting minutes, and quality-control documents. These sources capture complex, shifting processes like inspection workflows, inventory changes, and return policies. In this context, traditional systems generally lack the semantic depth needed to generate context-aware and accurate responses.
Retrieval-Augmented Generation (RAG) offers a practical solution by retrieving relevant text segments from a vector database and combining them with the generative power of Large Language Models (LLMs) like GPT. This approach reduces hallucinations and improves factual accuracy. Prior work has shown RAG’s effectiveness in domain-specific settings, including agricultural technical support [1]. Yet document-level retrieval alone may miss deep semantic relationships across fragmented enterprise content. Knowledge Graphs (KGs) address this by explicitly modeling entities and relations, enabling cross-document reasoning and better semantic coherence in complex domains like financial reporting [2]. GPT-based evaluation and reinforcement mechanisms can further improve the system by assessing response quality and updating knowledge automatically [3].
Despite these advances, a critical research gap remains in aligning RAG architectures with complex enterprise logic. While traditional vector-based RAG excels at surface-level semantic retrieval, it is structurally ill-suited to multi-hop, conditionally constrained enterprise logic, such as cross-referencing fragmented quality anomaly reports with strict regulatory constraints. What is missing is a mechanism that bridges unstructured semantic representation with deterministic, rule-based reasoning and that is tailored to the volatile and heavily regulated import trade sector.
This paper uses design science research to develop and evaluate an artifact that addresses fragmented enterprise knowledge and inconsistent customer responses. The framework follows the IMPACT criteria [4], emphasizing:
Interesting: a self-reinforcing knowledge loop combining retrieval and structured relations;
Matching: alignment with the problem of cross-document semantic integration;
Parsimonious: a compact, complementary module set;
Applicable: relevance to operational tasks like anomaly handling and order tracking;
Conceptually rigorous: clear component definitions and roles;
Testable: comparative evaluation across system variants and query categories.
Following the paradigm of Design Science Research (DSR), this study aims to design and evaluate a multi-layered IT artifact—a KG-augmented RAG system—that addresses the “knowledge silos” inherent in import trade operations. Our design objective is twofold: to improve retrieval precision through structured relational constraints and to establish a self-reinforcing knowledge loop through automated evaluation. Specifically, we address the following research questions:
RQ1: How can Knowledge Graph constraints be integrated with RAG retrieval to improve response accuracy and practical applicability in cross-document enterprise queries?
RQ2: Can a dynamic evaluation and retention loop reduce self-confirmation bias and support continuous, automated enterprise knowledge enhancement?
The study has three objectives: (1) construct a multi-format data pipeline that integrates meeting notes and exception logs into a unified Neo4j vector and knowledge graph store; (2) implement a dynamic evaluation framework that scores responses on accuracy, relevance, and business value, then retains high-quality outputs; (3) empirically validate the system using real import-trade operational data for performance, coverage, and practical impact.
Following existing theory development typologies [5], this paper contributes to the field in three dimensions:
Theoretical Contribution: We extend the paradigm of enterprise knowledge management by conceptualizing enterprise knowledge not as a static repository, but as a dynamic, self-reinforcing loop where generative AI serves simultaneously as a retriever, reasoner, and evaluator.
Novelty: The novelty lies in the architectural synthesis of a dual-retrieval system (Vector + Graph) coupled with an automated “LLM-as-a-judge” reinforcement mechanism that autonomously evaluates and retains high-quality operational responses without continuous human intervention.
Interestingness: This study exposes how LLMs behave when confronted with conflicting or outdated internal corporate data, providing rare, real-world insights into AI reasoning trade-offs within heavily regulated business workflows.
The rest of the paper is organized as follows.
Section 2 reviews the literature on RAG, Knowledge Graphs, and LLM evaluation, and identifies gaps this work addresses.
Section 3 describes the system architecture, including data processing, graph construction, and GPT-based reinforcement.
Section 4 reports the experimental setup, metrics, and results from real-world data.
Section 5 concludes and discusses future integration with ERP systems and multilingual support.
2. Literature Review
The rapid evolution of intelligent customer service systems has been fundamentally driven by the convergence of information retrieval and generative artificial intelligence. This section explores the theoretical foundations of Retrieval-Augmented Generation (RAG), its expansion through Knowledge Graphs (KGs), and the integration of automated evaluation mechanisms that together form the basis of the proposed system.
The introduction of RAG by Lewis et al. marked a paradigm shift in addressing the inherent limitations of Large Language Models (LLMs), such as factual hallucinations and knowledge cut-off dates. By combining a Dense Passage Retriever (DPR) with generative architectures like BART, the RAG framework allows models to ground their responses in external, verifiable evidence [6]. However, while traditional RAG improves factual accuracy in open-domain tasks, it often struggles with the deep semantic hierarchies found in enterprise data. In the import trade industry, where information is fragmented across heterogeneous formats such as regulatory PDFs and shipping Excel files, simple vector-based retrieval frequently fails to capture the multi-layered relationships between entities. Consequently, recent scholarship has prioritized the integration of structured Knowledge Graphs to provide a more robust semantic backbone. Zhu et al. introduced KG²RAG, a framework that utilizes graph nodes for semantic expansion and entity linking, thereby enhancing the coherence of responses in queries involving complex causality [7]. Similarly, Lecu et al. demonstrated the efficacy of coupling RAG with structured medical graphs to ensure high-fidelity, source-aware generation in specialized domains [8], reinforcing the necessity of structured knowledge in minimizing risk for high-stakes enterprise applications.
Moving from theoretical frameworks to practical implementation, the development of enterprise-grade chatbots requires a focus on operational robustness and reasoning capabilities. The NVIDIA research team’s FACTS framework (Freshness, Architecture, Cost, Testing, Safety) serves as a critical benchmark, outlining fifteen control points for maintaining dialogue safety and data freshness in commercial environments [9]. Beyond text retrieval, the ability to reason over structured graphical data (such as internal organizational hierarchies or process flowcharts) remains a challenge. He et al. addressed this via the G-Retriever architecture, which transforms graph structures into natural language node-edge pairs, enabling LLMs to perform multi-hop reasoning over complex business logic [10]. These advancements underscore the potential for RAG-based systems to move beyond simple FAQ engines toward sophisticated, context-aware decision support tools.
Furthermore, the reliability of such systems is contingent upon continuous evaluation and reinforcement mechanisms. Current research indicates that static benchmarks are insufficient for the dynamic nature of enterprise queries. Yu et al. emphasized that evaluation metrics must evolve beyond simple similarity scores to include faithfulness and business-specific relevance [11]. To prevent the dissemination of incorrect information when data is missing, Peng et al. proposed the UAEval4RAG framework, which focuses on identifying “unanswerable” queries, a feature particularly vital for the volatile import trade sector where data updates may lag [12]. To ensure long-term system evolution, Nguyen et al. introduced Reward-RAG, utilizing a CriticGPT model to provide reinforcement signals for both the retriever and generator components [13]. By adopting such self-improving loops, enterprises can ensure their chatbot systems remain aligned with operational changes.
Table 1 synthesizes these findings, categorizing the core contributions of existing literature toward the development of advanced RAG architectures.
3. Research Methodology
3.1. Theoretical Anchor and Design Principles
This research is positioned within the framework of Design Science Research (DSR), which emphasizes the creation of innovative artifacts to solve practical problems while contributing to the body of scientific knowledge. The design of our KG-augmented RAG system follows the principles of iterative design and evaluation, ensuring that the artifact is grounded in both technical feasibility and operational utility. By integrating structured relational constraints with generative capabilities, we transition the enterprise chatbot from a simple retrieval mechanism to a logically grounded decision-support tool.
The methodology comprises three strategic layers:
1. Data Vectorization: Enterprise documents are segmented into text chunks and converted into vectors, which are then stored in a vector database to establish a retrievable contextual foundation.
2. Relational Structure Construction: QA pairs and document content are converted into knowledge graph nodes and semantic relationships to support query and reasoning over logically structured problems.
3. Response Reinforcement: GPT is used to evaluate response quality. Optimized responses are stored as new knowledge nodes and vectors, forming a self-reinforcing feedback loop.
These components are operationalized across five core modules: Data Preprocessing, QA Dataset Construction, Knowledge Graph Construction, RAG Framework Integration, and Response Evaluation. The architecture is shown in Figure 1.
3.2. Data Preprocessing
To develop an intelligent customer service system capable of navigating complex enterprise scenarios, a robust preprocessing workflow is established to transform heterogeneous internal documents into structured formats suitable for semantic retrieval. The data corpus for this study encompasses a diverse array of departmental resources, including quality assurance reports, sales records, warehousing logs, and service operations data. These documents, which primarily exist as unstructured or semi-structured PDF, DOCX, and XLSX files—such as inspection reports, meeting minutes, and customer complaint logs—lack standardized field definitions. The preprocessing pipeline is therefore designed to achieve format normalization, semantic chunking, and vector embedding, ensuring the data is prepared for Retrieval-Augmented Generation (RAG) and structured knowledge augmentation.
The initial phase of the pipeline focuses on document format normalization to mitigate the inconsistencies inherent in multi-departmental data sources. This involves the systematic extraction of content from PDF, Word, and Excel files into UTF-8 encoded plain text while stripping non-essential formatting. For tabular data, Excel headers are harmonized and renamed to ensure consistency, with rows being restructured into coherent, paragraph-like records. Furthermore, informal conversation logs from platforms such as LINE are cleaned to remove noise and duplicate entries. Throughout this process, critical metadata, including original filenames and departmental labels, are preserved as annotations to facilitate downstream semantic indexing and maintain data provenance.
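The tabular part of this normalization step can be illustrated with a minimal sketch. The alias table, the function names, and the `field: value` output style are assumptions made for illustration; the paper does not specify the actual header mapping or record layout.

```python
# Hypothetical header aliases; the real mapping used in the study is not given.
HEADER_ALIASES = {
    "item": "product",
    "product name": "product",
    "temp": "temperature_c",
    "temperature": "temperature_c",
    "qty": "quantity",
}


def harmonize_header(name: str) -> str:
    """Lower-case, collapse whitespace, and map known aliases to one name."""
    key = " ".join(name.strip().lower().split())
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def row_to_record(headers, row, source_file, department):
    """Turn one spreadsheet row into a paragraph-like record with provenance."""
    fields = [
        f"{harmonize_header(h)}: {str(v).strip()}"
        for h, v in zip(headers, row)
        if v is not None and str(v).strip() != ""  # skip empty cells
    ]
    return {
        "text": "; ".join(fields) + ".",
        "source_file": source_file,   # preserved as metadata
        "department": department,     # preserved as metadata
    }


record = row_to_record(
    ["Product Name", " Temp ", "Qty"],
    ["Frozen Mango", -18, None],
    source_file="inbound.xlsx",
    department="Warehouse",
)
print(record["text"])
```

Keeping the filename and department label on every record is what allows later retrieval results to be traced back to their source.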
Following normalization, the text corpus is subjected to semantic chunking to optimize it for chunk-based vector retrieval. Enterprise documents are segmented into units of 300 to 800 characters, with boundaries defined by paragraph structures and semantic transitions. For exceptionally long passages, further subdivision is performed using punctuation and list markers to ensure that each segment maintains a focused thematic scope. Each resulting chunk is assigned a unique identifier that references its source file and relative position (e.g., 202502Month-Meeting-Para3), enabling precise tracking during the generation phase.
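One way to realize this chunking rule is sketched below. It is an assumption-laden illustration rather than the study's implementation: the regular expression, the greedy merging of short paragraphs, and the use of a running chunk index in the `Para` suffix are all choices made here for clarity.

```python
import re

MIN_CHARS, MAX_CHARS = 300, 800

# Split after CJK sentence punctuation, after Western punctuation followed by
# whitespace, or at line breaks (which also separates list items).
SPLIT_RE = re.compile(r"(?<=[。！？；])|(?<=[.!?;])\s+|\n+")


def pack(pieces, max_chars=MAX_CHARS):
    """Greedily pack sentence-level pieces into segments of at most max_chars."""
    chunks, current = [], ""
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        while len(piece) > max_chars:  # hard cut only as a last resort
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:]
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def chunk_document(text, doc_id, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Chunk on paragraph boundaries, subdividing long paragraphs and
    merging short ones so that most chunks fall within the size range."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    units = []
    for p in paragraphs:
        units.extend([p] if len(p) <= max_chars else pack(SPLIT_RE.split(p), max_chars))

    chunks, current = [], ""
    for unit in units:
        candidate = f"{current}\n{unit}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = unit
        else:
            current = candidate
        if len(current) >= min_chars:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)  # a trailing short chunk is kept, not dropped

    # Identifier = source document plus relative position.
    return [{"id": f"{doc_id}-Para{i}", "text": c} for i, c in enumerate(chunks, 1)]
```

Under this sketch the upper bound is strict, while the lower bound is best-effort: a short passage that cannot be merged without exceeding 800 characters is emitted on its own.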
In the final stage, these semantic chunks are transformed into high-dimensional numerical representations through vector embedding and stored within a Neo4j database. Each chunk is converted into a 1536-dimensional vector using the text-embedding-ada-002 model, allowing the system to capture deep semantic nuances. These vector nodes, along with their associated metadata—such as original text, source identifiers, and category labels—are indexed within Neo4j to support rapid semantic similarity searches. This integrated storage architecture not only underpins the contextual retrieval required for RAG generation but also provides a foundation for graph construction and QA dataset alignment. By leveraging this structured approach, the proposed chatbot can deliver traceable and verifiable responses firmly grounded in the enterprise’s unique knowledge base.
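The similarity search that this storage layout supports can be illustrated in miniature. The sketch below is not the Neo4j index itself: it uses an in-memory list of hypothetical chunk nodes and toy 3-dimensional vectors in place of the 1536-dimensional embeddings, and the field names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, nodes, k=3):
    """Rank stored chunk nodes by similarity to the query embedding."""
    scored = [(cosine(query_vec, n["embedding"]), n) for n in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"id": n["id"], "source": n["source"], "score": round(score, 4)}
        for score, n in scored[:k]
    ]


# Toy nodes; a real node would also carry the original text and category label.
nodes = [
    {"id": "c1", "source": "cold-storage-sop.pdf", "embedding": [1.0, 0.0, 0.0]},
    {"id": "c2", "source": "meeting-minutes.docx", "embedding": [0.9, 0.1, 0.0]},
    {"id": "c3", "source": "returns-policy.pdf", "embedding": [0.0, 1.0, 0.0]},
]
print(top_k([1.0, 0.0, 0.0], nodes, k=2))
```

Returning the source identifier alongside each score is what makes a generated answer traceable to the document it was drawn from.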
3.3. QA Dataset Construction
To enable the RAG system to provide contextually logical and targeted responses to frequently asked questions in real enterprise scenarios, this study designs and constructs an enterprise-specific QA dataset based on internal knowledge sources. This dataset serves as a critical knowledge foundation for the retrieval augmentation stage, particularly in supplementing short-form queries that are often difficult to capture through standard semantic vector retrieval alone. By structuring institutional knowledge into discrete question-and-answer pairs, the system ensures higher precision in addressing domain-specific operational inquiries.
The QA dataset is systematically compiled from a variety of internal materials to ensure comprehensive coverage of the enterprise’s operational landscape. Primary data sources include monthly management meeting minutes, warehousing and quality assurance Standard Operating Procedures (SOPs), and customer complaint records derived from group conversation logs, such as LINE customer chats. Additionally, historical dialogue records concerning sales and shipping inquiries, along with frequently asked questions compiled by department supervisors, were utilized to capture the most pressing informational needs of the organization.
The construction process utilized a specialized prompt engineering strategy to transform these raw materials into structured data. GPT was instructed to simulate an internal manager familiar with all operational procedures, tasked with generating practical “questions and answers” based on specific knowledge boundaries, such as cold storage management and return procedures. This approach followed strict design principles: ensuring all content remained within the provided knowledge limits, maintaining independent interpretability for each question, and anchoring answers in documented institutional standards. The specific system prompt used for this generation process is detailed in Listing 1.
Listing 1. System Prompt. [Listing image not reproduced.]
To optimize retrieval efficiency, the dataset is categorized into five critical domains: Inbound Inspection, Frozen Warehouse Management, Quality Anomaly Handling, Shipping and Logistics, and Returns and Pricing. These categories address specific operational pain points, ranging from packaging integrity and temperature control to accountability for contaminated goods and substitution mechanisms. Each pair underwent manual revision to ensure linguistic clarity and actionability. For example, in the domain of quality handling, responses are designed to provide clear protocols, such as conditional acceptance and specific documentation requirements for damaged packaging.
Table 2 presents representative examples of the final dataset across these categories.
Finally, to support seamless integration with the Neo4j knowledge graph and vector database, all entries are stored in a structured JSON format. This formatting allows for the inclusion of critical metadata, such as source references, keywords, and importance levels, which enhances the system’s ability to provide traceable and standardized responses. A sample of this structured format is provided in Listing 2.
Listing 2. JSON Code of a Translated QA Dataset Sample for Cold Storage Management. [Listing image not reproduced.]
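As a rough illustration of how such an entry could be built and validated before storage, consider the sketch below. The field names (`question`, `answer`, `category`, `source`, `keywords`, `importance`) and the sample values are hypothetical; only the five category names and the kinds of metadata are taken from the text.

```python
import json
from dataclasses import dataclass, asdict, field

# The five domains named in the paper.
CATEGORIES = {
    "Inbound Inspection",
    "Frozen Warehouse Management",
    "Quality Anomaly Handling",
    "Shipping and Logistics",
    "Returns and Pricing",
}


@dataclass
class QAEntry:
    """One QA pair plus metadata; field names are illustrative assumptions."""
    question: str
    answer: str
    category: str
    source: str
    keywords: list = field(default_factory=list)
    importance: str = "medium"

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def to_json(entry: QAEntry) -> str:
    """Serialize without escaping non-ASCII text such as Chinese source names."""
    return json.dumps(asdict(entry), ensure_ascii=False, indent=2)


entry = QAEntry(
    question="What temperature must the cold storage maintain?",
    answer="Frozen goods must be kept below -18 °C.",
    category="Frozen Warehouse Management",
    source="Cold Storage Inbound and Outbound Process",
    keywords=["cold storage", "temperature"],
    importance="high",
)
print(to_json(entry))
```

Validating the category at construction time keeps entries routable to one of the five domains before they reach the graph or vector store.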
3.4. Knowledge Graph Construction
To enhance the semantic reasoning and complex knowledge querying capabilities of the enterprise customer service system, this study introduces a knowledge graph as a structural medium for knowledge storage and retrieval. By designing nodes and edges in the graph, internal enterprise documents, workflows, anomaly cases, root causes, and product attributes can be semantically linked and organized. This structure supports the RAG response stage in semantic indexing, information expansion, and inferential answering.
To convert internal enterprise documents and the corresponding QA dataset into graph format, this study designs a set of GPT-assisted annotation prompts for extracting node relationships. The objective is to assist the manual transformation of raw knowledge documents into graph-format relational triples. The prompt is shown in Listing 3.
Listing 3. Knowledge Graph Prompt. [Listing image not reproduced.]
Based on multi-source internal enterprise documents, this study constructs a domain-specific knowledge graph encompassing operational standards and quality anomalies. This serves as a semantic backbone to support customer service functions such as semantic reasoning and multi-hop querying. In terms of implementation design, the construction of the knowledge graph primarily—but not exclusively—focuses on two structural dimensions, corresponding to two typical application scenarios in enterprise information management: “anomaly traceability” and “operational verification,” as described below:
Product Anomalies: This part is primarily derived from anomaly handling forms and quality assurance records. Through semantic extraction procedures, the system constructs triples in the form of “Product → Anomaly Occurrence → Root Cause Traceability.” Each anomaly event corresponds to a specific product (e.g., “Frozen Pineapple Chunks”), anomaly phenomenon (e.g., “Peel Residue,” “Eyespots”), and a traceable root cause explanation (e.g., “Improper peeling procedure,” “Raw material not properly controlled”). This chain of product anomalies enables the customer service system to provide contextual responses regarding anomaly types, potential risks, and recommended handling procedures during product-related inquiries. It also lays a semantic foundation for future enterprise quality management and risk control.
Operational Standards: This component derives its semantics from internal workflow regulations, operational manuals, and audit guidelines. It constructs structures in the form of “Operational Behavior → Inspection Item → Validation Condition,” covering key operational nodes such as warehousing, packaging, shipping, cold chain logistics, and returns. For example, from the “Cold Storage Inbound and Outbound Process,” the following triple can be extracted: “Cold Storage Handling” → “Storage Temperature” → “Maintain below −18 °C.” Likewise, from the “Packaging Inspection Guidelines”: “Seal Confirmation” → “Outer Box Integrity” → “Check for wet box affecting cold integrity.” These linked regulation nodes are particularly effective for handling institutional queries such as “What items must be inspected during goods receipt?” or “Can products with improper sealing be accepted?” The system can respond with rule-based standard answers derived from the knowledge graph, enhancing both semantic accuracy and practical applicability.
To realize the integration of semantic retrieval and inferencing capabilities, the constructed knowledge graph in this study comprises approximately 11,253 entity nodes and 8,986 semantic relationship edges. The extraction of triples strictly follows a domain-specific ontology designed for the import trade industry. As detailed in Table 3, we enforce strict Entity Typing, including Product, Issue, Cause, Document, Process, Concept, Condition, and Rule nodes. Similarly, Relation Typing is restricted to a predefined set of semantically clear directed edges such as CausedBy, checkItem, storeRule, and Process. This standardized typing prevents the proliferation of ambiguous nodes and ensures multi-hop paths remain logically traversable.
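A typing constraint of this kind can be enforced with a simple check before a triple is committed. The sketch below is a hypothetical illustration: the dictionary layout of a triple is assumed, and the relation set contains only the four examples named above, whereas the full set is defined in Table 3.

```python
ENTITY_TYPES = {
    "Product", "Issue", "Cause", "Document",
    "Process", "Concept", "Condition", "Rule",
}

# Only the relation types named in the text; the complete list is in Table 3.
RELATION_TYPES = {"CausedBy", "checkItem", "storeRule", "Process"}


def validate_triple(triple):
    """Return a list of problems; an empty list means the triple conforms."""
    problems = []
    for role in ("head", "tail"):
        if not str(triple.get(role, "")).strip():
            problems.append(f"{role} is empty")
        node_type = triple.get(f"{role}_type")
        if node_type not in ENTITY_TYPES:
            problems.append(f"{role}_type {node_type!r} is not in the ontology")
    if triple.get("relation") not in RELATION_TYPES:
        problems.append(f"relation {triple.get('relation')!r} is not in the ontology")
    return problems


good = {
    "head": "Peel Residue", "head_type": "Issue",
    "relation": "CausedBy",
    "tail": "Improper peeling procedure", "tail_type": "Cause",
}
bad = {
    "head": "Peel Residue", "head_type": "Symptom",
    "relation": "maybeRelatedTo",
    "tail": "", "tail_type": "Cause",
}
print(validate_triple(good))
print(validate_triple(bad))
```

Rejecting untyped or loosely related triples at this point is one way to keep ambiguous nodes out of the graph.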
To ensure semantic validity and extraction quality, a hybrid verification protocol was implemented. Given the high-stakes nature of enterprise SOPs, the initial graph construction used a GPT-assisted manual transformation pipeline with two layers of control:
1. Prompt-Level Constraints: The extraction model is strictly instructed via system prompts (Listing 3) to “Only create triples when a clear and reasonable relationship can be identified” and to explicitly avoid “speculative relationships based on logical jumps.”
2. Human-in-the-loop Verification: Approximately 20% of the generated triples, as well as all nodes related to safety-critical storage rules, were manually audited by domain experts and department heads to confirm relationship accuracy and alignment with internal audit guidelines before being committed to the database.
Each node is also encoded with a corresponding semantic vector representation, facilitating multimodal linkage between vector data and graph-based knowledge. This structure is specifically designed to enhance multi-hop reasoning performance. By expanding the search scope from an initial entity node across adjacent relationship edges, the system can autonomously trace causal chains (e.g., from an inspection outcome back to the operational handling rule). This allows the RAG workflow to perform high-efficiency queries and responses based on both semantic similarity and structural association, elevating the contribution from merely application-level retrieval to algorithmic structural traversal.
Figure 2 illustrates a representative segment of the constructed knowledge graph focusing on mango products. The graph displays a complete semantic chain: Mango → Inspection Report Content → Inspection Report Values → Inspection Result. By establishing these structured relationships, the system can autonomously trace relevant inspection items and historical reports when processing queries such as “What are the test values for mango inspection?” This multi-hop reasoning capability significantly enhances the factual accuracy and interpretability of the generated responses. To maintain the absolute authenticity of the enterprise’s internal operations and prevent the loss of domain-specific semantic nuances, the data nodes in this visualization are retained in their original Chinese format as stored in the actual warehouse management system. This presentation highlights the system’s robust capability to process and reason over localized, multilingual data within a real-world international trade context. For the reader’s convenience, the core logic and node functions are detailed in the following analysis using English terminology to ensure cross-linguistic conceptual clarity.
3.4.1. Case Study (1): Node Relationships for Anomaly Handling
This case uses an actual anomaly handling document to demonstrate how a product quality anomaly event can be transformed into knowledge graph nodes and semantic relationships, as shown in Table 4. Taking the “Pineapple Chunk Anomaly Handling Document” as an example, the document records the anomaly phenomenon “peel residue” and further explains its root cause as “failure to wash before peeling.” This structure is decomposed into the following three layers of node relationships:
Document → Involves Product (Pineapple Chunks)
Product → Found Issue (Peel Residue)
Anomaly Phenomenon → Caused By (Unwashed Peeling)
Through this semantic construction of nodes, the system can not only identify the origin of the anomaly and the affected object but also supply relevant background information in real time when similar queries arise. For instance, when a user asks, “Why is there a quality issue with pineapple chunks?”, the chatbot can use the knowledge graph to trace back the root cause and responsible party and to suggest operational improvements. This design supports the quality assurance department in rapidly identifying risk types and formulating corresponding corrective measures, thereby improving decision-making efficiency.
By leveraging this type of structured knowledge, when querying about anomalies, the chatbot can quickly determine their causes and assign responsibility, thereby assisting in rapid decision-making.
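The root-cause tracing in this case study can be sketched as a breadth-first expansion over the three relationships listed above. The in-memory adjacency index stands in for the graph database, and the function names are illustrative assumptions.

```python
from collections import defaultdict, deque

# The three node relationships from the pineapple chunk case study.
TRIPLES = [
    ("Pineapple Chunk Anomaly Handling Document", "Involves Product", "Pineapple Chunks"),
    ("Pineapple Chunks", "Found Issue", "Peel Residue"),
    ("Peel Residue", "Caused By", "Unwashed Peeling"),
]


def build_index(triples):
    """Map each head node to its outgoing (relation, tail) edges."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
    return index


def trace(index, start, max_hops=3):
    """Breadth-first expansion from a start node.

    Returns every path found, each as a list of (relation, node) steps.
    """
    paths, queue, seen = [], deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue
        for relation, nxt in index.get(node, []):
            if nxt in seen:  # avoid revisiting nodes in cyclic graphs
                continue
            seen.add(nxt)
            new_path = path + [(relation, nxt)]
            paths.append(new_path)
            queue.append((nxt, new_path))
    return paths


index = build_index(TRIPLES)
for path in trace(index, "Pineapple Chunks"):
    print(" -> ".join(f"[{rel}] {node}" for rel, node in path))
```

Starting from the product node, the two-hop path ends at the root cause, which is the information needed to answer a question such as “Why is there a quality issue with pineapple chunks?”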
3.4.2. Case Study (2): Knowledge Graph for Quality Assurance and Cold Storage Management
This case focuses on internal operational documents and cold storage management regulations, transforming them into node and rule-based relationship graphs as shown in Table 3, which serve as a reference for daily operational queries and process verification. For example, in the document “Warehouse Quality Operations Workflow,” it is explicitly stated that the inbound inspection process must include the check for “packaging integrity,” and the conditions for integrity include “no external damage, moisture, or deformation to the outer box.” In addition, the “Cold Storage Handling” process is further decomposed into two mandatory conditions: a storage temperature requirement (−18 °C) and a first-in, first-out (FIFO) rule.
Such rules and procedures can be semantically validated through knowledge graph query syntax. For example, when customers ask questions such as “Can it be placed in cold storage if it is scheduled for production?” or “Can batches with newer expiration dates be prioritized for shipment?”, the system can generate complete and normative answers through the logical knowledge graph node relationships: Cold Storage → Temp → −18 °C and Cold Storage → (Rule) → FIFO.
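How these two rule paths translate into deterministic checks can be shown with a small sketch. The rule table mirrors the two paths named above; the batch fields, the function names, and the choice to treat exactly −18 °C as compliant are assumptions for illustration.

```python
from datetime import date

# Rule values as they would be read from the two graph paths:
# Cold Storage -> Temp -> -18 °C and Cold Storage -> (Rule) -> FIFO.
RULES = {
    ("Cold Storage", "Temp"): -18.0,
    ("Cold Storage", "Rule"): "FIFO",
}


def temperature_ok(reading_c: float) -> bool:
    """True when a reading satisfies the storage rule.

    Assumption: a reading of exactly -18 °C counts as compliant.
    """
    return reading_c <= RULES[("Cold Storage", "Temp")]


def next_batch_to_ship(batches):
    """Under FIFO, the batch with the earliest inbound date ships first."""
    assert RULES[("Cold Storage", "Rule")] == "FIFO"
    return min(batches, key=lambda b: b["inbound_date"])


# Hypothetical batches; field names are not taken from the paper.
batches = [
    {"batch": "B-102", "inbound_date": date(2025, 2, 10)},
    {"batch": "B-097", "inbound_date": date(2025, 1, 22)},
]
print(temperature_ok(-20.5), next_batch_to_ship(batches)["batch"])
```

Because the answer is derived from a stored rule rather than from similar text, a question about prioritizing newer batches receives the same answer every time: the older batch ships first.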
The application of this knowledge graph not only enhances the generative logic of the RAG system, but also serves to reinforce QA responses and improve transparency in audit workflows. In the future, it can also be used for automatic document generation, process node tracking, and rule compliance checking.
This knowledge graph can serve as an important foundation for proactively providing quality recommendations and process verification in future systems, and can also assist in assessing the logical correctness of expressions in knowledge-based question answering.
3.5. RAG Framework
To build a customer service chatbot that can reliably answer internal enterprise questions, this study adopts a retrieval-augmented generation (RAG) framework that couples semantic retrieval with controlled response generation. Specifically, the system employs a two-stage fusion strategy: an initial vector retrieval phase followed by graph-based semantic filtering and expansion. The overall pipeline proceeds from query normalization, to multi-source evidence retrieval, and finally to grounded response generation, supported by both a vector database (for unstructured documents) and a knowledge graph (for rule/relationship reasoning).
After receiving a user query, the chatbot performs intent classification, key term extraction, and sentence normalization to map the input into a standardized representation. This representation is used to route the request to an appropriate handling strategy (e.g., predefined QA, process inquiry, quality anomaly handling, or cross-document retrieval). When no exact match is available, the system enters the semantic retrieval stage and gathers supporting evidence from complementary sources:
1. Vector database (unstructured evidence): Internal PDFs, spreadsheets, and meeting notes are segmented into passages and embedded into vectors; approximate nearest-neighbor search returns the top-k semantically similar passages as candidate evidence.
2. Curated QA dataset (high-precision answers): For frequent questions covered by a precompiled QA entry, the corresponding answer is prioritized to ensure deterministic and consistent responses.
3. Knowledge graph (relational constraints): When the query aligns with graph entities/relations, relevant nodes and paths are retrieved to provide rule-based context such as process constraints, parameter ranges, and precedence logic.
These retrieved passages, QA entries, and graph-derived facts are then consolidated into a single context package for generation.
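The consolidation step can be sketched as follows. The labels, the character budget, and the ordering of graph facts ahead of passages are assumptions; the paper states only that curated QA answers are prioritized and that the three sources are merged into one context package.

```python
def build_context(qa_hits, graph_facts, passages, max_chars=2000):
    """Merge evidence from the three sources into one labeled context string.

    Curated QA answers come first, then graph-derived facts, then retrieved
    passages. Duplicates are dropped and the total length is capped.
    """
    sections = [("QA", qa_hits), ("GRAPH", graph_facts), ("PASSAGE", passages)]
    lines, seen, used = [], set(), 0
    for label, items in sections:
        for item in items:
            text = " ".join(item.split())  # normalize whitespace
            if not text or text in seen:
                continue
            line = f"[{label}] {text}"
            if used + len(line) > max_chars:
                return "\n".join(lines)  # budget exhausted
            lines.append(line)
            seen.add(text)
            used += len(line) + 1  # +1 for the newline separator
    return "\n".join(lines)


context = build_context(
    qa_hits=["Store frozen goods below -18 C."],
    graph_facts=["Cold Storage -> Temp -> -18 C", "Cold Storage -> Rule -> FIFO"],
    passages=["Minutes:  cold room  audit passed.", "Store frozen goods below -18 C."],
)
print(context)
```

Labeling each line by source lets the generation prompt, and later an auditor, see which kind of evidence a statement rests on.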
For response generation, GPT is used as the generation engine and is guided by a structured prompt that explicitly separates the problem description from the retrieved evidence summary. This design enforces grounding (answer only using the provided context), operational usefulness (actionable steps), and traceability (referencing relevant procedures/standards), as shown in Listing 4.
Listing 4. RAG Response Generation Prompt. [Listing images not reproduced.]
By coupling multi-source retrieval with constrained prompting, the chatbot produces responses that are grounded, semantically complete, and consistent with the enterprise’s documented procedures and rule logic.
3.6. Response Quality Evaluation and Knowledge Reinforcement Mechanism
In addition to possessing retrieval-augmented generation capabilities, the customer service chatbot proposed in this study further incorporates a GPT-based proactive evaluation mechanism termed the “knowledge reinforcement and retention mechanism.” This mechanism dynamically updates the knowledge base according to the quality of the generated responses. Through continuous dialog-based learning, it forms a semantic knowledge system for enterprises that is both adaptive and traceable. This section provides a detailed explanation of the evaluation metric design, retention workflow, and growth strategy.
3.6.1. Design of Response Evaluation Metrics
To evaluate the quality of responses generated by the chatbot, this study draws on the self-evaluation framework for large language models proposed by Zheng et al., which uses the GPT model as a standardized evaluator. This framework conducts multi-dimensional subjective assessments of the generated results of various tasks using consistent criteria, addressing the high subjectivity and time cost of human evaluation [14]. Based on this, a domain-specific response evaluation metric tailored to the import trading context is developed. It consists of four dimensions, each scored on a scale of 1 to 10, with a total maximum score of 40. The indicators are defined as follows:
1. Accuracy: Whether the response aligns with the original query and correctly cites background information.
2. Consistency: Whether the response maintains logical consistency with the retrieved context without deviating from the topic.
3. Applicability: Whether the response provides practical value and actionable guidance for business operations. This metric serves as an automated proxy for evaluating the density of “actionable information points” and operational step density.
4. Fluency: Whether the response is clearly and naturally articulated without redundant expressions.
To mitigate the inherent self-confirmation bias of using a GPT judge, a calibration step was conducted in which a small, random subset of responses was independently scored by human domain experts. The strong alignment between the human assessments and the model-based scores confirmed the reliability of the GPT judge in distinguishing actionable enterprise advice from mere verbosity. The evaluation results serve as the basis for determining whether the response should be stored in the knowledge base. The GPT model scores each response according to the designed evaluation prompt and returns the four individual scores together with the total score; the total is then compared against the retention threshold. Listing 5 presents an example of the scoring prompt design.
Listing 5. GPT Response Quality Evaluation Prompt.
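As an illustration of how the four scores and their total might be handled downstream, the sketch below validates an evaluator reply and sums the dimensions. It assumes, purely for the example, that the judge is asked to reply with a JSON object keyed by the four dimension names; that reply format, the names `EvaluationScores` and `parse_scores`, and the sample values are hypothetical and are not taken from Listing 5.

```python
import json
from dataclasses import dataclass

# The four dimensions defined in Section 3.6.1 (key spelling is an assumption).
DIMENSIONS = ("accuracy", "consistency", "applicability", "fluency")


@dataclass(frozen=True)
class EvaluationScores:
    accuracy: float
    consistency: float
    applicability: float
    fluency: float

    @property
    def total(self) -> float:
        # Maximum is 4 x 10 = 40.
        return self.accuracy + self.consistency + self.applicability + self.fluency


def parse_scores(raw_reply: str) -> EvaluationScores:
    """Parse a JSON reply such as '{"accuracy": 9, ...}' (assumed format)."""
    data = json.loads(raw_reply)
    values = {}
    for name in DIMENSIONS:
        value = data[name]  # raises KeyError if a dimension is missing
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 10:
            raise ValueError(f"{name} must be a number from 1 to 10, got {value!r}")
        values[name] = value
    return EvaluationScores(**values)


if __name__ == "__main__":
    reply = '{"accuracy": 9, "consistency": 8, "applicability": 8, "fluency": 9}'
    scores = parse_scores(reply)
    print(scores, scores.total)
```

Rejecting out-of-range or missing scores before computing the total keeps a malformed judge reply from silently passing or failing the retention decision.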
3.6.2. Dynamic Retention and Knowledge Reinforcement Workflow
After a response has been generated and evaluated, the chatbot system executes the following actions based on the obtained score:
If the total score meets the predefined threshold (default: 32 points), the question-answer pair is stored in a standardized format as a vectorized text fragment and added to the vector database for future retrieval. This empirical threshold was chosen based on an “average 8/10” heuristic across the four dimensions (4 × 8 = 32). Extensive prototyping with enterprise domain experts identified an average of 8 per dimension as the optimal “sweet spot” for reliability and operational utility. Thresholds above 36 were found to be overly restrictive, often rejecting factually correct answers due to minor stylistic or fluency issues, while thresholds below 30 increased the risk of introducing logically inconsistent content into the knowledge base.
If the response includes explicit conceptual nodes and logical relationships, corresponding triples (e.g., “Product → Issue Detected → Cause”) are extracted and stored in the knowledge graph, enriching the graph with new nodes and relations.
This response retention process does not merely expand a static database, but rather builds a dynamic “learning-based knowledge” system that evolves based on the quality of responses. To prevent “graph poisoning” and ensure long-term stability, the system incorporates structural error-handling and traceability mechanisms:
1. Threshold-Based Gatekeeping: New triples are only integrated into the knowledge graph if the cumulative GPT evaluation score meets the strict threshold of 32/40. Responses scoring below this threshold are explicitly rejected and tagged as “Not Stored” to prevent the introduction of inconsistent or unverified operational nodes.
2. Metadata Tagging and Rollback: Every newly integrated node and relation is permanently tagged with metadata, including the source query identifier, retrieval context, and the corresponding evaluation score. This traceability allows system administrators to efficiently execute backward retrieval and perform surgical “rollbacks” or revisions if enterprise SOPs change or if a logical error is later identified, without requiring a full rebuild of the graph.
Each high-quality reply may serve as foundational material for future responses, forming a continuous optimization loop.
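The gatekeeping and rollback behavior described in this subsection can be sketched with a small in-memory store. This is a minimal illustration, not the system's implementation: the class names `KnowledgeStore` and `StoredTriple`, the use of a Python list in place of a graph database, and the sample triple are all assumptions. Only the 32/40 threshold, the “Not Stored” tag, and the three metadata fields come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

RETENTION_THRESHOLD = 32  # default threshold stated in the text (out of 40)

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


@dataclass
class StoredTriple:
    triple: Triple
    query_id: str           # source query identifier
    retrieval_context: str  # context the answer was generated from
    score: float            # total evaluation score at storage time


@dataclass
class KnowledgeStore:
    """In-memory stand-in for the knowledge graph (hypothetical)."""
    triples: List[StoredTriple] = field(default_factory=list)

    def retain(self, query_id: str, context: str, score: float,
               triples: Sequence[Triple]) -> str:
        # Gatekeeping: anything below the threshold is rejected outright.
        if score < RETENTION_THRESHOLD:
            return "Not Stored"
        for triple in triples:
            self.triples.append(StoredTriple(triple, query_id, context, score))
        return "Stored"

    def rollback(self, query_id: str) -> int:
        """Remove every triple that originated from one query; return the count."""
        before = len(self.triples)
        self.triples = [t for t in self.triples if t.query_id != query_id]
        return before - len(self.triples)


if __name__ == "__main__":
    store = KnowledgeStore()
    example = [("Product", "Issue Detected", "Cause")]  # pattern quoted in the text
    print(store.retain("q-001", "placeholder context", 34, example))
    print(store.retain("q-002", "placeholder context", 31, example))
    print(store.rollback("q-001"), len(store.triples))
```

Because each stored triple carries its source query identifier, a rollback is a filter over metadata rather than a rebuild, which is the property the traceability design relies on.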
3.6.3. Design of Knowledge Reinforcement and Growth Traceability Path
Each successfully stored response records its corresponding source query, retrieval context, and detailed evaluation scores. A storage flag and score identifier are also assigned. In the event of future updates to internal business procedures, these metadata allow for efficient backward retrieval and revision of previous responses via the retrieval mechanism. This structure functions as the foundational mechanism for the chatbot’s self-expanding knowledge capability. It not only enhances long-term response quality but also establishes controllability and traceability for the growth of the enterprise-specific semantic knowledge base.
4. Experimental Results and Analysis
This section reports the experimental setup and provides a comprehensive analysis of the proposed enterprise customer service chatbot. We first describe the evaluation objectives, test data, and system configurations, and then present quantitative results (BLEU/ROUGE-L and GPT-based subjective scoring) followed by qualitative case studies and error analysis to explain the observed performance differences.
4.1. Experimental Design and Objectives
This study validates the effectiveness of the proposed customer service chatbot (RAG + knowledge graph + enterprise-specific QA dataset) for complex, real-world enterprise queries. Three configurations are compared to isolate the contribution of retrieval and structured reasoning: (1) LLM-only generation, (2) RAG-only generation with vector retrieval, and (3) the proposed RAG + KG system. The analysis combines quantitative automatic metrics with model-based subjective scoring, and is further supplemented by qualitative case studies to explain observed behavioral differences.
For evaluation, a GPT-based judge is adopted to score response quality across multiple dimensions, while BLEU and ROUGE-L are used to quantify similarity to reference answers. BLEU measures n-gram precision and is commonly used to assess lexical and syntactic overlap [15], whereas ROUGE-L is based on the longest common subsequence and better reflects structural and semantic preservation in longer responses [16]. These metrics provide a consistent basis for comparing the three system variants, as detailed in the following subsection.
Dual-Layered Evaluation Framework
To ensure strong methodological defensibility, this study adopts a Dual-Layered Hybrid Evaluation Framework. Rather than relying exclusively on a generative LLM judge or traditional overlap metrics, we utilize the GPT evaluator and overlap-based text metrics as mutual counterbalances:
1. Deterministic Policy Compliance (BLEU/ROUGE-L): In enterprise environments, exact wording is often a strict requirement for regulatory compliance. Deterministic metrics like BLEU and ROUGE-L act as a rule-based anchor, checking that specific factual expressions (such as temperature thresholds or legal clauses) are preserved exactly as documented in the human-verified reference text.
2. Dynamic Semantic Reasoning (GPT Judge): To address the limitations of n-gram overlap in capturing latent semantic relationships, a calibrated GPT evaluator assesses high-level dimensions such as “Applicability” and “Consistency.”
This dual-layered approach lets the deterministic precision of lexical metrics and the subjective reasoning of the LLM judge compensate for each other, providing a rigorous assessment of both regulatory compliance and semantic logic.
4.2. BLEU and ROUGE-L Evaluation Metrics
In Natural Language Generation (NLG) tasks, evaluating the quality of system-generated responses has long been a challenge. To enable objective quantification, this study incorporates two widely accepted standard metrics for RAG and semantic generation tasks, BLEU and ROUGE-L, as part of the evaluation of the three experimental versions. Both are automated evaluation methods with high scalability and consistency, suitable for large-scale response data assessment scenarios. While recognizing that enterprise QA prioritizes factual correctness and grounding, BLEU and ROUGE-L are utilized here as essential supplementary indicators. In strict corporate SOPs (e.g., temperature thresholds like −18 °C or precise inbound inspection checklists), certain prescriptive phrases and domain terminologies must be generated exactly as documented. BLEU measures n-gram precision and exact lexical overlap, indicating whether such regulatory terminology is preserved without being overly paraphrased. Similarly, ROUGE-L evaluates structural preservation against human-verified ground truths. The primary assessment of grounding precision and hallucination rate is explicitly handled by the “Accuracy” and “Consistency” dimensions of the multi-dimensional GPT-based evaluation (human-calibrated) described in Section 3.6.1.
BLEU. BLEU is a metric based on n-gram overlap (i.e., n-gram precision) used to quantify the lexical and structural similarity between a generated response and a set of reference answers. It is sensitive to word order, phrases, and syntactic accuracy, and is therefore suitable for tasks in which responses are expected to match key factual expressions.
The basic BLEU computation formula is as follows:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where the symbols are defined as follows:
$p_n$: Denotes the precision of the n-gram.
$w_n$: Denotes the weight of the n-gram, typically equally distributed, i.e., $w_n = 1/N$.
$BP$: Denotes the brevity penalty term, used to penalize responses that are too short. It is computed as follows:

$$BP = \begin{cases} 1, & c > r \\ e^{\,1 - r/c}, & c \le r \end{cases}$$

where $c$ represents the length of the generated candidate response and $r$ represents the length of the reference answer. The BLEU score ranges from 0 to 1 (or converted to a 0–100 scale), with higher values indicating greater lexical overlap between the generated and reference texts.
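To make the BLEU definition concrete, the following sketch computes a single-reference, unsmoothed BLEU score with uniform weights and the brevity penalty described above. It is an illustration only: the study does not state which BLEU implementation, tokenizer, or smoothing it used, so the whitespace tokenization, the function names, and the sample sentences are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Single-reference BLEU with uniform weights w_n = 1/N and no smoothing."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
        log_precision_sum += (1.0 / max_n) * math.log(matched / total)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)


if __name__ == "__main__":
    # Made-up sentences for illustration only.
    reference = "store frozen goods at -18 C".split()
    candidate = "store at -18 C".split()
    print(round(bleu(candidate, reference, max_n=2), 4))
```

In the usage example the unigram precision is 4/4 and the bigram precision is 2/3, so the short candidate is penalized both for the missing bigram and, through the brevity penalty, for being two tokens shorter than the reference.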
ROUGE-L. ROUGE-L is based on the Longest Common Subsequence (LCS), emphasizing semantic coherence and word order retention between sentences. It is suitable for evaluating the degree of semantic alignment in open-domain question-answering tasks. Its advantage lies in capturing semantic spans and structural linkages within text, making it especially applicable to response tasks characterized by high semantic diversity.
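ROUGE-L separately computes the recall, precision, and F1-score of the LCS:

$$R_{lcs} = \frac{LCS(X, Y)}{|X|}, \qquad P_{lcs} = \frac{LCS(X, Y)}{|Y|}, \qquad F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

The symbols are defined as follows:
$X$: Text sequence of the reference answer, of length $|X|$.
$Y$: Text sequence of the generated response, of length $|Y|$.
$LCS(X, Y)$: Length of the Longest Common Subsequence between sequences $X$ and $Y$.
$\beta$: Weight adjustment parameter used to balance the relative importance of precision and recall; in this study, $\beta = 1$ is adopted, which corresponds to the F1-score named above.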
The ROUGE-L metric provides a means to evaluate sentence structure and semantic preservation, particularly suitable for assessing the semantic completeness of responses generated by open-domain question-answering systems. In comparison, BLEU rewards exact n-gram matches and therefore penalizes paraphrasing and flexible wording more heavily.
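The ROUGE-L quantities above can be computed with a standard dynamic-programming LCS. The sketch below is illustrative rather than the study's evaluation code: whitespace tokenization, the function names, the default of `beta = 1.0` (chosen here because the text refers to an F1-score), and the sample sentences are assumptions.

```python
from typing import List, Tuple


def lcs_length(x: List[str], y: List[str]) -> int:
    """Length of the longest common subsequence, keeping one DP row at a time."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: List[str], candidate: List[str],
            beta: float = 1.0) -> Tuple[float, float, float]:
    """Return (recall, precision, F-measure) based on the LCS."""
    if not reference or not candidate:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(reference)     # LCS relative to the reference length
    precision = lcs / len(candidate)  # LCS relative to the candidate length
    f_measure = (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
    return recall, precision, f_measure


if __name__ == "__main__":
    # Made-up sentences for illustration only.
    reference = "check the latest complaint records".split()
    candidate = "check latest records now".split()
    print(rouge_l(reference, candidate))
```

Here the LCS is "check latest records" (length 3), giving recall 3/5, precision 3/4, and an F1 of about 0.667. Unlike BLEU, the matched words do not need to be contiguous, only in the same order.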
4.3. Experimental Settings and System Variants
Following the definition of the evaluation metrics above, the experiment compares three system variants under the same query set to ensure a fair assessment of architectural contributions. Specifically, each variant receives identical user inputs, and differences in output quality are attributed to the presence (or absence) of retrieval and knowledge-graph reasoning:
LLM-only: The baseline setting that directly feeds the query to the language model without any external retrieval.
RAG-only: Adds semantic vector retrieval over enterprise documents (e.g., PDFs, spreadsheets, and conversation logs). Retrieved passages are concatenated with the query as context for generation.
RAG+Knowledge Graph (proposed): Extends RAG-only by additionally retrieving relevant entities, relations, and paths from the knowledge graph (and prioritizing curated QA entries when applicable), enabling richer cross-document integration and rule-consistent reasoning.
By holding the query set and scoring protocol constant, the experiment examines whether the proposed hybrid design yields measurable gains in accuracy, semantic coherence, contextual consistency, and practical applicability in enterprise workflows.
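The controlled comparison can be pictured as a small harness in which only the context passed to the generator changes between variants. This is a hedged sketch: `assemble_context`, `run_experiment`, the variant labels, and the callable signatures are hypothetical, and the retrieval and generation functions are stand-ins supplied by the caller rather than the system's actual components.

```python
from typing import Callable, Dict, List, Sequence

Retriever = Callable[[str], List[str]]   # query -> list of text snippets
Generator = Callable[[str, str], str]    # (query, context) -> answer

VARIANTS = ("llm_only", "rag_only", "rag_kg")


def assemble_context(variant: str, query: str,
                     vector_search: Retriever, graph_search: Retriever) -> str:
    """Return the context string that a given variant passes to the generator."""
    if variant == "llm_only":
        return ""  # baseline: no external retrieval
    passages = vector_search(query)
    if variant == "rag_only":
        return "\n".join(passages)
    if variant == "rag_kg":
        # Proposed variant: vector passages plus graph entities, relations, and paths.
        return "\n".join(passages + graph_search(query))
    raise ValueError(f"unknown variant: {variant}")


def run_experiment(queries: Sequence[str], generate: Generator,
                   vector_search: Retriever, graph_search: Retriever) -> Dict[str, List[str]]:
    """Run every variant on the same queries so only the context differs."""
    return {
        variant: [
            generate(q, assemble_context(variant, q, vector_search, graph_search))
            for q in queries
        ]
        for variant in VARIANTS
    }


if __name__ == "__main__":
    # Stub components for illustration only.
    answers = run_experiment(
        ["placeholder query"],
        generate=lambda q, ctx: f"answer using {len(ctx)} context characters",
        vector_search=lambda q: ["passage A", "passage B"],
        graph_search=lambda q: ["Entity -> relation -> Entity"],
    )
    print(answers)
```

Keeping the generator and the query list fixed across the loop is what allows differences in the scored outputs to be attributed to retrieval and graph augmentation.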
4.4. Test Data and Query Sources
To meet the objective of practical validation, the test queries adopted in this study were sourced from actual customer service records and discussion topics accumulated during operational processes within an import trading company. These queries reflect real business scenarios, characterized by high complexity and cross-document dependencies. The data sources include:
Monthly business meeting records (e.g., February 2024 and February 2025 meetings)
Quality assurance and warehouse operation process documents (SOPs, inbound inspection guidelines)
Customer complaint reports and resolution records (including LINE group conversations)
Business inquiry records related to shipment and pricing
Compiled frequently asked questions provided by department heads
After consolidation, a total of 101 representative queries were compiled. To ensure the credibility of the empirical results, the reference answers (ground truths) for these queries were manually extracted by the research team from verified historical enterprise resolutions, such as customer complaint settlement records and warehouse SOPs. These reference answers were then cross-validated by internal department heads to ensure absolute factual accuracy.
Furthermore, to prevent both “parametric leakage” (structurally impossible in our zero-shot RAG design, in which no model parameters are updated) and “lexical leakage” (superficial keyword matching), the raw historical queries were heavily rewritten. The researchers paraphrased the queries using terminology and syntactic structures different from those of the underlying retrieval corpus, forcing the system to rely on deep semantic understanding and multi-hop graph reasoning rather than simple string matching.
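One simple way to quantify lexical leakage is to measure token overlap between each rewritten query and the corpus passages. The study does not describe such a check, so the sketch below is a possible diagnostic and not the authors' procedure: Jaccard similarity is a technique substituted here for illustration, and the function names and sample strings are hypothetical. The regular-expression tokenizer also assumes whitespace-delimited text; other languages would need a proper word segmenter.

```python
import re
from typing import Iterable, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens (assumes whitespace-delimited text)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def max_overlap(query: str, corpus: Iterable[str]) -> float:
    """Highest token overlap between a query and any corpus passage."""
    return max((jaccard(query, passage) for passage in corpus), default=0.0)


if __name__ == "__main__":
    # Made-up strings for illustration only.
    corpus = ["price of apple dice", "inbound inspection checklist"]
    print(max_overlap("apple dice price", corpus))
    print(max_overlap("cost gap between cubed and chunked fruit", corpus))
```

A heavily paraphrased query should score low against every passage, while a query copied nearly verbatim from a document scores close to 1 and could be flagged for further rewriting.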
The content of these 101 queries spans six major categories, as shown in Table 5:
Quotation and pricing inquiries
Timeliness of information
Clarity of responsibility in handling
Customer complaint handling procedures
Document verification (e.g., COA, inspection reports)
Conceptual ambiguity detection (e.g., terminology, specifications)
All questions were rewritten by the researchers based on enterprise workflow semantics into self-contained queries. To strictly avoid data leakage, the 101 test queries were kept completely independent of the QA dataset used for retrieval (described in Section 3.3). Crucially, the QA dataset serves solely as a static retrieval corpus (external contextual evidence) and is not used as a training set for model fine-tuning. No Large Language Model parameters were updated during this study, which rules out structural or parametric information leakage. We ensured that the evaluation queries tested novel scenarios or combinations of facts that were not replicated verbatim in the retrieval corpus, verifying graph independence. Each query was cross-validated against relevant documents, QA datasets, and knowledge graph nodes to ensure the presence of clear standard answers and verifiable references. This methodological “overlap” is necessary in RAG evaluation to measure the system’s retrieval accuracy and synthesis reliability over a fragmented corporate database, rather than its ability to handle unanswerable queries. To respect enterprise confidentiality, identifying product and personnel details within the queries have been anonymized.
4.5. Experimental Results
A total of 101 representative enterprise queries were tested in this experiment, reflecting real-world operational scenarios and information needs within the company. Each system’s response was evaluated by a GPT-based evaluator using four subjective scoring dimensions: semantic accuracy, contextual consistency, practical applicability, and fluency. The maximum total score was 40 points. In addition, BLEU and ROUGE-L metrics were employed to quantitatively assess the similarity between generated responses and reference answers.
The overall score statistics are summarized in Table 6:
Overall scores and improvement margins. As shown in Figure 3, the performance of the three systems based on GPT scores is as follows:
LLM Only (Large Language Model Only): Average score was 28.65 points.
RAG Only (Retrieval-Augmented Generation Only): Average score was 31.81 points.
RAG + Knowledge Graph (RAG + KG): Average score was 34.58 points.
Compared to the baseline LLM-only system, the RAG-only configuration achieved an improvement of +11.0%, while the proposed RAG + Knowledge Graph system achieved a substantially higher improvement of +20.7%. These results provide a direct answer to RQ1, demonstrating that the integration of explicit relational constraints from Knowledge Graphs with unstructured RAG retrieval substantially enhances semantic reasoning and contextual augmentation, enabling measurable improvements in response quality across documents and contexts in enterprise-level tasks (as illustrated in Figure 4).
Winning query count statistics. Among the 101 tested enterprise queries:
RAG + Knowledge Graph system achieved the best responses in 75 queries (74.3%)
RAG-only system achieved the best responses in 15 queries (14.9%)
LLM-only system achieved the best responses in 11 queries (10.9%)
These results indicate that the hybrid semantic response architecture, which combines knowledge graphs with vector-based retrieval, delivers the highest-quality responses across a majority of tasks, demonstrating robust and superior application performance. While formal ablation experiments stripping individual relationship edges were not conducted, the large performance gap between the vector-only baseline (RAG-only) and the complete KG-augmented integration suggests that the structural relationship edges account for the majority of the reasoning improvements.
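The improvement margins and winning shares reported in this subsection follow directly from the mean scores and win counts. The short script below reproduces that arithmetic using only the figures stated above; the variable and function names are illustrative.

```python
def relative_gain(score: float, baseline: float) -> float:
    """Percentage improvement of a score over a baseline."""
    return (score - baseline) / baseline * 100.0


# Mean GPT scores (out of 40) and winning-query counts as reported in the text.
mean_scores = {"LLM-only": 28.65, "RAG-only": 31.81, "RAG + KG": 34.58}
wins = {"RAG + KG": 75, "RAG-only": 15, "LLM-only": 11}

baseline = mean_scores["LLM-only"]
gains = {
    name: round(relative_gain(score, baseline), 1)
    for name, score in mean_scores.items()
    if name != "LLM-only"
}

total_queries = sum(wins.values())
shares = {name: round(100.0 * count / total_queries, 1) for name, count in wins.items()}

if __name__ == "__main__":
    print(gains)          # relative improvement over LLM-only, in percent
    print(total_queries)  # should equal the number of test queries
    print(shares)         # share of queries won by each system, in percent
```

The three win counts sum to the 101 test queries, and the RAG + KG share of 74.3% sits just under three-quarters of all queries.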
System winning-rate distribution. The winning proportions of each system are further visualized in a pie chart (Figure 5). It clearly shows that the RAG + Knowledge Graph system dominates with nearly three-quarters of the wins (74.3%), whereas the LLM-only and RAG-only systems only prevail in cases involving simpler structure or queries containing explicit keywords. These findings support the effectiveness of the proposed knowledge graph augmentation mechanism in addressing semantic ambiguity and scattered document challenges.
Overall observation across multiple evaluation metrics. Integrating both subjective scores and the number of winning queries, the RAG-only system demonstrates initial improvements through semantic retrieval. However, its responses are often constrained by single text fragments and the quality of retrieved sentences, making it difficult to handle complex enterprise-level queries that span multiple documents and hierarchical layers. In contrast, the RAG + Knowledge Graph system significantly improves information integration depth and contextual coherence by incorporating structured enterprise knowledge and logical reasoning among graph nodes. This enhancement also enables more structured and well-grounded responses during the generation stage, leading to overall performance that clearly surpasses the baseline systems.
4.6. Case Analysis and Qualitative Evaluation of Responses
To complement the aforementioned quantitative experiment results, this section presents selected representative queries for qualitative comparison and analysis across different systems. Through case-by-case examination, we observe each system’s performance in terms of information integration, contextual construction, background reasoning, and operational guidance. These analyses further validate the practical benefits of knowledge graph and semantic retrieval technologies in enterprise-level question-answering scenarios.
RAG-only response: “According to the provided documents, there is no explicit mention of a price difference between apple chunks and apple dices. Therefore, I cannot answer this question.” (Score: 21.1)
RAG + Knowledge Graph response: “There may be a price difference between apple chunks and apple dices due to differences in processing procedures and usage. Apple dices typically require more refined processing, which increases production costs and may reflect in higher unit prices. According to the ‘Frozen Apple Dice Processing Flow’ document, the processing steps for apple dices include peeling, coring, washing, dicing, blanching, or color preservation treatments, all of which contribute to the overall cost.” (Score: 34.4)
Analysis: This case demonstrates that the knowledge graph, by supplementing entity relationships, helps the language model integrate information across documents and generate a response with background explanation where the RAG-only system declined to answer. It also illustrates how the KG-RAG framework transforms fragmented enterprise data into a coherent semantic network, significantly enhancing the practical applicability of the chatbot in complex decision-support scenarios.
RAG-only response: “According to the provided documents, the March meeting report identified three cases of packaging damage involving wild blueberries.” (Score: 30.4)
RAG + Knowledge Graph response: “Based on the March meeting report, three instances of wild blueberry packaging damage were identified. In anomaly report 11403, the cause was determined to be insufficient sealing machine heat. For future deliveries, the heat level of the sealing machine will be increased. According to the company’s standard complaint handling process, you may check the latest records in the complaint management folder or contact the Quality Management Department for updated information. Complaint records usually include the product name, batch number, complaint reason, and resolution status.” (Score: 34.2)
Analysis: Although the retrieved facts are similar, the RAG + Knowledge Graph system demonstrates stronger capabilities in enterprise knowledge articulation through this task-oriented evaluation. Notably, it successfully provides multiple “actionable information points”—specifying the responsible department (“Quality Management Department”), detailing the necessary procedural steps (“check the latest records”), and outlining required document parameters (“product name, batch number”). This qualitative, task-oriented metric underscores the framework’s business value over mere n-gram text overlap, highlighting its strength in constructing a holistic view of enterprise operations and addressing the “practical applicability” component of RQ1.
4.7. Error Analysis and Explanation of System Behavioral Differences
This study observed that while the RAG + Knowledge Graph system dominated overall, there were 26 queries where the baseline systems performed better (15 for RAG-only and 11 for LLM-only). To provide analytical depth, we performed a structured failure category analysis of these cases, as summarized in Table 7.
Qualitative Category Performance Analysis. Beyond the aggregated scores, a qualitative analysis of performance differences across the six query categories (Table 5) reveals that the RAG + Knowledge Graph system significantly excels in categories requiring multi-hop reasoning and structural validation, such as “Conceptual Ambiguity” and “Document Verification.” For instance, disambiguating subtle product processing differences (e.g., Case 1) is inherently assisted by the graph’s ability to maintain clear relational boundaries between entities. Conversely, in the “Timeliness of Information” category, the RAG-only baseline occasionally outperformed the hybrid system. This occurred when the RAG-only approach directly retrieved the most recent (albeit unstructured) text fragment, whereas the KG-based system sometimes prioritized older, more formally structured SOP nodes or adopted a more conservative tone when encountering conflicting evidence, leading the GPT evaluator to favor the more “direct” RAG-only response.
Trade-offs and Limitations. The structured failure analysis reveals a fundamental trade-off introduced by knowledge graph integration: rule adherence vs. conversational flexibility. The KG-augmented system acts as a semantic constraint layer, which is essential for “policy compliance” but can sometimes lead to “over-constrained conservatism.” In cases involving highly generic entities, traversing the knowledge graph occasionally introduced “edge noise”, that is, tangential paths that distracted the generator from the user’s core intent.
Beyond these failure cases, the structural value of the knowledge graph’s relationship edges (as opposed to mere entity-node injection) is a core theoretical contribution of this architecture. While a quantitative ablation stripping edges was not performed due to data retention constraints, qualitative evidence in Section 4.6 underscores their necessity. For instance, the “causative reasoning” displayed in Case 1 required traversing a sequence of processing edges (e.g., peeling → dicing) to justify a price difference. A system relying solely on entity-node retrieval (keyword-based) would retrieve the facts but lack the sequential logic to synthesize an actionable explanation. Similarly, the operational dependencies defined in Section 3.3 (Table 3) rely on directed edges like StoreRule to enforce regulatory compliance (e.g., Cold Storage → FIFO). Future research should prioritize a rigorous quantitative sensitivity analysis of edge density to further isolate the contribution of graph topology from semantic embedding quality.
Furthermore, as explored in RQ2, the human-calibrated GPT evaluation loop plays a critical role in mitigating self-confirmation bias by identifying these failure patterns—such as over-conservatism or outdated rules. To further address these risks, we counterbalance the GPT judge’s potential subjectivity by employing deterministic, objective metrics—BLEU and ROUGE-L—against human-verified reference answers. This hybrid approach ensures that the “hallucination-mitigating” claims of the RAG+KG architecture are grounded in objective factual overlap and verified business logic. Future iterations must focus on optimizing knowledge conflict-resolution strategies and assigning temporal weights to graph nodes to resolve outdated rule conflicts.
5. Conclusions
This study presents an intelligent customer service system tailored for the import trade industry by integrating Retrieval-Augmented Generation (RAG), Knowledge Graph (KG) reasoning, and a GPT-based evaluation mechanism. The proposed framework provides a practical architectural reference for domain-specific knowledge management, effectively consolidating fragmented enterprise data—such as meeting minutes, quality reports, and process specifications—into an indexable semantic network. Key innovations include: (1) a hybrid retrieval approach combining vector-based and graph-based reasoning to support multi-hop logical inference; (2) the construction of specialized QA datasets to improve normative adherence; and (3) a self-reinforcing feedback loop that utilizes a GPT-driven evaluator to identify and retain high-quality responses for continuous knowledge base optimization. Experimental results involving 101 real-world queries demonstrate that the system substantially outperforms baseline vector-retrieval models in semantic accuracy, contextual consistency, and fluency, particularly for complex regulatory queries.
Research contributions of this work include validating the feasibility of structured knowledge reasoning for cross-departmental data governance and demonstrating a scalable, modular architecture for enterprise decision support. While this study confirms the effectiveness of the proposed framework within the import trade domain, it remains an industry-specific case study. A primary limitation is the lack of cross-industry generalization testing; due to strict industrial NDAs and the sensitivity of internal operational data, acquiring equivalent high-precision datasets from other trading firms or sectors was not feasible within the scope of this research. Future work is required to validate its transferability and applicability across diverse larger-scale enterprise domains, potentially leveraging emerging open-source enterprise corpora to benchmark cross-sector performance. By transforming unstructured data into a traceable and version-controlled knowledge structure, the system enhances both retrieval precision and operational reliability. Furthermore, translating this architecture into a production environment necessitates strict operational governance: enterprises must maintain detailed audit logs of generated responses, implement risk controls for hallucination under missing data, and continuously manage the caching threshold to avert the retention of plausible but incorrect outputs.
Future work will focus on four primary dimensions to overcome current limitations and enhance industrial applicability:
1. Graph Automation and Scalability: Currently, knowledge graph construction (encompassing 11,253 entity nodes and 8986 semantic relationship edges) relies heavily on manual human-in-the-loop supervision (20% curation) to ensure logical correctness and enterprise policy compliance. This reliance on manual labor represents a significant bottleneck for cross-industry scaling. Future iterations will explore the integration of few-shot learning, active learning frameworks (to prioritize low-confidence extractions for human review), and the development of specialized automated relation extraction models to substantially increase automated extraction proportions and reduce operational costs.
2. Human-in-the-Loop A/B Testing: A secondary limitation is the absence of task-oriented, blind A/B testing with actual business personnel. While the GPT-based evaluation provides a scalable proxy for usability, future work prior to full production deployment will involve A/B testing with frontline personnel to empirically measure the reduction in ticket resolution time and validate the system’s real-world usability.
3. Cross-Enterprise Validation: Evaluating the system using data from disparate trading companies to validate the framework’s transferability across different business scales and industries.
4. System Integration and Governance: Extending the architecture to integrate deeply with internal ERP and WMS platforms via Function Calling. This evolution towards real-time data access and on-premise governance will ensure robust data security while providing critical decision support for inventory tracking, quotation generation, and intelligent business operations.