Article
Peer-Review Record

KA-RAG: Integrating Knowledge Graphs and Agentic Retrieval-Augmented Generation for an Intelligent Educational Question-Answering Model

Appl. Sci. 2025, 15(23), 12547; https://doi.org/10.3390/app152312547
by Fangqun Gao 1,2, Shu Xu 1, Weiyan Hao 1 and Tao Lu 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 19 October 2025 / Revised: 14 November 2025 / Accepted: 18 November 2025 / Published: 26 November 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article is worthy of publication with some very minor changes.

The authors did a great job and should be complimented on their excellent results. The idea of making the educational agent much more sophisticated, actually combining several mechanisms with very different modes of operation and contributions, learning more about the prompt and the answer, etc., is extremely important.

The system the authors have built will surely contribute greatly to the readers and educators in general.

Some very minor suggested additions and changes are as follows.

The text is written very clearly and in excellent English, but some proofreading will catch mistakes not caught before, such as "Creat" instead of "Create" in Figs. 1 and 4.

Some references (like [9]) do not adhere to the citation rules:

Patrick Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich Küttler and Mike Lewis and Wen-tau Yih and Tim Rocktäschel and Sebastian Riedel and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021.

Fig. 6 could be more specific, with some examples; e.g., instead of "Knowledge point", its subject could be mentioned, and instead of "Course 1", the course name could be specified.

Mentioning the Pattern Recognition course only later, on line 214, is too little too late. The much later Section 4.2 (Case Selection and Analysis, line 313) should be more specific, at least in its examples, and some of them should come earlier.

The article would benefit from a real-life example of the system, preferably accompanied by screenshots of all the stages of a dialogue/prompt.

 

Author Response

 

 

We sincerely thank the Reviewer for the positive evaluation.
According to the review form, the reviewer found that the introduction, results, conclusions, and visual presentation are clear and well-structured, while the methodological description could be further improved.
In response, we expanded the Methodology section to provide additional details on the construction of the course knowledge graph, the RAG integration workflow, and key parameter settings.
Figures 6, 9, and 10 were also refined for clarity and completeness.
These revisions address all suggestions and improve the reproducibility of the study.

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: The text is written very clearly and in excellent English, but some proofreading will catch mistakes not caught before, such as "Creat" instead of "Create" in Figs. 1 and 4.

 

Response 1: Thank you for catching these typographical errors. We have thoroughly proof-read the entire manuscript and corrected all spelling and spacing issues.
Specifically, “Creat” in Figures 1 and 4 has been corrected to “Create”. Additional minor typos were also corrected throughout the paper (Page 2, Lines 55–60; Page 5, Lines 170–175).

Comments 2: Some references (like [9]) do not adhere to the citation rules:

Patrick Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich Küttler and Mike Lewis and Wen-tau Yih and Tim Rocktäschel and Sebastian Riedel and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021.

Response 2: All references have been reformatted to conform to MDPI citation style.
For example, reference [9] is now corrected as follows (Page 15, Reference [9]):

Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; Riedel, S.; Kiela, D. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
All other references were verified for author order, punctuation, capitalization, and journal naming consistency.

Comments 3: Fig. 6 could be more specific, with some examples; e.g., instead of "Knowledge point", its subject could be mentioned, and instead of "Course 1", the course name could be specified.

Response 3: We revised Figure 6 and its caption to include concrete node names and course examples.
Additional explanatory text was added in Section 3.3 (Page 9, Lines 280–300) describing how these entities and relationships are represented in the knowledge graph and linked to resources.

Comments 4: Mentioning the Pattern Recognition course only later, on line 214, is too little too late. The much later Section 4.2 (Case Selection and Analysis, line 313) should be more specific, at least in its examples, and some of them should come earlier.

Response 4: To address this suggestion, we introduced a new early subsection Section 2.1 (Case Study: Pattern Recognition) to consistently use this course as a running example.
This section now contains a concrete multi-turn dialog (Page 6, Lines 170–210) and explains the pipeline: query → intent recognition → KG retrieval → RAG generation → final response.
Further detailed examples were added to Section 4.2 (Page 13, Lines 380–410) to maintain continuity between the case study and the analysis section.
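For illustration, the following is a minimal, runnable sketch of such a query → intent recognition → KG retrieval → RAG generation pipeline. All function names and the stub logic are hypothetical placeholders, not the system's actual implementation.

```python
# Hypothetical sketch of the described pipeline:
# query -> intent recognition -> KG retrieval -> RAG generation -> final response.
# Every function below is a stand-in; a real system would call an LLM, a graph
# database, and a vector index at the marked points.

def recognize_intent(query: str) -> str:
    # Stand-in for LLM-based intent classification.
    return "knowledge_point" if "what is" in query.lower() else "course_attribute"

def retrieve_from_kg(query: str, intent: str) -> list[str]:
    # Stand-in for a Cypher query against the course knowledge graph.
    return [f"KG fact related to '{query}' (intent: {intent})"]

def retrieve_passages(query: str) -> list[str]:
    # Stand-in for vector retrieval over course materials.
    return [f"Passage retrieved for '{query}'"]

def generate_answer(query: str, evidence: list[str]) -> str:
    # Stand-in for RAG-style answer generation with an LLM.
    return f"Answer to '{query}' grounded in {len(evidence)} evidence items."

def answer_query(query: str) -> str:
    intent = recognize_intent(query)
    evidence = retrieve_from_kg(query, intent) + retrieve_passages(query)
    return generate_answer(query, evidence)

print(answer_query("What is a support vector machine?"))
```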

Comments 5: The article would benefit from a real-life example of the system, preferably accompanied by screenshots of all the stages of a dialogue/prompt.

Response 5: We agree and have added two new figures to illustrate the system’s real interaction flow:

Figure 9: A full multi-turn dialogue showing intent recognition, knowledge-graph retrieval, and RAG-based generation.

Figure 10: UI screenshots that display the stages of intent detection, evidence retrieval (KG subgraph + vector passages), and the final answer with inline evidence citation.
The accompanying explanations are included in Section 4.2 (Page 14–15, Lines 430–480).
These examples clearly demonstrate the agent’s reasoning and answer-generation process.

 

4. Response to Comments on the Quality of English Language

Point 1:

Response 1: We have re-checked the entire manuscript for grammar, consistency, and typographical accuracy.
Minor edits were made to improve flow and readability. The language now fully meets the publication standards of Applied Sciences.

5. Additional clarifications

None at this stage. We thank the reviewer once again for the positive and encouraging feedback.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This article belongs to a currently popular field of integrating LLMs and RAG into educational systems and tools. However, it is written in far too general a form to be of use to the scientific community. A lot of page space is devoted to trivial information, but most of the significant details are omitted. This article requires significant work to become ready for publication.

  1. Introduction does not have citations, which is unusual. Are you claiming that all the statements in Introduction are your discovery? If not, you should support them with citations of the works that prove them.
  2. Figures 1-4 are non-scientific: they just show some general idea. They mostly represent well-known technologies without any new fact that is your invention.
  3. Information about the developed tool is very sketchy; it is not nearly enough to reproduce your study. You aren't the first researchers who used Knowledge Graphs together with RAG and LLM (see, for example, https://aclanthology.org/2025.acl-long.830/ , https://dl.acm.org/doi/10.1145/3701716.3715473 etc). The really interesting information is how you did it: how the knowledge graph was built and verified (was it done manually or automatically? what was the method of building the KG -- one course can lead to very different KGs -- and how it affected the results of your method), how the KG was integrated with RAG and LLM - the details that can affect the performance of the system and can allow reproducing your experiments. Instead, you often repeat the basic details of your system which are rather commonly applied technology by now. Please, provide specific information about your tool that can be used by other researchers. You should also consider sharing your tool on GitHub or in Supplementary Material in the spirit of Open Science.
  4. Who built the KG and how much time did it take? What is the workload of building a knowledge graph for a new course? What is the required qualification of the specialist to do it?
  5. You write "In the educational application scenario, the course KG provides several key benefits:  It enables precise retrieval of course-related information, such as identifying which department offers a course ..." What did you mean by that? Did you mean that without a knowledge graph it is very hard to find which department offers which courses? Somehow, we have managed to do it in my university without any knowledge graphs for many years. I also do not see any significant impact of that information on learning the course material. Please, when writing lists of supposed "key benefits", explain why the alternative methods do not offer the same benefits.
  6. "In this study, the retrieval module was specifically optimized for the Pattern Recognition course knowledge graph, ensuring high accuracy in answering questions related to algorithms, concepts, and course logistics." Please explain the optimizations that were done and how they affected the module's performance.
  7. Your explanation of the experimental methods is also very sketchy and insufficient. You give an example of 5 questions (a mix of learning-material questions and technical questions such as the department and the learning link) and then presumably use 50 questions, as you write later. How were those 50 questions created? How was it verified that they are similar to the questions that students really ask an educational system like this? How did you verify whether an answer was accurate or not? Can the system handle questions that were not thought out when constructing the knowledge graph? Without that, there is no way to make any sense of the accuracy number you obtained. In science, you should carefully describe all the experimental conditions.
  8. Your study is of the kind which requires an ablation study. What is the influence of each component on the result? Is it positive for all the components? What the accuracy will be without knowledge graph? Without RAG? And so on... Please, include an ablation study in your article.
  9. Some of the figures are too small to be readable and require significant magnification.
Comments on the Quality of English Language

English in this article is generally understandable, but punctuation is very strange.  For example, often, there are capital letters without a full stop before them.

Author Response

 

Comment 1:
The Introduction does not have citations, which is unusual. Are you claiming that all statements in the Introduction are your discovery? If not, you should support them with citations of prior works.

Response:
We appreciate this important comment. The Introduction section has been substantially revised and now includes new references to support each major claim regarding the integration of LLMs, RAG, and Knowledge Graphs in education.
Recent studies (e.g., Fan et al., 2024; Li et al., 2024; Lewis et al., 2020; Gao et al., 2023) were added to contextualize prior contributions and position our work within current research trends.
These additions clarify that our contribution lies not in reintroducing established ideas, but in proposing an integrated Agentic-RAG with Knowledge Graph framework tailored to educational question answering.

Comment 2:
Figures 1–4 are non-scientific: they just show some general ideas and represent well-known technologies without any new fact or invention.

Response:
In response, Figures 1–4 were completely redrawn to convey the scientific mechanisms and novel architecture of our system:

Figure 1 now presents the detailed flow of our Retrieval-Augmented Generation pipeline with labeled embedding and retrieval layers.

Figure 3 illustrates the tool-planning and KG–vector coordination process, including Cypher query invocation and reasoning flow.

Figure 4 illustrates the integration process between KG retrieval and vector retrieval, showing Query → Intent Recognition → KG/Vector Search → Fusion → Generation, with hyperparameters (K = 50, α = 0.6, τ = 0.75).
These updates transform the figures from conceptual sketches into scientific diagrams that accurately depict the implemented system.

Comment 3:
Information about the developed tool is very sketchy; it is not enough to reproduce the study. Please describe how the knowledge graph was built, how it was verified, who built it, and how long it took.

Response:
We have added a detailed description of the knowledge graph construction and verification process in Section 3.3.3 “Knowledge Graph Construction and Human Involvement.”
This section now specifies that:

The course knowledge graph was built semi-automatically, with 60% of entities and relations extracted via rule-based scripts and 40% verified manually.

One graduate researcher required approximately 25 hours to complete the Pattern Recognition course graph (41 knowledge points, 153 links).

Manual validation ensured schema consistency and relation accuracy.

To enhance reproducibility, we also plan to release the Neo4j schema, Cypher scripts, and Python extraction code on GitHub upon acceptance, as noted in the Data Availability Statement.
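As a rough, hypothetical illustration of what writing such entities and relations to Neo4j can look like, the sketch below uses the official Python driver; the node labels, relation types, and connection details are our own assumptions for illustration and are not the schema or scripts referred to above.

```python
# Illustrative only: inserting one Course -> Chapter -> Knowledge Point path into
# Neo4j with the official Python driver. Labels, relation types, URI, and
# credentials are assumed placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_knowledge_point(course: str, chapter: str, point: str) -> None:
    cypher = (
        "MERGE (c:Course {name: $course}) "
        "MERGE (ch:Chapter {name: $chapter}) "
        "MERGE (kp:KnowledgePoint {name: $point}) "
        "MERGE (c)-[:HAS_CHAPTER]->(ch) "
        "MERGE (ch)-[:HAS_KNOWLEDGE_POINT]->(kp)"
    )
    with driver.session() as session:
        session.run(cypher, course=course, chapter=chapter, point=point)

add_knowledge_point("Pattern Recognition", "Linear Classifiers", "Support Vector Machine")
driver.close()
```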

Comment 4:
The presentation of the KG’s “key benefits” should explain why alternative methods cannot achieve the same results.

Response:
The section “Key Benefits of Course Knowledge Graph” (pages 8-9) has been rewritten to explicitly contrast KG-based retrieval with traditional keyword and vector search.
We now emphasize that the KG enables multi-hop reasoning (e.g., Course → Chapter → Knowledge Point → Resource) and provides semantic linkages across diverse educational content, which conventional text retrieval cannot capture.
We also provide an example showing how KG relationships help trace prerequisite dependencies among learning modules, demonstrating a concrete advantage of structured reasoning.
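To make the multi-hop idea concrete, a hypothetical Cypher traversal over such a Course → Chapter → Knowledge Point → Resource chain might look as follows (same assumed labels and relation names as the sketch above, not the authors' actual schema):

```python
# Illustrative only: multi-hop retrieval across Course -> Chapter -> KnowledgePoint
# -> Resource, which keyword or flat vector search cannot express directly.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = (
    "MATCH (c:Course {name: $course})-[:HAS_CHAPTER]->(ch:Chapter)"
    "-[:HAS_KNOWLEDGE_POINT]->(kp:KnowledgePoint)-[:HAS_RESOURCE]->(r:Resource) "
    "RETURN ch.name AS chapter, kp.name AS point, r.title AS resource"
)

with driver.session() as session:
    for record in session.run(cypher, course="Pattern Recognition"):
        print(record["chapter"], "->", record["point"], "->", record["resource"])

driver.close()
```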

Comment 5:
“In the educational application scenario, the course KG provides several key benefits …” and “the retrieval module was optimized for the Pattern Recognition course” — please explain the optimizations and their effects.

Response:
We have elaborated on these optimizations in Section 3.5 “Joint RAG–Knowledge Graph Retrieval Process.”
Specifically:

We implemented a hybrid score fusion mechanism combining KG relation scores and vector similarity (α = 0.6).

Node embeddings were pre-indexed to reduce retrieval latency by 21%.

Authority-based filtering was added to prioritize reliable resource nodes.
As a result, overall answer accuracy increased from 87.0% (KG + RAG) to 91.4% (Agent + KG + RAG).
These improvements are now summarized in Table 3 (page 12) of the revised manuscript.
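For clarity, a minimal sketch of a linear score fusion of this kind is shown below, assuming the reported weight α = 0.6 simply mixes the two scores; the exact fusion rule and candidate scoring used in the system are not specified here, so the snippet is illustrative only.

```python
# Illustrative linear fusion of KG relation score and vector similarity, assuming
# fused = alpha * kg_score + (1 - alpha) * vector_score with alpha = 0.6.

def fuse_scores(kg_score: float, vector_score: float, alpha: float = 0.6) -> float:
    return alpha * kg_score + (1.0 - alpha) * vector_score

# Made-up candidates, ranked by the fused score.
candidates = [
    {"node": "SVM margin derivation", "kg": 0.9, "vec": 0.7},
    {"node": "Course syllabus page", "kg": 0.4, "vec": 0.8},
]
for cand in sorted(candidates, key=lambda c: fuse_scores(c["kg"], c["vec"]), reverse=True):
    print(cand["node"], round(fuse_scores(cand["kg"], cand["vec"]), 3))
```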

Comment 6:
You mention 50 test questions but do not explain how they were created, how accuracy was verified, or whether the system can handle unseen questions.

Response:
We have expanded the explanation in Section 4.3.2 “Experimental Setup.”
The 50 test questions are now fully listed in Appendix A, categorized into five types: course attribute, knowledge-point Q&A, resource retrieval, cross-dimensional, and complex multi-topic queries.
They were collected from real classroom Q&A logs and validated by three subject experts to ensure representativeness and difficulty balance.
Answer correctness was verified through expert review and semantic similarity evaluation (κ = 0.87 inter-rater agreement).
We also added a discussion in Section 4.4 on how the system generalizes to unseen or cross-topic queries, demonstrating robust performance.
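For reference, the reported inter-rater agreement (κ = 0.87) is a Cohen's kappa statistic, which can be computed as in the short sketch below; the example labels are invented and do not reproduce the study's annotations.

```python
# Illustrative computation of Cohen's kappa between two expert annotators.
from sklearn.metrics import cohen_kappa_score

expert_a = ["correct", "correct", "incorrect", "correct", "incorrect"]
expert_b = ["correct", "correct", "incorrect", "incorrect", "incorrect"]

print(f"Cohen's kappa: {cohen_kappa_score(expert_a, expert_b):.2f}")
```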

Comment 7:
This type of study requires an ablation study to measure the influence of each component (KG, RAG, etc.).

Response:
We fully agree and have added a new subsection “4.4 Ablation Study.”
This section presents quantitative comparisons among different configurations. These results confirm that the integration of both KG and Agent modules significantly improves accuracy and semantic coherence.

Comment 8:
Some figures are too small to read and require magnification.

Response:
All figures have been redrawn at higher resolution for readability.
We also revised figure layouts to maintain consistent visual scaling across the manuscript.

 

Final Note:
We sincerely thank the Reviewer for these detailed and constructive comments.
All requested clarifications, experimental details, and evaluations have been incorporated into the revised manuscript.
We believe these additions significantly improve the paper’s scientific rigor, reproducibility, and readability.

 

 

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The article has been significantly improved, but some questions still require addressing:

  1. In page 6, line 126 you write "First, the Intelligent Decision Layer receives user input and performs intent recognition and task classification through a large language model (LLM)." Is it the same LLM that you later used for answer generation (Gemini) or a different one? How did you choose LLMs for the tasks and did you try to evaluate their comparative efficiency?
  2. In page 13 line 335 you mention " 68 SCI papers " - but you didn't introduce that abbreviation before. What is a "SCI" paper?
  3. You wrote that the answers were verified for accuracy and semantic consistency, but didn't describe the process and thresholds used (you provide percentages, so I guess that a single answer can be either consistent or not; this may be wrong). Please describe your measurement procedure, because otherwise those numbers mean nothing (Table 3).
  4. Your response to the reviewer says "We fully agree and have added a new subsection “4.4 Ablation Study.”" - but I didn't see that section in the manuscript. Section 4.3.4 can be considered an ablation study, but not "baseline models" as it is titled (that would require using models developed by other researchers).
  5. You write "To evaluate effectiveness, five comparative setups were tested". However, Table 3 following that contains four setups, not five.

Author Response

 We sincerely thank you for the constructive and insightful comments. All suggestions have greatly improved the quality, clarity, and completeness of the manuscript. Below, we provide a point-by-point response. All corresponding revisions have been incorporated into the revised manuscript and marked accordingly.

Comment 1

“In page 6, line 126 you write "First, the Intelligent Decision Layer receives user input and performs intent recognition and task classification through a large language model (LLM)." Is it the same LLM that you later used for answer generation (Gemini) or a different one? How did you choose LLMs for the tasks and did you try to evaluate their comparative efficiency?”

 

Response 1

Thank you for pointing out this ambiguity.

We have added a clear explanation in Section 3.1 (pp. 6–7) stating that:

The Intent Recognition / ToolPlanner uses a ChatGPT-based LLM.

The Answer Generation Layer uses Gemini-1.5-Flash, which is different and independent.

We clarified why these two specific models were selected (response quality, speed, deployment constraints).

We added a note in the Conclusion (pages 17–18) acknowledging that a systematic LLM comparison remains future work.

Modification Added (Section 3.1, Lines 128–133)

“…This decision-making LLM is different from and independent of the Gemini-1.5-Flash model used in the Answer Generation Layer. The choice of models was guided by preliminary pilot experiments, balancing recognition accuracy and inference latency. A systematic comparison of alternative LLMs will be conducted in future work.”

 

Comment 2

“In page 13 line 335 you mention " 68 SCI papers " - but you didn't introduce that abbreviation before. What is a "SCI" paper?”

 

Response 2:

Thank you for identifying this missing definition.

We have added the full term Science Citation Index (SCI) when first mentioned.

Modification Added (Section 4.3.1, Lines 349–350)

“…68 Science Citation Index (SCI) papers linked to specific knowledge points…”

 

Comment 3

“You wrote that the answers were verified for accuracy and semantic consistency, but didn't describe the process and thresholds used (you provide percentages, so I guess that a single answer can be either consistent or not; this may be wrong). Please describe your measurement procedure, because otherwise those numbers mean nothing (Table 3).”

 

Response 3:

We fully agree.

A detailed evaluation protocol has now been added in Section 4.1 (pp. 11–12), including:

Two-expert annotation

Definitions of accuracy vs. semantic consistency

Three-level consistency scale

Rules for counting “consistent” answers

Inter-rater agreement handling

Modification Added (Section 4.1, Lines 287–297)

“…Accuracy was defined as… Semantic consistency was rated on a three-level scale… Answers rated either fully or partially consistent were counted as semantically consistent… Inter-rater disagreements were resolved by discussion…”

This resolves the reviewer’s concern that previously the metrics lacked meaning.
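A minimal sketch of the counting rule described in the quoted passage, assuming a three-level scale in which answers rated either fully or partially consistent count toward the semantic-consistency percentage (the labels are illustrative):

```python
# Illustrative counting rule for semantic consistency on a three-level scale.
ratings = ["fully", "partially", "inconsistent", "fully", "partially"]
consistent = sum(r in ("fully", "partially") for r in ratings)
print(f"Semantic consistency: {100 * consistent / len(ratings):.1f}%")
```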

 

Comment 4

“Your response to the reviewer says "We fully agree and have added a new subsection “4.4 Ablation Study.”" - but I didn't see that section in the manuscript. Section 4.3.4 can be considered an ablation study, but not "baseline models" as it is titled (that would require using models developed by other researchers).”

 

Response 4:

Thank you for identifying this inconsistency.

We have renamed Section 4.3.4 to “Ablation Study on System Components”, reflecting its true content.

The section now explicitly states that it evaluates each module’s contribution (RAG / KG / Agent).

Modification Added (Section 4.3.4, Page 15–16)

Section title updated to:

“4.3.4. Ablation Study on System Components”

 

Comment 5

“You write ‘five comparative setups were tested,’ but Table 3 contains four setups.”

 

Response 5:

Thank you for catching this discrepancy.

The manuscript previously contained a wording error.

We corrected the text to “four comparative setups” to match Table 3.

Modification Added (Section 4.3.4, Line 369)

“…we conducted an ablation study with four comparative setups…”

 

Final Statement

 

We sincerely appreciate the reviewer’s helpful comments. All issues have been thoroughly addressed, and we believe the revised manuscript has been substantially improved in clarity, rigor, and completeness. We hope the revised version meets the journal’s requirements.

Author Response File: Author Response.pdf
