Article

MDKAG: Retrieval-Augmented Educational QA Powered by a Multimodal Disciplinary Knowledge Graph

School of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9095; https://doi.org/10.3390/app15169095
Submission received: 24 July 2025 / Revised: 10 August 2025 / Accepted: 15 August 2025 / Published: 18 August 2025

Abstract

With the accelerated digital transformation of education, efficiently integrating massive multimodal instructional resources and supporting interactive question answering (QA) remain prominent challenges. This study introduces Multimodal Disciplinary Knowledge-Augmented Generation (MDKAG), a framework integrating retrieval-augmented generation (RAG) with a multimodal disciplinary knowledge graph (MDKG). MDKAG first extracts high-precision entities from digital textbooks, lecture slides, and classroom videos using the Enhanced Representation through Knowledge Integration 3.0 (ERNIE 3.0) model and then links them into a graph that supports fine-grained retrieval. At inference time, the framework retrieves graph-adjacent passages, integrates multimodal data, and feeds them into a large language model (LLM) to generate context-aligned answers. An answer-verification module checks semantic overlap and entity coverage to filter hallucinations and triggers incremental graph updates when new concepts appear. Experiments on three university courses show that MDKAG reduces hallucination rates by up to 23% and increases answer accuracy by 11% over text-only RAG and knowledge-augmented generation (KAG) baselines, demonstrating strong adaptability across subject domains. The results indicate that MDKAG offers an effective route to scalable knowledge organization and reliable interactive QA in education.

1. Introduction

With the rapid progress of science and technology, empowering traditional education through cutting-edge digital techniques, such as artificial intelligence (AI), knowledge graphs (KGs), and big data, has become an urgent task to achieve high-quality development and accelerate the digital transformation of education in today’s educational landscape [1].
In recent years, the widespread deployment of large language models (LLMs) has revealed a pervasive hallucination problem, whereby the models confidently output incorrect or nonexistent answers, thus constraining their applicability in educational scenarios [2]. He et al. [3] evaluated hallucination rates on 2,000 medical-domain questions and found that, among open-source models, ChatGLM3-6B and Baichuan-13B exhibited hallucination rates of 21.6% and 51.8%, respectively, with the latter being attributable to the limited amount of medical data in its training corpus.
Retrieval-Augmented Generation (RAG) [4] effectively enhances the reliability and accuracy of generative models in knowledge-intensive tasks by incorporating external knowledge retrieval mechanisms. Specifically, it first retrieves highly relevant passages from a document collection according to a user query and then feeds these passages, together with the query, into the generator to produce context-aware answers. A typical RAG pipeline includes query encoding, retrieval, context fusion, and answer generation, thereby alleviating knowledge staleness and hallucinations. Cheng [5] compared baseline, fine-tuned, and RAG-enhanced models across diverse question types and showed that RAG delivers the lowest hallucination frequency.
More recently, many studies have employed knowledge graphs (KGs) as retrieval sources, demonstrating superior support for domain-specific QA. For instance, Liang et al. [6] proposed Knowledge-Augmented Generation (KAG), which raised accuracy in governmental and health applications to 80%. However, educational resources are inherently multimodal—text, images, audio, and video [7]—such as textbooks, slide decks, and lecture videos. Effectively organizing such multimodal disciplinary knowledge and transforming it into retrievable content to enhance LLM responses remains challenging.
To tackle these issues, we propose Multimodal Disciplinary Knowledge-Augmented Generation (MDKAG), a framework that transcends single-text limitations by constructing a knowledge-augmented generation model over multimodal data. MDKAG aims to provide precise retrieval and generation support for intelligent tutoring systems and to promote a shift from the fragmented integration of educational resources to intelligent empowerment.
The main contributions of this work are threefold:
1.
We present a general and updatable multimodal disciplinary knowledge-graph construction pipeline. Through multimodal data linkage preprocessing, LLM-driven bottom-up ontology design, and named-entity recognition, the framework automatically extracts concepts and builds the graph;
2.
We design a relevance-first retrieval strategy that leverages graph topology and graph-search capabilities to accurately recall educational resources relevant to a query, thereby improving retrieval efficiency and quality;
3.
We develop an answer-verification mechanism that combines semantic-overlap and entity-coverage checks to ensure the accuracy and reliability of generated answers, and we introduce a graph-update module for dynamic optimization and the expansion of the KG.

2. Related Work

Retrieval-Augmented Generation (RAG) and Knowledge-Augmented Generation (KAG) enhance large language models (LLMs) by injecting external knowledge. RAG/KAG systems are usually divided into two stages:
  • The Data Processing and Indexing Stage: This stage processes input data sources, such as documents or knowledge bases, into representations suited to retrieval. The process begins by segmenting documents into manageable chunks:
    $D \rightarrow \{c_1, c_2, \ldots, c_n\}$.
    Each chunk is then encoded using specialized encoding functions to obtain vector embeddings:
    $d_i = \mathrm{encoder}(c_i)$.
    Finally, the data is structured into easily retrievable formats, such as knowledge graphs, to facilitate efficient access during retrieval.
  • Query Processing and Generation Stage: When a user submits a query q, the system first encodes it into a suitable vector representation:
    $v = \mathrm{encoder}(q)$.
    The system then retrieves the top-k most relevant data points or knowledge components based on similarity metrics:
    $\{d_1, d_2, \ldots, d_k\} = \mathrm{retrieve}(v, \mathrm{Database})$.
    Finally, these retrieved elements are integrated with the original query to facilitate informed response generation from the language model.
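To make the two stages concrete, the following minimal Python sketch implements the indexing and retrieval equations above. The `embed` function is a deterministic placeholder for a real sentence encoder (e.g., bge-large-zh-v1.5, the embedding model used later in this paper); the fixed-size chunking and plain top-k selection are simplifying assumptions, not the exact production pipeline.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for encoder(.): a deterministic pseudo-embedding.
    A real system would call a sentence-embedding model here."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(384)
    return v / np.linalg.norm(v)

# Stage 1: data processing and indexing, D -> {c_1, ..., c_n}, d_i = encoder(c_i)
def build_index(document: str, chunk_size: int = 200) -> list[tuple[str, np.ndarray]]:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return [(c, embed(c)) for c in chunks]

# Stage 2: query processing, v = encoder(q); {d_1, ..., d_k} = retrieve(v, Database)
def retrieve_top_k(query: str, index: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    v = embed(query)
    ranked = sorted(index, key=lambda item: float(item[1] @ v), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]  # passed to the LLM together with the query
```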

2.1. Evolution of Graph-Based RAG Techniques

In recent years, RAG has advanced from a single retrieval mode based on textual chunks to graph-structured associative retrieval. GraphRAG, proposed by Edge et al. [8], was the first method to improve RAG performance via graph-database technology; by means of hierarchical clustering and a structured-summary generation mechanism, it markedly enhanced the interpretability and completeness of answers. LightRAG, introduced by Guo et al. [9], offers a lightweight design for graph-structured RAG: it constructs a graph by extracting RDF quintuplets (subject, predicate, object, time, and location) and, during interactive QA, efficiently recalls relevant passages by analyzing the entities and ontology types contained in the user query. HippoRAG, proposed by Gutiérrez et al. [10], extracts entities with conventional RDF triples (subject, predicate, and object) and builds a knowledge graph based on semantic similarity between entities; at the QA stage, it combines Dense Passage Retrieval (DPR) with Personalized PageRank (PPR) to retrieve supporting passages, effectively blending semantic similarity with graph features and substantially boosting answer accuracy. Focusing on information loss in large-scale long documents, Choubey et al. [11] first proposed SynthKG, a multi-step, document-level, ontology-free knowledge-graph synthesis workflow driven by LLMs, and then introduced Distill-SynthKG, which fine-tunes a smaller LLM on the synthesized graphs to convert documents into a knowledge graph better suited to RAG tasks in a single step. Most recently, Luo et al. [12] presented HyperGraphRAG, the first RAG approach to represent knowledge with hypergraphs, which overcomes the limitation of traditional binary-relation graphs in capturing complex n-ary relations; the method entails n-ary relation extraction via LLMs, hypergraph construction, and a vector-similarity-based hypergraph retrieval strategy.
Entity recognition is a critical step in building these knowledge graphs [13]. Techniques have evolved from rule- and dictionary-based methods [14] to word-embedding models [15] and pretrained models [16], culminating in LLM-based approaches [17]. However, the first three rely heavily on manual annotation and curation, whereas the last still has room for improvement in terms of accuracy.

2.2. RAG Applications in Education

In educational settings, knowledge is usually confined to authoritative textbooks, so ensuring consistency between answers and textbook content is essential. Drori et al. [18] were among the first to construct knowledge bases from worked examples and course notes, directly generate exams using chain-of-thought prompting, and demonstrate that few-shot learning yields the best performance. Zheng et al. [19] noted that in K-12 education, traditional keyword-based retrieval methods retain unique advantages in specificity and high precision; they proposed a hybrid lexical-plus-vector retrieval strategy to combine the strengths of different methods and improve overall system performance. Alier et al. [20] suggested that RAG systems could integrate item banks and knowledge graphs to automatically grade student assignments and generate detailed feedback on correctness, completeness, and logical soundness, thereby helping students monitor and improve their learning. Dong et al. [21] showed that incorporating knowledge graphs into RAG markedly increases answer accuracy and logical coherence, achieving up to a 35% improvement in learning-outcome assessments. Fatehkia et al. [22] proposed Tree-RAG (T-RAG), which organizes disciplinary knowledge in a tree structure.
In summary, RAG technologies for LLMs are still exploratory in the educational domain. At the data level, most studies center on course notes, item banks, and textbook text. At the retrieval-strategy level, the rigor of academic disciplines calls for combining vector retrieval with high-precision lexical retrieval.

2.3. Limitations of Existing Work

Current multimodal, discipline-specific retrieval-augmented systems are still in their infancy within education, and several issues remain.
1.
Graph construction: Current RAG frameworks build knowledge bases or graphs using a “human-defined schema plus LLM-assisted filling” paradigm; priorities differ across subjects, and reliance on LLM filling incurs hallucination risks;
2.
Retrieval reasoning: By uniformly converting queries into vectors for indexing, the explicit semantics of the graph are blurred in the implicit vector space. Results favor similarity over relevance, failing to exploit the symbolic-reasoning advantages of graphs;
3.
Absence of answer-verification mechanisms: Most work focuses on locating supporting passages before generation but seldom checks whether the answers maintain strong alignment with supporting materials after generation.

3. Methods

This section presents the detailed construction workflow of the proposed MDKAG framework, as illustrated in Figure 1. Following the systematic pipeline from data collection to answer verification, MDKAG addresses the limitations of existing RAG/KAG systems through three core strategies: multimodal knowledge fusion and graph construction (Strategy I), relevance-prioritized retrieval (Strategy II), and answer verification and update strategy (Strategy III).

3.1. Strategy I: Multimodal Knowledge Fusion and Graph Construction

Strategy I encompasses the first four stages of the MDKAG pipeline: data collection, preprocessing, ontology design, and knowledge graph construction.

3.1.1. Data Collection

Educational resources encompass structured data from institutional databases; semi-structured web-based content from learning platforms; and unstructured multimedia materials including textbooks, presentations, and video content. MDKAG focuses on textbooks, courseware presentations, and classroom recordings as representative multimodal educational data sources.

3.1.2. Data Preprocessing

Different data types undergo specialized preprocessing to create unified multimodal intermediate files:
Document Processing: PDF and DOC formats undergo OCR conversion with structural reorganization into Markdown format. Mathematical formulas and tables are preserved as scripts, while images are stored locally with reference links:
Document → Text + { ![image_name](storage_path) }.
Presentation Processing: PPT slides are processed page-wise, extracting textual content and creating descriptions for mixed text-image content to form multimodal intermediate files.
Audio-Visual Processing: Video content utilizes ASR technology for timestamp-aligned transcription, with LLM-based processing to remove colloquialisms and create a formal text corpus with source tracking.
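A minimal dispatcher for this preprocessing stage is sketched below. The converter functions are hypothetical stand-ins for an OCR engine, a slide parser, an ASR model, and an LLM cleanup pass; the image-link convention follows the `![image_name](storage_path)` format above.

```python
from pathlib import Path

# Hypothetical converters: in practice these would wrap an OCR engine, a slide
# parser, an ASR model, and an LLM prompt that removes colloquialisms.
def pdf_to_markdown(path: Path) -> str: ...
def ppt_to_pages(path: Path) -> list[str]: ...
def asr_transcribe(path: Path) -> str: ...
def llm_formalize(transcript: str) -> str: ...

def preprocess(source: Path) -> str:
    """Route one raw resource to its modality-specific pipeline and return a
    unified Markdown intermediate file."""
    suffix = source.suffix.lower()
    if suffix in {".pdf", ".doc", ".docx"}:
        # OCR + structural reorganization; images are stored locally and
        # referenced inline as ![image_name](storage_path)
        return pdf_to_markdown(source)
    if suffix in {".ppt", ".pptx"}:
        # page-wise extraction with generated descriptions of mixed slides
        return "\n\n".join(ppt_to_pages(source))
    if suffix in {".mp4", ".mp3", ".wav"}:
        # timestamp-aligned transcription, then LLM cleanup into formal text
        return llm_formalize(asr_transcribe(source))
    raise ValueError(f"Unsupported modality: {suffix}")
```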

3.1.3. Knowledge Graph Ontology Design

Unlike conventional approaches using predefined schemas, MDKAG employs instruction-driven ontology design. Discipline-specific prompts guide LLMs to identify relevant entity categories and generate pre-annotated content:
$\mathrm{Schema}_D = \mathrm{LLM}(\mathrm{instruction}_D, T_{\mathrm{sample}})$,
$\mathrm{Annotations}_T = \mathrm{LLM}(\mathrm{Schema}_D, T_{\mathrm{full}})$.
The system performs hierarchical clustering to reduce computational complexity when entity types exceed 50, creating upper-level ontology categories (e.g., merging “time, period, era” into the “temporal” category).
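The two LLM calls and the type-consolidation step can be sketched as follows; `llm` is a hypothetical chat-completion callable, the prompts paraphrase the discipline-specific instructions, and the upper-level mapping is illustrative.

```python
import json

def design_schema(llm, discipline: str, sample_text: str) -> list[str]:
    # Schema_D = LLM(instruction_D, T_sample): the model proposes entity types
    prompt = (f"You are building a knowledge graph for the discipline "
              f"'{discipline}'. List the entity categories needed to annotate "
              f"this sample as a JSON array of strings.\n\n{sample_text}")
    return json.loads(llm(prompt))

def pre_annotate(llm, schema: list[str], full_text: str) -> list[dict]:
    # Annotations_T = LLM(Schema_D, T_full): pre-annotate the corpus
    prompt = (f"Annotate every mention of the entity types {schema} in the "
              f'text below. Return a JSON array of {{"text": ..., "type": ...}} '
              f"objects.\n\n{full_text}")
    return json.loads(llm(prompt))

# Illustrative upper-level mapping, e.g., "time, period, era" -> "temporal"
UPPER_LEVEL = {"time": "temporal", "period": "temporal", "era": "temporal"}

def consolidate(types: list[str], limit: int = 50) -> list[str]:
    """Merge near-synonymous types into upper-level categories once the
    inventory exceeds the 50-type threshold described above."""
    if len(types) <= limit:
        return types
    return sorted({UPPER_LEVEL.get(t, t) for t in types})
```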

3.1.4. Knowledge Graph Construction

The Transformer-based uie-m-large model performs entity extraction, generating structured outputs with entity types, positions, and confidence scores. Extracted knowledge is stored in graph databases (Neo4j) or relational databases (MySQL), with multimedia attachments managed through MinIO storage systems.
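A plausible sketch of this stage is shown below, using PaddleNLP's Taskflow interface for uie-m-large and the official neo4j driver; the schema labels and the Cypher node layout are illustrative assumptions rather than the exact production configuration.

```python
from paddlenlp import Taskflow          # provides the uie-m-large extractor
from neo4j import GraphDatabase

# Entity schema produced by the ontology-design stage (illustrative labels)
schema = ["temporal", "geographical", "conceptual"]
extractor = Taskflow("information_extraction", schema=schema, model="uie-m-large")

def extract_and_store(text: str, source_id: str, uri: str = "bolt://localhost:7687",
                      auth: tuple[str, str] = ("neo4j", "password")) -> None:
    """Run UIE over one intermediate file and persist entities as graph nodes
    linked to their source passage (storage layout is an assumption)."""
    # One dict per input: {type: [{"text", "start", "end", "probability"}, ...]}
    results = extractor(text)[0]
    with GraphDatabase.driver(uri, auth=auth) as driver, driver.session() as session:
        for etype, mentions in results.items():
            for m in mentions:
                session.run(
                    "MERGE (e:Entity {name: $name, type: $type}) "
                    "MERGE (p:Passage {id: $src}) "
                    "MERGE (e)-[:MENTIONED_IN {confidence: $conf}]->(p)",
                    name=m["text"], type=etype, src=source_id, conf=m["probability"],
                )
```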

3.2. Strategy II: Relevance-Prioritized Retrieval Strategy

Strategy II corresponds to the retrieval strategy stage, introducing entity-centric retrieval that leverages knowledge graph topology rather than relying solely on vector similarity.

3.2.1. Query Entity Extraction and Classification

The system first extracts entities from the user query $q$ using the same entity extraction model, with entities categorized by semantic types:
$E_q = \{e_1, e_2, \ldots, e_n\} = \mathrm{extract}(q)$.
Each entity is classified into semantic types (e.g., temporal, geographical, and conceptual), enabling systematic entity type analysis for subsequent retrieval path construction.

3.2.2. Graph-Based Retrieval with Divide-and-Conquer Strategy

The retrieval process implements a multi-tier approach based on query complexity:
Single Entity Queries: Retrieve all adjacent nodes containing the entity through direct graph traversal:
$R_{\mathrm{single}} = \mathrm{neighbors}(e_i, G)$.
Multiple Entity Queries: Implement a divide-and-conquer approach that constructs precise retrieval combinations based on entity type relationships. For queries with entities of different types, the system constructs targeted retrieval paths:
$\mathrm{Paths} = \{\mathrm{combine}(e_i, e_j) \mid \mathrm{type}(e_i) \neq \mathrm{type}(e_j)\}$.
Entity Type Combination Retrieval: Given the query “What are the similarities and differences between embroidery in China’s Song and Ming dynasties?”, the system performs the following steps (sketched in code after this list):
1.
Entity Type Analysis: Geographic (China) + Temporal (Song, Ming) + Artistic Practice (embroidery)
2.
Combination Path Construction: Generate two precise retrieval paths:
  • Path 1: China + Song Dynasty + embroidery;
  • Path 2: China + Ming Dynasty + embroidery.
3.
Interference Filtering: Avoid retrieving semantically similar but imprecise content.
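A minimal sketch of this combination-path construction, assuming entity typing from the Strategy I extractor: `itertools.product` takes one entity per semantic type, so same-type entities (Song, Ming) yield separate precise paths rather than one blurred query.

```python
from itertools import product

def build_retrieval_paths(typed_entities: dict[str, list[str]]) -> list[tuple[str, ...]]:
    """Combine one entity per semantic type into a precise retrieval path, so
    that same-type entities spawn separate paths instead of a single fuzzy query."""
    type_groups = [typed_entities[t] for t in sorted(typed_entities)]
    return list(product(*type_groups))

# Entities extracted from: "What are the similarities and differences
# between embroidery in China's Song and Ming dynasties?"
query_entities = {
    "geographical": ["China"],
    "temporal": ["Song Dynasty", "Ming Dynasty"],
    "artistic_practice": ["embroidery"],
}
for path in build_retrieval_paths(query_entities):
    print(path)
# ('embroidery', 'China', 'Song Dynasty')
# ('embroidery', 'China', 'Ming Dynasty')
```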

3.2.3. Hybrid Ranking Mechanism

Content ranking employs a weighted scoring system that prioritizes entity type combination matching:
$\mathrm{Score} = \alpha \cdot \mathrm{TCM} + \beta \cdot \mathrm{EC} + (1 - \alpha - \beta) \cdot \mathrm{VS}$,
where TCM represents Type Combination Match, EC denotes Entity Coverage, and VS refers to Vector Similarity computed through cosine similarity.
We set α = 0.6 to prioritize type-consistent matching with moderate cross-type flexibility, and β = 0.2 to enable partial entity matching while emphasizing semantic similarity for contextually relevant retrieval.
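The weighting can be read directly from the formula: with $\alpha = 0.6$ and $\beta = 0.2$, vector similarity receives the remaining 0.2. A direct sketch follows, assuming all three component scores are pre-normalized to [0, 1].

```python
def hybrid_score(tcm: float, ec: float, vs: float,
                 alpha: float = 0.6, beta: float = 0.2) -> float:
    """Score = alpha*TCM + beta*EC + (1 - alpha - beta)*VS, with all three
    components assumed normalized to [0, 1]."""
    return alpha * tcm + beta * ec + (1.0 - alpha - beta) * vs

# A candidate whose entity-type combination exactly matches the query path
# dominates one that is merely semantically similar:
print(hybrid_score(tcm=1.0, ec=0.5, vs=0.70))  # ≈ 0.84
print(hybrid_score(tcm=0.0, ec=0.5, vs=0.95))  # ≈ 0.29
```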

3.3. Strategy III: Answer Verification and Update Strategy

Strategy III implements the answer verification mechanism stage, addressing the lack of output validation in existing RAG/KAG systems through dual verification criteria and dynamic updating capabilities.

3.3.1. Semantic-Entity Coverage Verification

Generated responses are segmented into semantic units based on Markdown structure. For each semantic unit, the system computes verification metrics:
$\mathrm{Sim} = \frac{\mathbf{answer} \cdot \mathbf{reference}}{\lVert \mathbf{answer} \rVert \cdot \lVert \mathbf{reference} \rVert}$,
$\mathrm{Coverage} = \frac{\lvert \mathrm{Entities}_{\mathrm{answer}} \cap \mathrm{Entities}_{\mathrm{reference}} \rvert}{\lvert \mathrm{Entities}_{\mathrm{reference}} \rvert}$.
Validation employs dual thresholds: $\mathrm{Sim} > 0.85$ and $\mathrm{Coverage} > 0.80$. Responses failing these criteria trigger regeneration with expanded reference materials.
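A minimal sketch of the dual-threshold check over one semantic unit, assuming embeddings and entity sets produced by the earlier stages; the surrounding regeneration loop is sketched separately in Section 3.3.2.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_unit(answer_vec: np.ndarray, reference_vec: np.ndarray,
                answer_entities: set[str], reference_entities: set[str],
                theta_sim: float = 0.85, theta_cov: float = 0.80) -> bool:
    """Dual verification: semantic overlap and entity coverage must both pass;
    a failing unit triggers regeneration with expanded reference material."""
    sim = cosine(answer_vec, reference_vec)
    coverage = (len(answer_entities & reference_entities) / len(reference_entities)
                if reference_entities else 1.0)
    return sim > theta_sim and coverage > theta_cov
```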

3.3.2. Dynamic Graph Update Mechanism

Unlike conventional RAG systems that cannot perceive whether the knowledge base adequately supports question answering, MDKAG implements a systematic insufficiency detection mechanism. When the entity coverage falls below the threshold ( Coverage < 0.80 ), the system recognizes that retrieval results are insufficient to support reliable answer generation. After multiple attempts to increase the number of reference materials fail to improve coverage scores, MDKAG determines that the knowledge base lacks sufficient content to ensure response validity. At this point, the system initiates updates to the Multimodal Disciplinary Knowledge Graph (MDKG) through three strategic approaches:
Incremental Knowledge Extraction: Utilizes existing pre-trained models for targeted knowledge expansion when new literature becomes available, maintaining consistency with existing data while systematically addressing identified knowledge gaps through focused extraction processes.
User Feedback Integration: Incorporates expert reviews and user corrections through collaborative mechanisms, ensuring content accuracy and professional relevance while adapting to evolving domain knowledge based on real-world usage patterns and identified deficiencies.
Specialized Vocabulary Enhancement: Constructs dynamic terminology lists based on extraction results and expert guidance, improving domain-specific entity recognition and guiding LLM extraction processes to address vocabulary gaps that contribute to insufficient coverage scores.
This systematic approach ensures that knowledge graph updates are triggered by objective insufficiency indicators rather than arbitrary schedules, maintaining both knowledge currency and response reliability in educational question-answering systems.
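Putting the thresholds together, the insufficiency-detection loop can be sketched as follows. Here `retrieve_refs`, `generate`, and `coverage_of` are hypothetical stand-ins for the retrieval, generation, and verification components above, and the increment of two sources per retry follows the setting used in Section 5.3.

```python
def answer_or_flag_update(question: str, retrieve_refs, generate, coverage_of,
                          theta_cov: float = 0.80, max_attempts: int = 3,
                          k0: int = 5) -> tuple[str, bool]:
    """Expand the reference material until entity coverage passes; if repeated
    expansion still fails, flag the MDKG for an incremental update."""
    k, answer = k0, ""
    for _ in range(max_attempts):
        refs = retrieve_refs(question, top_k=k)
        answer = generate(question, refs)
        if coverage_of(answer, refs) >= theta_cov:
            return answer, False   # coverage sufficient: no update needed
        k += 2                     # incrementally add 2 more sources and retry
    return answer, True            # knowledge base insufficient: trigger update
```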

4. Experimental Setup

4.1. Datasets

We evaluate MDKAG’s capabilities on three educational domains with diverse multimodal content characteristics. The data scale involved in this study is shown in Table 1. The experimental data consists of three courses: “Principles of Digital Television” (textbook + syllabus), “Costume History” (textbook + PPT), and “Calligraphy Appreciating” (classroom recordings).

4.2. Baselines

We compare MDKAG against several established retrieval-augmented generation approaches: traditional RAG using the open-source Langchain-Chatchat project (https://github.com/chatchat-space/Langchain-Chatchat, accessed on 14 August 2025) based on the Langchain framework (https://github.com/langchain-ai/langchain, accessed on 14 August 2025), and KAG (https://github.com/OpenSPG/KAG, accessed on 14 August 2025) with OpenSPG (https://github.com/OpenSPG/openspg, accessed on 14 August 2025) as the preprocessing component. These baselines represent widely adopted approaches to retrieval-augmented question answering in the educational domain.

4.3. Metrics

We evaluate system performance using standard information retrieval and QA metrics for comprehensive assessment across different components of the MDKAG framework.

4.3.1. Entity Extraction Evaluation

For entity extraction tasks in Strategy I, we employ standard classification metrics:
$\mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + TN + FP + FN}$,
$\mathrm{Recall\ (R)} = \frac{TP}{TP + FN}$,
$\text{F1-score} = \frac{2 \times \mathrm{ACC} \times \mathrm{Recall}}{\mathrm{ACC} + \mathrm{Recall}}$,
where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively.
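A direct transcription of these definitions is given below; note that this F1 combines ACC with recall, whereas the conventional F1 uses precision in place of ACC.

```python
def extraction_metrics(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float, float]:
    """ACC, Recall, and F1 exactly as defined above. Note: this F1 uses ACC in
    place of the precision term of the conventional F1."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    f1 = 2 * acc * recall / (acc + recall)
    return acc, recall, f1

# Example: 88 true positives, 2 false positives, 10 false negatives, 0 true negatives
print(extraction_metrics(tp=88, tn=0, fp=2, fn=10))  # ≈ (0.880, 0.898, 0.889)
```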

4.3.2. Semantic Similarity and Entity Coverage Assessment

For retrieval effectiveness evaluation and answer verification, we employ the semantic similarity and entity coverage metrics defined in Section 3.3.1. These metrics assess the quality of generated answers and retrieved content by measuring both semantic coherence and factual completeness.

4.3.3. End-to-End QA Evaluation

For comprehensive system evaluation, we employ a 5-point scoring system to assess overall answer quality:
$\mathrm{Score\ Rate} = \frac{\sum_{i=1}^{N} S_i}{N \times 5}$,
where $N$ is the total number of questions and $S_i \in \{1, 2, 3, 4, 5\}$ represents the score for the $i$-th answer. For answer quality assessment, we employ both domain experts and automated evaluation using QwQ-32B, which demonstrated the best performance among off-the-shelf models in the Strategy I evaluation (Section 5.1) and, as the model with the strongest semantic understanding in this study, serves as a reliable automated judge alongside the domain-expert assessments.

4.3.4. Answer Validation Criteria

The answer validation mechanism employs dual verification criteria with semantic similarity threshold $\theta_{\mathrm{sim}} = 0.85$ and entity coverage threshold $\theta_{\mathrm{cov}} = 0.80$:
$\mathrm{Valid}_{\mathrm{sem}} = \begin{cases} 1, & \text{if } \mathrm{Sim}(s_i, \mathrm{ref}) \geq \theta_{\mathrm{sim}} \\ 0, & \text{otherwise,} \end{cases}$
$\mathrm{Valid}_{\mathrm{ent}} = \begin{cases} 1, & \text{if } \mathrm{Coverage} \geq \theta_{\mathrm{cov}} \\ 0, & \text{otherwise,} \end{cases}$
$\mathrm{Answer}_{\mathrm{valid}} = \mathrm{Valid}_{\mathrm{sem}} \land \mathrm{Valid}_{\mathrm{ent}}$.

4.4. Implementation Details

The model training environment ran Ubuntu with an Intel(R) Xeon(R) W-2123 CPU @ 3.60 GHz and a GeForce RTX 2080 Ti Rev. A GPU. Large language models were accessed through the Coze platform API or the corresponding official APIs. The Transformer-based uie-m-large model [23] was used for entity extraction and knowledge graph construction.

5. Results

5.1. Strategy I: Multimodal Knowledge Fusion and Graph Construction

To evaluate the effectiveness of Strategy I in multimodal knowledge fusion and graph construction, we conducted a comprehensive entity extraction evaluation using manually annotated validation sets. For each educational domain, 100 data samples without image links were randomly selected and manually annotated to establish ground truth. The evaluation employed instruction-driven prompts for entity extraction, with evaluation criteria based on successful entity identification rather than strict type consistency, ensuring fair comparison across different approaches.
The comparative analysis, presented in Table 2, demonstrates the performance of Strategy I against six state-of-the-art large language models across three educational domains. Our approach utilizes GPT-3.5 as the base model enhanced with Strategy I (denoted as “GPT-3.5 + Strategy I”) to demonstrate the framework’s capability to improve weaker baseline models.
As shown in Table 2, MDKAG significantly enhances entity extraction performance across all three educational domains. The method raises the relatively weak knowledge-point extraction capability of GPT-3.5 to a level comparable to that of QwQ-32B, demonstrating its potential to strengthen knowledge graph construction.

5.2. Strategy II: Relevance-Prioritized Retrieval Strategy

To validate the effectiveness of this approach, we employed the 3 × 100 data samples from Section 5.1 and generated a comprehensive question bank using the prompt “Design a question-and-answer pair within 20 characters for this content.” We conducted semantic similarity comparisons among the RAG, KAG, and MDKAG results against the top-3 retrieved text blocks using the bge-large-zh-v1.5 model. The similarity scores ranged from 0.87 to 0.95 with minimal inter-method differences (<0.08), indicating comparable semantic alignment across all approaches.
For a more comprehensive evaluation, we employed entity coverage rate as a key metric, calculating the proportion of entities extracted from generated results relative to reference texts. The mean coverage rates were 0.624 for RAG, 0.702 for KAG, and 0.712 for MDKAG, as illustrated in Figure 2. Notably, MDKAG demonstrated superior stability with significantly higher minimum coverage rates, indicating its distinct advantage in comprehensive knowledge-point retrieval and highlighting the effectiveness of the relevance-prioritized retrieval strategy.

5.3. Strategy III: Answer Verification and Update Strategy

Using Strategy II results as initial answers, we extracted entities and reference materials, then analyzed generated responses by semantic segments based on Markdown structure. The system employs dual verification criteria: semantic similarity threshold (85%) and entity coverage threshold (80%). When thresholds are not met, the system automatically adds more reference content (incrementally adding 2 sources in this study) and regenerates answers.
As shown in Table 3, Strategy III consistently enhances performance across all methods, with MDKAG achieving the highest final score of 0.751. The verification mechanism shows domain-specific effectiveness: technical subjects such as “Principles of Digital Television” benefit significantly from iterative refinement, while content-rich domains such as “Costume History” show smaller but consistent improvements. This validates the adaptive nature of the verification strategy in addressing knowledge insufficiency through systematic content expansion.
In summary, Figure 3 showcases MDKAG in action. The screenshot presents the user question, the system’s answer grounded in the multimodal disciplinary knowledge graph, and the supporting multimodal evidence—demonstrating accurate end-to-end QA under Strategies I–III and comprehensive coverage of relevant materials.

6. Discussion

The MDKAG framework proposed in this study has achieved significant effectiveness in knowledge-augmented generation within the educational domain, and these findings warrant in-depth exploration within a broader academic context.

6.1. Theoretical Breakthrough and Practical Value of Multimodal Knowledge Integration

MDKAG successfully integrates multimodal educational resources ranging from textbooks and courseware to video recordings, achieving an average F1 score of 0.8481, a 22.37% improvement over the original performance of GPT-3.5. This breakthrough resonates with the emphasis by Edge et al. on the importance of graph structures for knowledge organization in GraphRAG, while further addressing the technical challenges of multimodal data fusion in educational scenarios. Cross-disciplinary experimental validation confirmed the framework’s adaptability, though performance variations across domains suggest the need for further research into domain-specific knowledge representation methods. Although entity extraction remains dependent on LLM performance and computational complexity grows with knowledge scale, the instruction-driven entity-extraction method demonstrates superior flexibility compared to traditional predefined schemas.

6.2. Educational Significance of Retrieval Strategy Innovation and Answer Verification Mechanisms

The advantage of the relevance-priority retrieval strategy in entity coverage (0.7124 vs. RAG’s 0.6239) confirms the superiority of graph-structured retrieval over traditional vector retrieval, consistent with findings by Gutiérrez et al. in HippoRAG. The dual-mechanism verification framework (85% semantic similarity and 80% entity coverage thresholds) improved MDKAG’s average score from 0.6933 to 0.7511, holding significant educational implications: in knowledge-intensive educational applications, answer accuracy and consistency directly impact learning effectiveness. This verification framework not only reduces hallucination phenomena but also establishes a sustainable knowledge update mechanism, which is crucial for the timeliness and accuracy of educational content.

6.3. Generalization Across Disciplines and Languages

MDKAG was designed from the outset for broad generality and low adaptation cost. First, we have illustrated entity-extraction cases across multiple disciplines—including economics, politics, philosophy, history, mathematics, physics, and computer science. Coupled with a general-purpose LLM, this yields reasonable performance even when in-domain exemplars are limited; where coverage gaps arise, a light human-in-the-loop pass (case curation and prompt refinement) can quickly improve results without changing the pipeline. Second, we compress and harmonize entity-type inventories via synonym/near-synonym consolidation and upper-level ontology abstraction. This reduces indexing and update complexity and supports cross-disciplinary synthesis: onboarding a new subject (e.g., medicine or law) amounts to concatenating its type list with the existing inventory, de-duplicating, and mapping to upper-level categories; Strategies I–III then proceed unchanged. The resulting unified type set facilitates the linking and reuse of overlapping knowledge points across disciplines.
For cross-lingual deployment, the extraction, retrieval, and verification procedures remain intact in principle. At larger scales, the two-tier query strategy (entity-adjacency expansion followed by vector refinement) preserves near-linear retrieval behavior as content grows, while maintaining the accuracy and traceability observed in our experiments.

6.4. Future Research Directions and Technological Development Prospects

Based on the findings of this study, future research should focus on:
  • Personalized learning path generation, leveraging knowledge graph structural information to customize adaptive education for learners;
  • Cross-lingual knowledge graph construction to promote international educational resource sharing;
  • Real-time knowledge update mechanisms, developing efficient incremental learning algorithms;
  • Cognitive load optimization, integrating educational psychology theories to optimize knowledge presentation;
  • Large-scale deployment studies, exploring performance optimization in practical applications.
While MDKAG provides a new technical pathway for the educational technology field, achieving true digital transformation in education requires continued in-depth research across multiple aspects, including technical optimization, educational theory integration, and practical application.

7. Conclusions

The effective organization and application of educational resources represents a significant technical challenge in education. MDKAG constructs multimodal domain-specific knowledge graphs from textbooks, instructional videos, and presentations, thereby enhancing the accuracy of large language models in educational question-answering tasks. The main contributions are threefold: First, this paper introduces domain-specific ontology design and multimodal knowledge graph construction within the KAG framework. Second, an entity-driven retrieval strategy is proposed that leverages query entity extraction to locate relevant domain knowledge. Third, the framework presents an entity coverage-based answer verification mechanism with dynamic update capabilities, supporting fine-grained, coarse-grained, and complete reconstruction approaches.

Author Contributions

Conceptualization, X.Z.; methodology, X.Z., G.W. and Y.L.; software, X.Z.; validation, X.Z. and Y.L.; formal analysis, X.Z.; investigation, X.Z. and Y.L.; resources, X.Z.; data curation, X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z. and G.W.; visualization, X.Z.; supervision, G.W.; project administration, G.W.; and funding acquisition, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Education and Scientific Research Project of Shanghai, China (grant number C2023100). The APC was funded by the same project (C2023100).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, J.; Leng, F.; Wu, W.; Bao, Y. Construction Method of Textbook Knowledge Graph Based on Multimodal and Knowledge Distillation. J. Front. Comput. Sci. Technol. 2024, 18, 2901–2911. [Google Scholar]
  2. Liu, Z.; Wang, P.; Song, X.; Zhang, X.; Jiang, B. Survey on Hallucinations in Large Language Models. J. Softw. 2025, 36, 1152–1185. [Google Scholar] [CrossRef]
  3. He, J.; Shen, Y.; Xie, R. Recognition and Optimization of Hallucination Phenomena in Large Language Models. J. Comput. Appl. 2025, 45, 709–714. [Google Scholar]
  4. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474. [Google Scholar]
  5. Cheng, Y. Research on Hallucination in Controlled Text Generation. Master’s Thesis, University of Electronic Science and Technology of China, Chengdu, China, 2024. [Google Scholar]
  6. Liang, L.; Bo, Z.; Gui, Z.; Zhu, Z.; Zhong, L.; Zhao, P.; Sun, M.; Zhang, Z.; Zhou, J.; Chen, W.; et al. KAG: Boosting LLMs in professional domains via knowledge augmented generation. In Proceedings of the Companion Proceedings of the ACM on Web Conference, Sydney, Australia, 28 April–2 May 2025; pp. 334–343. [Google Scholar]
  7. Qika, L.; Lingling, Z.; Jun, L.; Tianzhe, Z. Question-aware Graph Convolutional Network for Educational Knowledge Base Question Answering. J. Front. Comput. Sci. Technol. 2021, 15, 1880–1887. [Google Scholar]
  8. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From local to global: A graph RAG approach to query-focused summarization. arXiv 2024, arXiv:2404.16130. [Google Scholar] [CrossRef]
  9. Guo, Z.; Xia, L.; Yu, Y.; Ao, T.; Huang, C. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv 2024, arXiv:2410.05779. [Google Scholar]
  10. Gutiérrez, B.J.; Shu, Y.; Gu, Y.; Yasunaga, M.; Su, Y. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  11. Choubey, P.K.; Su, X.; Luo, M.; Peng, X.; Xiong, C.; Le, T.; Rosenman, S.; Lal, V.; Mui, P.; Ho, R.; et al. Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency. arXiv 2024, arXiv:2410.16597. [Google Scholar]
  12. Luo, H.; Chen, G.; Zheng, Y.; Wu, X.; Guo, Y.; Lin, Q.; Feng, Y.; Kuang, Z.; Song, M.; Zhu, Y.; et al. HyperGraphRAG: Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation. arXiv 2025, arXiv:2503.21322. [Google Scholar]
  13. Li, Y.; Liang, Y.; Yang, R.; Qiu, J.; Zhang, C.; Zhang, X. CourseKG: An educational knowledge graph based on course information for precision teaching. Appl. Sci. 2024, 14, 2710. [Google Scholar] [CrossRef]
  14. Yadav, V.; Bethard, S. A survey on recent advances in named entity recognition from deep learning models. arXiv 2019, arXiv:1910.11470. [Google Scholar] [CrossRef]
  15. Jiarui, L.; Huayu, L.; Yang, Y. Construction of Discipline Knowledge Graph for Multi-Source Heterogeneous Data Sources. Comput. Syst. Appl. 2021, 30, 59–67. [Google Scholar] [CrossRef]
  16. Melnyk, I.; Dognin, P.; Das, P. Knowledge graph generation from text. arXiv 2022, arXiv:2211.10511. [Google Scholar] [CrossRef]
  17. Lairgi, Y.; Moncla, L.; Cazabet, R.; Benabdeslem, K.; Cléau, P. iText2KG: Incremental knowledge graphs construction using large language models. In Proceedings of the International Conference on Web Information Systems Engineering, Doha, Qatar, 2–5 December 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 214–229. [Google Scholar]
  18. Drori, I.; Zhang, S.J.; Shuttleworth, R.; Zhang, S.; Tyser, K.; Chin, Z.; Lantigua, P.; Surbehera, S.; Hunter, G.; Austin, D.; et al. From human days to machine seconds: Automatically answering and generating machine learning final exams. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 3947–3955. [Google Scholar]
  19. Zheng, T.; Li, W.; Bai, J.; Wang, W.; Song, Y. Assessing the Robustness of Retrieval-Augmented Generation Systems in K-12 Educational Question Answering with Knowledge Discrepancies. arXiv 2024, arXiv:2412.08985. [Google Scholar] [CrossRef]
  20. Alier, M.; García-Peñalvo, F.; Camba, J.D. Generative artificial intelligence in education: From deceptive to disruptive. Int. J. Interact. Multimed. Artif. Intell. 2024, 8, 5–14. [Google Scholar] [CrossRef]
  21. Dong, C.; Yuan, Y.; Chen, K.; Cheng, S.; Wen, C. How to Build an Adaptive AI Tutor for Any Course Using Knowledge Graph-Enhanced Retrieval-Augmented Generation (KG-RAG). In Proceedings of the 2025 14th International Conference on Educational and Information Technology (ICEIT), Guangzhou, China, 14–16 March 2025; pp. 152–157. [Google Scholar]
  22. Fatehkia, M.; Lucas, J.K.; Chawla, S. T-RAG: Lessons from the LLM trenches. arXiv 2024, arXiv:2402.07483. [Google Scholar] [CrossRef]
  23. Lu, Y.; Liu, Q.; Dai, D.; Xiao, X.; Lin, H.; Han, X.; Sun, L.; Wu, H. Unified structure generation for universal information extraction. arXiv 2022, arXiv:2203.12277. [Google Scholar] [CrossRef]
Figure 1. MDKAG framework overview.
Figure 2. Entity coverage rate comparison across retrieval strategies: RAG’s recall strategy, KAG’s recall strategy, and Strategy II (MDKAG).
Figure 3. Overall MDKAG QA workflow and output (screenshot).
Table 1. Statistics of multimodal data.

| Subject | Text Segments | Images | Other Modalities |
|---|---|---|---|
| Principles of Digital Television | 1269 | 279 | 0 |
| Costume History | 900 | 564 | 0 |
| Calligraphy Appreciating | 1034 | 0 | 12 |
Table 2. Entity-extraction comparison between Strategy I and direct LLM outputs.

| Subject | Model | ACC | Recall | F1 |
|---|---|---|---|---|
| Principles of Digital Television | DeepSeek-R1 | 0.805 | 0.920 | 0.859 |
| | Kimi | 0.756 | 0.885 | 0.815 |
| | Doubao | 0.641 | 0.846 | 0.729 |
| | GPT-4o | 0.749 | 0.822 | 0.784 |
| | QwQ-32B | 0.781 | 0.911 | 0.840 |
| | GPT-3.5 | 0.671 | 0.791 | 0.726 |
| | GPT-3.5 + Strategy I | 0.880 | 0.896 | 0.888 |
| Costume History | DeepSeek-R1 | 0.867 | 0.730 | 0.793 |
| | Kimi | 0.781 | 0.766 | 0.773 |
| | Doubao | 0.640 | 0.820 | 0.719 |
| | GPT-4o | 0.751 | 0.791 | 0.770 |
| | QwQ-32B | 0.800 | 0.860 | 0.829 |
| | GPT-3.5 | 0.557 | 0.783 | 0.651 |
| | GPT-3.5 + Strategy I | 0.749 | 0.843 | 0.794 |
| Calligraphy Appreciating | DeepSeek-R1 | 0.842 | 0.910 | 0.875 |
| | Kimi | 0.771 | 0.860 | 0.813 |
| | Doubao | 0.610 | 0.831 | 0.704 |
| | GPT-4o | 0.721 | 0.843 | 0.777 |
| | QwQ-32B | 0.843 | 0.913 | 0.877 |
| | GPT-3.5 | 0.714 | 0.785 | 0.748 |
| | GPT-3.5 + Strategy I | 0.857 | 0.869 | 0.863 |
| Average | DeepSeek-R1 | 0.838 | 0.854 | 0.842 |
| | Kimi | 0.769 | 0.837 | 0.800 |
| | Doubao | 0.630 | 0.832 | 0.717 |
| | GPT-4o | 0.740 | 0.819 | 0.777 |
| | QwQ-32B | 0.808 | 0.895 | 0.849 |
| | GPT-3.5 | 0.648 | 0.787 | 0.708 |
| | GPT-3.5 + Strategy I | 0.829 | 0.869 | 0.848 |
Table 3. Answer quality comparison before and after Strategy III implementation (5-point scale).

| Subject | Method | Before Strategy III | After Strategy III |
|---|---|---|---|
| Principles of Digital Television | RAG | 0.627 | 0.733 (+16.9%) |
| | KAG | 0.653 | 0.760 (+16.4%) |
| | MDKAG | 0.667 | 0.787 (+18.0%) |
| Costume History | RAG | 0.680 | 0.693 (+1.9%) |
| | KAG | 0.720 | 0.720 (+0.0%) |
| | MDKAG | 0.746 | 0.760 (+1.9%) |
| Calligraphy Appreciating | RAG | 0.653 | 0.653 (+0.0%) |
| | KAG | 0.640 | 0.706 (+10.3%) |
| | MDKAG | 0.667 | 0.706 (+5.8%) |
| Average | RAG | 0.653 | 0.693 (+6.1%) |
| | KAG | 0.671 | 0.729 (+8.6%) |
| | MDKAG | 0.693 | 0.751 (+8.4%) |