CDE: A Concept-Driven Joint Extraction Method for Computer Science Textbooks
Abstract
1. Introduction
1. A concept-driven knowledge unit generation mechanism is proposed, which extracts core concepts from textbook chapters and constructs concept-enhanced instructions; the model’s sensitivity to domain-specific terminology is heightened via attention re-weighting.
2. A schema-constrained structured generation mechanism is devised, wherein entity- and relation-type schemas are injected into the instruction templates; this, combined with decoding constraint strategies, ensures that the output triple sequences strictly adhere to JSON Schema specifications.
3. An efficient parameter adaptation strategy leveraging LoRA is adopted, which performs domain-specific fine-tuning while freezing the majority of the base model parameters; this approach significantly reduces training overhead and mitigates model hallucination.
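The parameter saving behind the LoRA-based adaptation in the third contribution can be illustrated with a minimal sketch. It assumes the standard LoRA formulation from Hu et al. [24], in which a frozen weight matrix is augmented by a scaled low-rank product; the class name `LoRALinear`, the rank, and the scaling value are hypothetical choices for illustration and are not the configuration used for CDE.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy linear layer y = W0 x + (alpha / r) * B (A x), with W0 frozen.

    Hypothetical illustration only; not the paper's implementation.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, never updated
        self.rank = rank
        self.alpha = alpha
        # A starts with small random values and B with zeros, so the
        # adapted layer initially reproduces the base layer's output.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        scale = self.alpha / self.rank
        return [y + scale * u for y, u in zip(base, update)]

    def trainable_params(self):
        d_out, d_in = len(self.weight), len(self.weight[0])
        return self.rank * (d_in + d_out)  # entries of A plus entries of B

    def frozen_params(self):
        return len(self.weight) * len(self.weight[0])
```

For a layer with a d_out × d_in weight matrix, only r(d_in + d_out) values are trained instead of d_out · d_in, which is where the reduction in training overhead comes from when r is small relative to the layer dimensions.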
2. Methods
2.1. Overall Framework of the Proposed Method
2.2. Concept-Driven Knowledge Unit Generation Mechanism
2.2.1. Construction of Conceptual Priors
2.2.2. Construction of Concept-Enhanced Instructions
2.2.3. Concept-Driven Attention Re-Weighting
2.3. Schema-Constrained Structured Generation Mechanism
2.3.1. Schema Design
2.3.2. Schema Injection and Structured Instruction Template
2.3.3. Schema-Constrained Decoding Strategy
2.4. Efficient Parameter Adaptation via LoRA
3. Experimental Design and Result Analysis
3.1. Dataset Construction
3.2. Experimental Setup
3.3. Analysis of Training Process
3.4. Comparative Experiments
- BERT-BiLSTM-CRF [9] integrates the BERT pre-trained language model with a BiLSTM-CRF architecture, performing named entity recognition first followed by relation extraction, thereby constituting a conventional pipeline-based approach for information extraction.
- CasRel [11] reformulates relation extraction as a relation-specific cascaded binary tagging process for subjects and objects, enabling joint extraction of entities and relations within a unified framework.
- TPLinker [12] employs a handshaking tagging strategy to transform joint extraction into a token-pair link prediction task, thereby facilitating the joint extraction of overlapping relations and complex semantic structures.
- Qwen3-4B [25] is an open-source language model from the Qwen3 series with 4B parameters, possessing robust capabilities in general text understanding and generation and supporting relatively long contextual inputs.
- UIE [17] introduces structured prompts as unified task instructions and incorporates schema-based contrastive learning to achieve unified modeling across diverse information extraction tasks.
- EDC [19] designs a three-stage framework termed “Extract-Define-Canonicalize,” which addresses the challenge of large-scale schemas exceeding the context window of LLMs through open information extraction, schema definition, and post hoc canonicalization.
- LKD-KGC [20] autonomously analyzes document corpora, infers knowledge dependencies, and autoregressively generates entity schemas through LLM-driven knowledge dependency parsing. Furthermore, it incorporates entity-linking information from external knowledge bases during the relation classification stage to enhance the model’s discriminative capacity for domain-specific relations.
3.5. Ablation Study
3.6. Case Study
3.7. Knowledge Graph Construction Instance
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Liu, J.; Leng, F.; Wu, W.; Bao, Y. A method for constructing textbook knowledge graphs based on multimodality and knowledge distillation. J. Front. Comput. Sci. Technol. 2024, 18, 2901–2911.
2. Liu, Q.; Li, Y.; Duan, H.; Liu, Y.; Qin, Z. A Survey of Knowledge Graph Construction Techniques. J. Comput. Res. Dev. 2016, 53, 582–600.
3. Li, Z.; Zhou, D. Research on Conceptual Model and Construction Methods of Educational Knowledge Graph. e-Educ. Res. 2019, 40, 78–86.
4. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
5. Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 207–212.
6. Chan, Y.S.; Roth, D. Exploiting background knowledge for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010); Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 152–160.
7. Chen, P.; Lu, Y.; Zheng, V.W.; Chen, X.; Yang, B. KnowEdu: A system to construct knowledge graph for education. IEEE Access 2018, 6, 31553–31563.
8. Zou, X.; Lin, H.; Wu, J.; Zheng, C.; Guan, Q. Constructing a knowledge graph for the database course group via deep learning. In Proceedings of the 2024 13th International Conference on Educational and Information Technology (ICEIT), Chengdu, China, 22–24 March 2024; pp. 334–339.
9. Li, N.; Shen, Q.; Song, R.; Chi, Y.; Xu, H. MEduKG: A deep-learning-based approach for multi-modal educational knowledge graph construction. Information 2022, 13, 91.
10. Miwa, M.; Bansal, M. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 1105–1116.
11. Wei, Z.; Su, J.; Wang, Y.; Tian, Y.; Chang, Y. A novel cascade binary tagging framework for relational triple extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1476–1488.
12. Wang, Y.; Yu, B.; Zhang, Y.; Liu, T.; Zhu, H.; Sun, L. TPLinker: Single-stage joint extraction of entities and relations through token pair linking. In Proceedings of the 28th International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1572–1582.
13. Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; Xu, B. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1227–1236.
14. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
15. Son, J.; Kim, J.; Lim, J.; Lim, H.S. GRASP: Guiding model with RelAtional semantics using prompt for dialogue relation extraction. In Proceedings of the 29th International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 412–423.
16. Chen, X.; Zhang, N.; Xie, X.; Deng, S.; Yao, Y.; Tan, C.; Huang, F.; Si, L.; Chen, H. KnowPrompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In Proceedings of the ACM Web Conference 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2778–2788.
17. Lu, Y.; Liu, Q.; Dai, D.; Xiao, X.; Lin, H.; Han, X.; Sun, L.; Wu, H. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5755–5772.
18. Chia, Y.K.; Bing, L.; Poria, S.; Si, L. RelationPrompt: Leveraging prompts to generate synthetic data for zero-shot relation triplet extraction. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 45–57.
19. Zhang, B.; Soh, H. Extract, define, canonicalize: An LLM-based framework for knowledge graph construction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9820–9836.
20. Sun, J.; Qian, S.; Han, Z.; Li, W.; Qian, Z.; Yang, D.; Cao, J.; Xue, G. LKD-KGC: Domain-specific KG construction via LLM-driven knowledge dependency parsing. arXiv 2025, arXiv:2505.24163.
21. Lu, Y.; Wu, W.; Zhao, X.; Peng, R.; Wang, J. KARMA: Leveraging multi-agent LLMs for automated knowledge graph enrichment. arXiv 2025, arXiv:2502.06472.
22. Kuculo, T.; Abdollahi, S.; Gottschalk, S. Transformer-Based Architectures Versus Large Language Models in Semantic Event Extraction: Evaluating Strengths and Limitations. Semant. Web 2025, 16, 22104968251363759.
23. Popovic, N.; Kangen, A.; Schopf, T.; Färber, M. DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations. In Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 298–309.
24. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022; pp. 1–3.
25. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388.







| Textbook Category | Entity Type | Relation Type |
|---|---|---|
| Operating Systems | OS Module, Interface, Mechanism, Algorithm, Process, Thread, IPC, and Security | PartOf, Implements, Supports, DependsOn, UsedFor, Allocates, Protects, Schedules, and DefinesAs |
| Computer Organization and Architecture | Computer System, Bus, Cache, ALU, CPU, Register, Instruction Set, I/O Interface, DMA, and Interrupt | PartOf, IsA, Provides, Implements, DependsOn, Handles, Optimizes, and UsedFor |
| Software Engineering | Process Model, Requirement, Project Artifact, UML Diagram, Design Principle, and Testing Technique | PartOf, IsA, DefinesAs, Produces, Implements, Verifies, Validates, Uses, Measures, Manages, and Mitigates |
| Data Structures | Linear List, Stack, Queue, Tree Structure, Graph Structure, Hash Structure, and Search Algorithm | PartOf, IsA, Implements, Supports, BasedOn, MapsTo, HasProperty, Optimizes, Traverses, Orders, and Solves |

| Instruction Template |
|---|
| You are a professional information extraction expert responsible for extracting entities and relations from textbooks. Based on the following example, use the predefined entity types and relation types to extract entities and relations within the “Operating Systems” domain from the provided text. The output must be a JSON list, where each JSON object contains the keys: “entity1”, “entity_type1”, “relation”, “entity2”, “entity_type2”. |
| Entity type set: {entity_schema} |
| Relation type set: {relation_schema} |
| Example: “A time-sharing system allows multiple users to interactively use the computer, and the system employs time-slice rotation to ensure timely response.” |
| Output: {"entity1": "time-sharing system", "entity_type1": "OS Type", "relation": "Implements", "entity2": "time-slice rotation", "entity_type2": "Mechanism"} |
| Text: {text} |

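Filling the instruction template above amounts to substituting the three placeholders `{entity_schema}`, `{relation_schema}`, and `{text}`. The following is a minimal sketch under stated assumptions: the `TEMPLATE` string abbreviates the instruction wording, the comma-separated serialization of the type sets is a guess at the format, and `build_prompt` and `OS_SCHEMA` are hypothetical names. The type lists are copied from the Operating Systems row of the schema table.

```python
# Entity and relation types from the "Operating Systems" row of the schema table.
OS_SCHEMA = {
    "entity_types": ["OS Module", "Interface", "Mechanism", "Algorithm",
                     "Process", "Thread", "IPC", "Security"],
    "relation_types": ["PartOf", "Implements", "Supports", "DependsOn", "UsedFor",
                       "Allocates", "Protects", "Schedules", "DefinesAs"],
}

# Abbreviated version of the instruction template (assumption: wording shortened).
TEMPLATE = (
    'Extract entities and relations within the "Operating Systems" domain '
    "from the provided text. The output must be a JSON list.\n"
    "Entity type set: {entity_schema}\n"
    "Relation type set: {relation_schema}\n"
    'Output format: {"entity1": ..., "entity_type1": ..., "relation": ..., '
    '"entity2": ..., "entity_type2": ...}\n'
    "Text: {text}"
)


def build_prompt(template, schema, text):
    """Substitute the three placeholders with plain string replacement.

    str.format is avoided on purpose: the template contains literal JSON
    braces, which str.format would try to interpret as fields.
    """
    filled = template.replace("{entity_schema}", ", ".join(schema["entity_types"]))
    filled = filled.replace("{relation_schema}", ", ".join(schema["relation_types"]))
    # The input text is substituted last so that braces inside it are left alone.
    return filled.replace("{text}", text)


prompt = build_prompt(TEMPLATE, OS_SCHEMA, "A semaphore protects a critical section.")
```

Substituting the input text last keeps any placeholder-like strings in a textbook passage from being expanded by the earlier replacements.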
| Model | Acc | Precision | Recall | F1 |
|---|---|---|---|---|
| BERT-BiLSTM-CRF | 0.6812 ± 0.011 | 0.7021 ± 0.015 | 0.6487 ± 0.019 | 0.6743 ± 0.012 |
| CasRel | 0.7583 ± 0.009 | 0.7721 ± 0.010 | 0.7456 ± 0.013 | 0.7586 ± 0.008 |
| TPLinker | 0.7718 ± 0.007 | 0.7846 ± 0.009 | 0.7621 ± 0.011 | 0.7732 ± 0.007 |
| Qwen3-4B | 0.5964 ± 0.021 | 0.6231 ± 0.023 | 0.5714 ± 0.028 | 0.5961 ± 0.019 |
| UIE | 0.7425 ± 0.012 | 0.7553 ± 0.014 | 0.7311 ± 0.016 | 0.7430 ± 0.011 |
| EDC | 0.7632 ± 0.009 | 0.7741 ± 0.009 | 0.7564 ± 0.012 | 0.7651 ± 0.010 |
| LKD-KGC | 0.7931 ± 0.008 | 0.8005 ± 0.011 | 0.7867 ± 0.012 | 0.7936 ± 0.009 |
| CDE | 0.8171 ± 0.006 | 0.8245 ± 0.007 | 0.8122 ± 0.009 | 0.8183 ± 0.006 |

| Model | Precision | Recall | F1 |
|---|---|---|---|
| CDE | 0.8245 | 0.8122 | 0.8183 |
| w/o Concept | 0.7812 | 0.7689 | 0.7750 |
| w/o Schema | 0.7524 | 0.7671 | 0.7597 |
| w/o LoRA | 0.7031 | 0.6415 | 0.6709 |
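The F1 column in the ablation table is consistent with the harmonic mean of the reported precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it and derives the F1 drop for each removed component; `f1_score`, `ABLATION`, and `f1_drops` are hypothetical helper names, and the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the ablation table.
ABLATION = {
    "CDE": (0.8245, 0.8122),
    "w/o Concept": (0.7812, 0.7689),
    "w/o Schema": (0.7524, 0.7671),
    "w/o LoRA": (0.7031, 0.6415),
}


def f1_drops(results, full="CDE"):
    """F1 of the full model minus F1 of each ablated variant."""
    full_f1 = f1_score(*results[full])
    return {
        name: round(full_f1 - f1_score(p, r), 4)
        for name, (p, r) in results.items()
        if name != full
    }


drops = f1_drops(ABLATION)
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: F1 drop {drop:.4f}")
```

On the reported numbers, removing LoRA costs the most F1, followed by removing the schema constraint and then the concept mechanism.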

| Input Text | Baseline Model Output | CDE Model Output |
|---|---|---|
| “A process is the basic unit of resource allocation and scheduling in an operating system.” | {"entity1":"process", "relation":"is", "entity2":"basic unit"} | {"knowledge_units":[{"entity1":"process", "relation":"DefinesAs", "entity2":"the basic unit."}, {"entity1":"Process", "relation":"UsedFor", "entity2":"Resource Allocation"}, {"entity1":"Operating System", "relation":"Schedules", "entity2":"Process"}]} |
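One way to read the case study is to check each output's relation labels against the predefined Operating Systems relation set. The sketch below is a simple post hoc check, not the schema-constrained decoding strategy itself; `out_of_schema_relations` is a hypothetical helper, and the two output strings are copied from the case study table.

```python
import json

# Relation types from the "Operating Systems" row of the schema table.
OS_RELATIONS = {"PartOf", "Implements", "Supports", "DependsOn", "UsedFor",
                "Allocates", "Protects", "Schedules", "DefinesAs"}

BASELINE_OUTPUT = '{"entity1":"process", "relation":"is", "entity2":"basic unit"}'
CDE_OUTPUT = (
    '{"knowledge_units":['
    '{"entity1":"process", "relation":"DefinesAs", "entity2":"the basic unit."},'
    '{"entity1":"Process", "relation":"UsedFor", "entity2":"Resource Allocation"},'
    '{"entity1":"Operating System", "relation":"Schedules", "entity2":"Process"}]}'
)


def out_of_schema_relations(raw_output, allowed=OS_RELATIONS):
    """Return the relation labels in a model output that are not in the schema."""
    data = json.loads(raw_output)
    if isinstance(data, dict) and "knowledge_units" in data:
        triples = data["knowledge_units"]  # wrapped list of triples
    elif isinstance(data, dict):
        triples = [data]  # a single bare triple
    else:
        triples = data  # already a list of triples
    return [t.get("relation") for t in triples if t.get("relation") not in allowed]


print(out_of_schema_relations(BASELINE_OUTPUT))  # ['is']
print(out_of_schema_relations(CDE_OUTPUT))       # []
```

The baseline's free-form relation "is" falls outside the schema, whereas all three relations in the CDE output belong to the predefined set.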
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yusufu, A.; Shen, H.; Zhong, X.; Liu, J.; Ainiwaer, A.; Yusufu, A. CDE: A Concept-Driven Joint Extraction Method for Computer Science Textbooks. Appl. Sci. 2026, 16, 5961. https://doi.org/10.3390/app16125961

