CDE: A Concept-Driven Joint Extraction Method for Computer Science Textbooks

Yusufu, Aizierguli; Shen, Hongxu; Zhong, Xiucheng; Liu, Jiang; Ainiwaer, Abidan; Yusufu, Aizihaierjiang

doi:10.3390/app16125961

by Aizierguli Yusufu ^1,2,†, Hongxu Shen ^1,†, Xiucheng Zhong ^1, Jiang Liu ^3,*, Abidan Ainiwaer ^4 and Aizihaierjiang Yusufu ^1,3,*

^1 School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, China
^2 Xinjiang Engineering Technology Research Center for Smart Education and Application, Urumqi 830054, China
^3 Key Laboratory of Aerospace Information Security and Trusted Computing, School of Cyber Science and Engineering, Wuhan University, Wuhan 430000, China
^4 School of Journalism and Communication, Xinjiang University, Urumqi 830046, China
^* Authors to whom correspondence should be addressed.
^† These authors contributed equally to this work.

Appl. Sci. 2026, 16(12), 5961; https://doi.org/10.3390/app16125961 (registering DOI)

Submission received: 5 May 2026 / Revised: 26 May 2026 / Accepted: 3 June 2026 / Published: 12 June 2026

Abstract

Addressing the challenges of dense conceptual content and intricate knowledge relations in computer science textbooks, where traditional pipeline-based information extraction suffers from error propagation and semantic decoupling, this paper proposes a concept-driven joint extraction method termed CDE (Concept-Driven Extraction). First, the model’s ability to focus on domain-specific terminology is enhanced through conceptual priors and attention re-weighting. This is integrated with a predefined schema and structured instruction templates to achieve normalized output for both entities and relations. Second, efficient domain knowledge transfer for computer science textbooks is realized by performing Low-Rank Adaptation (LoRA) fine-tuning on the Qwen3-4B large language model. Finally, the computer science textbook knowledge graph is constructed using the Neo4j graph database. On a self-constructed instruction dataset of computer science textbooks, CDE achieves an F1 score of 81.83%, an improvement of 2.47 percentage points over the LKD-KGC baseline. This performance significantly surpasses that of traditional pipeline models and existing joint extraction approaches. Experimental results demonstrate that CDE can effectively improve knowledge extraction accuracy in the textbook domain, thereby providing a novel research avenue for the rapid construction of knowledge graphs for computer science educational materials.

Keywords: knowledge graph; large language model; joint extraction

1. Introduction

Textbook knowledge graphs [1] organize and represent concepts, knowledge points, and their interrelations within textbooks in a graph-structured format, wherein nodes denote various entities present in the textbook and edges describe the relationships among these entities, thereby clearly reflecting the intrinsic knowledge structure of the textbook content. With the deepening advancement of digital transformation in education, textbook knowledge graphs have emerged as a pivotal approach for organizing and comprehending domain-specific knowledge, and by structuring dispersed knowledge points within textbooks into a coherent and systematic knowledge framework, they provide essential knowledge support for educational applications such as intelligent question answering and personalized learning path recommendation [2,3].

In recent years, research on textbook knowledge graphs, both domestically and internationally, has predominantly focused on critical tasks such as entity recognition [4] and relation extraction [5], with the overarching goal of automatically identifying entities from textbook text and extracting the semantic relations that exist among them. Computer science textbooks are typically characterized by dense conceptual content, an abundance of specialized terminology, and intricate knowledge relations; when processing such texts, conventional pipeline-based approaches that decouple named entity recognition and relation extraction are prone to issues such as error propagation and semantic decoupling between the two tasks, which ultimately compromise the overall extraction performance [6]. For instance, the KnowEdu method [7] employs a sequence labeling strategy to identify entities within the educational domain and concurrently leverages association rule mining techniques to derive relations among these entities. Zou et al. [8] integrated database textbook content with MOOC resources and utilized a pre-trained BERT model to mine semantic connections between knowledge points, thereby constructing an educational knowledge graph. The MEduKG method proposed by Li et al. [9] first utilizes a BiLSTM-CRF model to accomplish educational entity recognition and subsequently incorporates positional information into a BERT model to perform relation extraction, ultimately yielding a multimodal educational knowledge graph. Nevertheless, traditional pipeline-based extraction approaches generally suffer from error propagation and semantic decoupling, which restrict both the accuracy of triple extraction and the overall quality of knowledge graph construction.

To address these limitations, researchers have progressively shifted toward the paradigm of joint entity and relation extraction [10], which effectively mitigates the aforementioned issues by simultaneously outputting entities and their relations within a unified framework. For example, CasRel [11] employs a cascaded binary tagging framework for triple extraction, wherein subject entities are first identified within a sentence, and then, conditioned on a given subject, corresponding object entities are predicted for each relation; TPLinker [12] adopts a handshaking tagging strategy that reformulates joint extraction as a token-pair link prediction task, enabling the simultaneous recognition of entities and relations through single-stage modeling. However, the majority of these methods are grounded in supervised learning and heavily rely on high-quality annotated corpora [13], a reliance that not only incurs substantial manual annotation costs and inefficiencies but also struggles to accommodate the rapidly evolving concepts and terminologies inherent in computer science textbooks, thereby constraining model generalizability and scalability.

With the rapid development of large language models (LLMs) in natural language understanding and generation, the task of information extraction is undergoing a novel technical paradigm shift: by reformulating the extraction task as a structured sequence generation problem, LLMs can simultaneously output entity–relation triples within a unified framework [14]. For instance, Son et al. [15] designed a prompt-based learning guidance mechanism termed GRASP, which explicitly models relational semantics in dialogues, thereby effectively enhancing the model’s capacity to comprehend complex semantic structures and improving the accuracy of dialogue relation extraction. Chen et al. [16] devised a knowledge-aware prompt-tuning method called KnowPrompt, which injects structured knowledge of entities and relations into LLMs and jointly optimizes the representations of template words and answer words, thereby bolstering the model’s relation extraction performance under few-shot settings. The Universal Information Extraction (UIE) framework proposed by Lu et al. [17] introduces structured prompts as unified task instructions, combined with schema-based contrastive learning, to achieve unified modeling across diverse information extraction tasks. RelationPrompt [18] utilizes a prompt-template-based synthetic data generation strategy to achieve zero-shot relational triple extraction; by designing task-oriented prompts, it guides the LLM to automatically generate high-quality textual relation pairs, thus augmenting the model’s ability to recognize novel relation types. Zhang et al. [19] designed an “Extract-Define-Canonicalize” (EDC) three-stage framework, which addresses the challenge of large-scale schemas exceeding the LLM context window through open information extraction, schema definition, and post hoc canonicalization, enabling the extraction of high-quality triples without parameter fine-tuning. Sun et al. [20] proposed an unsupervised, domain-specific knowledge graph construction framework that autonomously analyzes document corpora, infers knowledge dependencies, and autoregressively generates entity schemas through LLM-driven knowledge dependency parsing. Lu et al. [21] introduced KARMA, a multi-agent framework that leverages nine collaborative agents to perform tasks including entity discovery, relation extraction, schema alignment, and conflict resolution, thereby automatically identifying novel entities from unstructured text and expanding knowledge graphs.

However, existing research predominantly targets general-domain texts, such as news articles [22] and encyclopedic entries [23]; direct application of these methods to information extraction in the domain of computer science textbooks reveals notable limitations: (1) Fine-tuning-free frameworks rely solely on the general knowledge inherent in LLMs and lack the ability to model the unique core concepts found in computer science textbooks, making them prone to overlooking key terms or generating hallucinations that result in inaccurate triples. (2) Although existing methods incorporate external knowledge for relation classification, their output formats do not directly conform to predefined schemas, making the generated results difficult to use directly for knowledge graph construction tasks. (3) The models lack efficient parameter adaptation mechanisms when processing computer science domain knowledge, resulting in insufficient adaptability.

To address these issues, this paper proposes a concept-driven joint extraction method called CDE (Concept-Driven Extraction). The core innovations of CDE are as follows: (1) introducing a concept-driven mechanism that actively extracts core concepts from textbooks to guide the model’s attention distribution, rather than passively relying on the general knowledge of LLMs; (2) designing a schema-constrained generation strategy that injects predefined entity and relation types as hard constraints into the LLM’s decoding process, ensuring the standardization of the output; and (3) adopting a lightweight adaptation strategy that uses LoRA [24] to efficiently fine-tune the Qwen3-4B [25] model, injecting domain knowledge at a very low parameter cost.

The main contributions of this paper include the following:

1. A concept-driven knowledge unit generation mechanism is proposed, which extracts core concepts from textbook chapters and constructs concept-enhanced instructions; the model’s sensitivity to domain-specific terminology is heightened via attention re-weighting.
2. A schema-constrained structured generation mechanism is devised, wherein entity- and relation-type schemas are injected into the instruction templates; this, combined with decoding constraint strategies, ensures that the output triple sequences strictly adhere to JSON Schema specifications.
3. An efficient parameter adaptation strategy leveraging LoRA is adopted, which performs domain-specific fine-tuning while freezing the base model parameters; this approach significantly reduces training overhead and mitigates model hallucination.

2. Methods

2.1. Overall Framework of the Proposed Method

The overall architecture of CDE is illustrated in Figure 1, comprising three core components: an input layer, a model layer, and an output layer. First, textbook text segments, chapter-level conceptual priors, and predefined schema constraints are integrated to formulate an enhanced prompt, thereby directing the model’s attention toward critical entities and relations. Subsequently, the structured instruction template is fed into a Qwen3-4B model fine-tuned via LoRA, which leverages the Transformer’s self-attention mechanism for semantic encoding and generation. Finally, the model produces a structured sequence of knowledge triples under the constraints of a JSON Schema. CDE reframes information extraction as an instruction-driven structured sequence generation task, where the input consists of natural language text and the output is a schema-compliant JSON array, thereby achieving end-to-end joint extraction.

2.2. Concept-Driven Knowledge Unit Generation Mechanism

2.2.1. Construction of Conceptual Priors

For the content of each chapter in computer science textbooks, core concepts are extracted from the table of contents, chapter titles, and subsection summaries. Through deduplication, normalization, and domain alignment, a standardized and unified concept set is constructed:

C = \{c_{1}, c_{2}, \dots, c_{K}\}

(1)

where each entry represents a core domain-specific term, such as “Process Control Block,” “Binary Tree,” and “Bubble Sort.”

To align the concepts with the word embeddings of the large language model, each concept is embedded into a vector representation, yielding its corresponding concept vector:

v_{c_{i}} = Embed (c_{i}) \in R^{d}

(2)

where Embed (\cdot) denotes the model’s word embedding function and d denotes the dimensionality of the model’s word embedding vectors.

Stacking the concept vectors yields the matrix representation of the complete concept set:

C = [v_{c_{1}}; v_{c_{2}}; \dots; v_{c_{K}}] \in R^{K \times d}

(3)
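The deduplication and normalization step can be illustrated with a minimal sketch. The function names `normalize_concept` and `build_concept_set` and the specific normalization rules (Unicode NFKC, stripping leading section numbers, case-insensitive deduplication) are assumptions made for illustration; the paper does not specify the exact rules, and the embedding step of Equation (2) is omitted because it depends on the model’s own embedding layer.

```python
import re
import unicodedata


def normalize_concept(term: str) -> str:
    """Normalize one raw heading or term (illustrative rules, not the authors')."""
    term = unicodedata.normalize("NFKC", term)
    # Drop a leading section number such as "3.1 " or "2. " (must be followed by whitespace).
    term = re.sub(r"^\s*\d+(\.\d+)*\.?\s+", "", term)
    # Collapse internal whitespace and trim surrounding punctuation.
    term = re.sub(r"\s+", " ", term).strip(" .,:;")
    return term


def build_concept_set(raw_terms):
    """Deduplicate case-insensitively, keeping the first-seen surface form and order."""
    seen = set()
    concepts = []
    for raw in raw_terms:
        term = normalize_concept(raw)
        key = term.casefold()
        if term and key not in seen:
            seen.add(key)
            concepts.append(term)
    return concepts


raw = ["3.1 Process Control Block", "process control block ", "Binary  Tree", "Bubble Sort.", ""]
concepts = build_concept_set(raw)
print(concepts)
```

The resulting list plays the role of the concept set in Equation (1); each entry would then be passed through the embedding function to obtain its concept vector.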

2.2.2. Construction of Concept-Enhanced Instructions

Traditional instructions contain only task descriptions and input text. CDE explicitly concatenates the concept set into the instruction, thereby forming a concept-enhanced instruction:

I^{enh} = Format (I_{task}, X, C)

(4)

where I_{task} denotes the generic task description, X denotes the input textbook segment, and Format (\cdot) represents a template-based organization function that integrates the task description, core concepts, and textbook segments into a unified instruction text. This instruction explicitly informs the model which concepts are central to the given chapter, thereby guiding the model to prioritize attention to these terms and their relationships during the generation process.
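As a rough illustration of Equation (4), the template function can be sketched as plain string assembly. The section headings and the name `format_instruction` are hypothetical; the actual template used by CDE is the one shown in the paper’s figures and is not reproduced here.

```python
def format_instruction(task: str, text: str, concepts: list[str]) -> str:
    """Assemble a concept-enhanced instruction from its three parts (hypothetical layout)."""
    concept_line = "; ".join(concepts)
    return (
        f"### Task\n{task}\n\n"
        f"### Core concepts of this chapter\n{concept_line}\n\n"
        f"### Text\n{text}"
    )


instruction = format_instruction(
    task="Extract entity-relation triples from the text.",
    text="A process is the basic unit of resource allocation and scheduling.",
    concepts=["Process", "Process Control Block"],
)
print(instruction)
```

Keeping the concepts in a dedicated section, rather than mixing them into the task description, makes it easy to swap the concept set per chapter while reusing the same task text.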

2.2.3. Concept-Driven Attention Re-Weighting

To further enforce the conceptual constraint, CDE performs attention re-weighting during the decoding phase. Let the hidden state representation of the text be denoted as [h_{1}, h_{2}, \dots, h_{n}]; the maximum cosine similarity between each position j and the concept vectors is computed as

α_{j} = max_{i} sim (h_{j}, v_{c_{i}})

(5)

where sim (\cdot, \cdot) denotes the cosine similarity. Subsequently, α_{j} is min-max normalized and mapped to the interval [1, 1 + λ] to obtain the concept weights:

w_{j} = 1 + λ \cdot \frac{α_{j} - min (α)}{max (α) - min (α)}

(6)

During decoding, the self-attention weight matrices are re-weighted as follows:

{\tilde{A}}_{t, j} = {softmax}_{j} (A_{t, j} \cdot w_{j})

(7)

where A_{t, j} denotes the original attention score assigned by the model to position j of the text when generating token t, and {\tilde{A}}_{t, j} represents the attention distribution after incorporating the conceptual constraint.

In summary, the core concepts extracted from the textbook chapters are first mapped to a set of concept vectors via word embeddings. Then, the maximum cosine similarity between the hidden state at each position in the input text and all concept vectors is calculated, normalized, and mapped to the interval [1, 1 + λ] to obtain the concept weight for each position. Finally, these weights are multiplied by the original attention scores, followed by softmax normalization, thereby generating an attention distribution enhanced by conceptual bias. This mechanism ensures that text positions associated with core concepts receive higher attention weights, effectively suppressing interference from irrelevant information, as illustrated in Figure 2.
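Equations (5) to (7) can be traced on a toy example with the following pure-Python sketch. The function names and the toy vectors are illustrative assumptions, as is the choice to return uniform weights when all similarities are equal (the paper does not state how a zero denominator in Equation (6) is handled).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def concept_weights(hidden, concept_vecs, lam):
    """Eq. (5) and (6): max similarity per position, min-max mapped to [1, 1 + lam]."""
    alpha = [max(cosine(h, c) for c in concept_vecs) for h in hidden]
    lo, hi = min(alpha), max(alpha)
    if hi == lo:
        return [1.0] * len(alpha)  # assumption: no re-weighting when all positions tie
    return [1.0 + lam * (a - lo) / (hi - lo) for a in alpha]


def reweight(scores, weights):
    """Eq. (7): scale the attention scores of one query row, then apply softmax."""
    scaled = [s * w for s, w in zip(scores, weights)]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states h_1..h_3
concept_vecs = [[1.0, 0.0]]                    # a single toy concept vector
weights = concept_weights(hidden, concept_vecs, lam=0.5)
attention = reweight([1.0, 1.0, 1.0], weights)
print(weights, attention)
```

In this toy case the position aligned with the concept vector receives the largest weight and therefore the largest share of attention. Note that multiplying scores by a weight above 1 raises a score only when the score is positive, so the sketch uses positive toy scores.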

2.3. Schema-Constrained Structured Generation Mechanism

2.3.1. Schema Design

To accommodate the knowledge variations across different textbooks, CDE designs a dedicated entity-type schema and relation-type schema for each textbook, as presented in Table 1.

2.3.2. Schema Injection and Structured Instruction Template

The predefined schema set is explicitly incorporated into the instruction template and integrated with the concept-enhanced instruction to form a complete structured instruction template. As illustrated in Figure 3, this instruction not only includes the task description, core concepts, and input text but also explicitly specifies the permitted entity types and relation types, thereby ensuring that the model is constrained by the schema during both inference and generation.
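A minimal sketch of schema injection is given below. It appends the permitted types and an output-format instruction to an existing instruction string. The function name, the section headings, the field names (`head`, `head_type`, `relation`, `tail`, `tail_type`), and the example types are all hypothetical; the schema actually used for each textbook is the one defined in Table 1.

```python
def add_schema_section(instruction, entity_types, relation_types):
    """Append the allowed entity and relation types to an instruction (hypothetical layout)."""
    lines = [
        instruction.rstrip(),
        "",
        "### Allowed entity types",
        ", ".join(entity_types),
        "",
        "### Allowed relation types",
        ", ".join(relation_types),
        "",
        "### Output format",
        'Return only a JSON array of objects with the keys "head", "head_type", '
        '"relation", "tail", and "tail_type". Use no other types.',
    ]
    return "\n".join(lines)


prompt = add_schema_section(
    "### Task\nExtract entity-relation triples from the text.",
    entity_types=["Concept", "DataStructure"],    # illustrative types only
    relation_types=["has_part", "is_a"],          # illustrative types only
)
print(prompt)
```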

2.3.3. Schema-Constrained Decoding Strategy

During the decoding phase, this paper further devises a constrained generation strategy centered on the schema. Given the input X, the enhanced instruction I^{enh}, and the model parameters θ, the generation distribution of the output sequence Y is expressed as

P (Y ∣ X, I^{enh}; θ)

(8)

When the schema constraint S is introduced, the output space is restricted to

Y_{S} = \{Y ∣ Y ⊧ S\}

(9)

where Y ⊧ S denotes that the output sequence Y satisfies the schema S.

The generation process under the constrained condition can then be formulated as

\hat{Y} = \underset{Y \in Y_{S}}{argmax} P (Y ∣ X, I^{enh}; θ)

(10)

During the decoding phase, CDE imposes constraints at three hierarchical levels: (1) Vocabulary-level constraint: When generating structural symbols such as key names, brackets, and commas, the candidate token set is restricted to prevent invalid field names. (2) Structural-level constraint: The JSON bracket hierarchy and array positions are tracked in real time to avoid structural imbalances. (3) Post-processing repair mechanism: The model output undergoes JSON parsing; in the event of failure, missing fields are automatically supplemented or extraneous fields are removed to ensure that the result remains parsable.
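The third level, post-processing repair, can be sketched as follows. This is a simplified illustration under stated assumptions: the field names in `REQUIRED` are hypothetical, and the sketch discards triples with missing fields or out-of-schema types instead of supplementing them as described above. The token-level and structural constraints of levels (1) and (2) act inside the decoder and are not shown.

```python
import json

# Hypothetical field names for one triple object; the paper does not list them here.
REQUIRED = ("head", "head_type", "relation", "tail", "tail_type")


def repair_output(raw, entity_types, relation_types):
    """Parse model output, salvage the JSON array if needed, and keep schema-valid triples."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: try the span between the first "[" and the last "]".
        start, end = raw.find("["), raw.rfind("]")
        if start == -1 or end <= start:
            return []
        try:
            data = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return []
    if not isinstance(data, list):
        return []
    clean = []
    for item in data:
        if not isinstance(item, dict):
            continue
        triple = {key: item.get(key) for key in REQUIRED}  # extraneous fields are dropped
        if any(not isinstance(value, str) or not value for value in triple.values()):
            continue  # assumption: incomplete triples are discarded rather than completed
        if (triple["head_type"] in entity_types
                and triple["tail_type"] in entity_types
                and triple["relation"] in relation_types):
            clean.append(triple)
    return clean


raw = (
    'Here is the result: [{"head": "Process", "head_type": "Concept", '
    '"relation": "has_part", "tail": "Process Control Block", '
    '"tail_type": "DataStructure", "note": "extra"}, '
    '{"head": "A", "relation": "invented"}] done'
)
triples = repair_output(raw, {"Concept", "DataStructure"}, {"has_part"})
print(triples)
```

Filtering after parsing guarantees that everything passed downstream is parsable and uses only permitted types, at the cost of silently losing malformed triples.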

2.4. Efficient Parameter Adaptation via LoRA

To achieve efficient and low-cost model training, this paper adopts the Low-Rank Adaptation (LoRA) fine-tuning technique. The core principle involves freezing the pre-trained weights of the model and introducing a low-rank matrix decomposition, whereby only a small set of additional parameters is trained to approximate the weight updates. A schematic illustration of the LoRA mechanism is presented in Figure 4.

For a pre-trained weight matrix W_{0} \in R^{d \times k}, its update is constrained to a low-rank decomposition form:

H = W_{0} x + Δ W x = W_{0} x + B A x

(11)

where x denotes the layer input, H denotes the layer output, B \in R^{d \times r} and A \in R^{r \times k} are the low-rank decomposition matrices, and the rank satisfies r ≪ min (d, k). During the fine-tuning process, the parameters W_{0} of the original model are completely frozen, while only the two newly introduced low-rank matrices A and B remain trainable. The LoRA algorithm enables the efficient injection of domain knowledge from computer science textbooks into the model with minimal computational cost and without altering the base model parameters. This facilitates precise adaptation to the specific triple extraction task while introducing negligible additional latency during inference.
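Equation (11) and the parameter saving it implies can be checked with a small pure-Python sketch. The toy matrices and the dimensions d = k = 1024 are illustrative assumptions, not the layer sizes of Qwen3-4B.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def lora_forward(W0, B, A, x):
    """Eq. (11): frozen path W0 x plus low-rank path B (A x)."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def param_counts(d, k, r):
    """Return (parameters of a full update, trainable parameters of the LoRA update)."""
    return d * k, r * (d + k)


# Toy example with d = 3, k = 2, r = 1.
W0 = [[1, 0], [0, 1], [1, 1]]
B = [[1], [2], [0]]
A = [[1, -1]]
x = [2, 3]

H = lora_forward(W0, B, A, x)
BA = matmul(B, A)
merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
H_merged = matvec(merged, x)  # same output after merging BA into W0

full, lora = param_counts(d=1024, k=1024, r=8)
print(H, H_merged, full, lora, lora / full)
```

The merged form explains the negligible inference latency: once training is done, B A can be added into W_{0}, so the adapted layer costs the same as the original one. For the assumed 1024 × 1024 layer with r = 8, the trainable update is about 1.6% of the size of a full update.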

3. Experimental Design and Result Analysis

3.1. Dataset Construction

This paper selects four computer science textbooks—Operating Systems, Computer Organization and Architecture, Introduction to Software Engineering, and Data Structures—as data sources. An information extraction dataset is constructed using a strategy of “large language model pre-annotation plus manual verification and correction,” which is subsequently converted into an instruction dataset for instruction fine-tuning of the model. First, the textbooks are preprocessed, including the removal of headers, footers, and formatting errors, as well as segmentation into corpus samples. Second, candidate triples are automatically generated by prompting a large language model with a pre-annotation template applied to the textbook samples. Domain experts then conduct spot checks on the generated outputs to verify accuracy and domain relevance. Erroneous annotations identified during verification are fed back into the pre-annotation template to iteratively refine prompts, constraint rules, and output formats, followed by regeneration of pre-annotations. The pre-annotation prompt template is presented in Table 2.

Since large language models may exhibit biases in identifying the boundaries of domain-specific terminology, distinguishing fine-grained relationships, and inferring implicit relationships, this paper builds upon pre-annotated results by introducing manual verification and iterative feedback. Four master’s students specializing in knowledge graphs were assigned to manually review a random sample of 30% of the data. For triples where discrepancies arose, the final label was determined through majority voting. This process primarily encompasses the following aspects: (1) verifying the completeness of entity boundaries to prevent truncation or over-expansion of technical terms, (2) verifying that relationships are supported by textual evidence to prevent the model from making excessive inferences, and (3) ensuring that the output structure complies with predefined JSON specifications.
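The majority-voting step can be sketched in a few lines. With four reviewers a 2:2 tie is possible; the paper does not say how ties are resolved, so the sketch below returns `None` in that case to flag the triple for further discussion, which is an assumption. The function name and the labels are illustrative.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label, or None when the top two labels tie."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # assumption: ties are escalated rather than decided automatically
    return counts[0][0]


print(majority_label(["keep", "keep", "drop", "keep"]))
print(majority_label(["keep", "drop", "keep", "drop"]))
```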

A total of 31,926 triples were ultimately obtained as the information extraction dataset, which was subsequently converted into an instruction format, yielding an instruction dataset comprising 2219 samples. The dataset was partitioned into training, validation, and test sets at an 8:1:1 ratio, providing a data foundation for subsequent model training and evaluation. An example from the instruction dataset is illustrated in Figure 5.
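An 8:1:1 partition can be sketched as a seeded shuffle followed by slicing. The seed, the function name, and the rounding rule (floor for the training and validation sets, remainder to the test set) are assumptions; the paper does not report the exact size of each split.

```python
import random


def split_dataset(samples, seed=42):
    """Shuffle reproducibly and split into train/validation/test at an 8:1:1 ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(2219))
print(len(train), len(val), len(test))
```

Under this rounding rule, 2219 samples would be divided into 1775, 221, and 223 samples, respectively.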

3.2. Experimental Setup

The experiments are run on a remote server with Ubuntu 20.04.5 LTS and an NVIDIA GeForce RTX 4090 GPU with 24 GB of video memory. The development environment uses Python 3.10. Lightweight LoRA fine-tuning is conducted on the Qwen3-4B model using the LLaMA-Factory framework. The training hyperparameters are set as follows: number of epochs = 3, initial learning rate = 2 × 10^{-4}, maximum sequence length = 2048, batch size = 8, LoRA α = 16, and LoRA rank = 8. In addition, the rsLoRA mechanism is employed, and the AdamW optimizer is adopted.
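To make the role of α, the rank, and rsLoRA concrete: in common LoRA implementations the low-rank update B A x is multiplied by a scaling factor, which is α / r for standard LoRA and α / √r for rank-stabilized LoRA (rsLoRA). Equation (11) omits this factor, so the sketch below reflects the usual convention and is not a formula taken from the paper; the function name is illustrative.

```python
import math


def lora_scale(alpha, r, rslora=False):
    """Scaling factor applied to the low-rank update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    return alpha / math.sqrt(r) if rslora else alpha / r


standard = lora_scale(16, 8)
stabilized = lora_scale(16, 8, rslora=True)
print(standard, round(stabilized, 3))
```

With α = 16 and rank 8, the standard factor is 2.0, while the rsLoRA factor is roughly 5.657, so the adapter contributes a larger update at the same α.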

To evaluate the performance of the proposed CDE method on the task of information extraction from computer science textbooks, accuracy, precision, recall, and F1 score are adopted as the primary evaluation metrics. Their formal definitions are as follows:

Accuracy = \frac{T P + T N}{T P + T N + F P + F N}

(12)

Precision = \frac{T P}{T P + F P}

(13)

Recall = \frac{T P}{T P + F N}

(14)

F 1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

(15)

where TP denotes the number of true positives, TN denotes the number of true negatives, FP denotes the number of false positives, and FN denotes the number of false negatives.
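Equations (13) to (15) can be applied to triple extraction as in the sketch below. It assumes exact matching on (head, relation, tail) tuples, which is a common convention but is not stated explicitly in the paper; the function name and the toy triples are illustrative.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 over sets of (head, relation, tail) triples (exact match)."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)   # predicted triples that are in the gold set
    fp = len(pred - ref)   # predicted triples that are not in the gold set
    fn = len(ref - pred)   # gold triples that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [
    ("Process", "is_a", "Program Execution"),
    ("Process", "has_part", "Process Control Block"),
    ("Process", "unit_of", "Resource Allocation"),
    ("Process", "unit_of", "Scheduling"),
]
predicted = [
    ("Process", "has_part", "Process Control Block"),
    ("Process", "unit_of", "Scheduling"),
    ("Process", "is_a", "Thread"),
]
precision, recall, f1 = prf1(predicted, gold)
print(precision, recall, f1)
```

Here two of three predictions are correct and two of four gold triples are found, giving a precision of 2/3, a recall of 1/2, and an F1 score of 4/7.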

3.3. Analysis of Training Process

To further evaluate the convergence characteristics of CDE on the information extraction task for computer science textbooks, the original loss curve and its smoothed counterpart recorded during LoRA fine-tuning are presented in Figure 6. Within the first 50 steps, the loss rapidly decreases from 5.6 to approximately 0.5, indicating that the concept-enhanced instructions and schema constraints enable the model to adapt swiftly to the task. Between steps 200 and 600, the loss oscillates between 0.1 and 0.2. After 800 steps, the loss stabilizes and converges to around 0.05. This indicates that the model has sufficiently learned the knowledge structure inherent in the textbooks and has attained a favorable convergence state. The smoothed curve shown in the figure clearly illustrates this trend.

3.4. Comparative Experiments

This paper selects several representative models for comparative experiments. To ensure the stability of the results, all experiments were independently repeated five times, and the mean and standard deviation were used as the final performance metrics. In addition, a paired t-test was used to assess the statistical significance of the differences between the CDE and the baseline models, with the experimental results presented in Table 3.

BERT-BiLSTM-CRF [9] integrates the BERT pre-trained language model with a BiLSTM-CRF architecture, performing named entity recognition first followed by relation extraction, thereby constituting a conventional pipeline-based approach for information extraction.
CasRel [11] reformulates relation extraction as a relation-specific cascaded binary tagging process for subjects and objects, enabling joint extraction of entities and relations within a unified framework.
TPLinker [12] employs a handshaking tagging strategy to transform joint extraction into a token-pair link prediction task, thereby facilitating the joint extraction of overlapping relations and complex semantic structures.
Qwen3-4B [25] is an open-source language model from the Qwen3 series with 4B parameters, possessing robust capabilities in general text understanding and generation and supporting relatively long contextual inputs.
UIE [17] introduces structured prompts as unified task instructions and incorporates schema-based contrastive learning to achieve unified modeling across diverse information extraction tasks.
EDC [19] designs a three-stage framework termed “Extract-Define-Canonicalize,” which addresses the challenge of large-scale schemas exceeding the context window of LLMs through open information extraction, schema definition, and post hoc canonicalization.
LKD-KGC [20] autonomously analyzes document corpora, infers knowledge dependencies, and autoregressively generates entity schemas through LLM-driven knowledge dependency parsing. Furthermore, it incorporates entity-linking information from external knowledge bases during the relation classification stage to enhance the model’s discriminative capacity for domain-specific relations.

From the data perspective, CDE achieves an F1 score of 81.83%, representing an improvement of 2.47 percentage points over LKD-KGC and approximately 4.5 percentage points over TPLinker. This indicates that, within the textbook domain, the introduction of conceptual priors and schema constraints can effectively guide the model to capture domain-specific semantic structures and mitigate relation-type confusion and boundary prediction errors. The non-fine-tuned Qwen3-4B exhibits the weakest performance, with an F1 score of merely 59.61%, suggesting that relying solely on general language understanding capabilities is insufficient for handling the dense and implicitly expressed relations in textbook content, thus underscoring the necessity of domain adaptation. The pipeline-based BERT-BiLSTM-CRF method suffers from error propagation, yielding an F1 score of only 67.43%, which is significantly lower than that of various joint extraction models. This further corroborates the essential role of the joint extraction framework in mitigating error transmission.

3.5. Ablation Study

To investigate the contribution of each core module within CDE, an ablation study is conducted. Using the complete CDE framework as the baseline, different model variants are constructed by sequentially removing three core components: the model without the concept-driven mechanism (w/o Concept), the model without schema constraints (w/o Schema), and the model without LoRA fine-tuning (w/o LoRA). The performance of these three variant models and the complete CDE framework is evaluated on the self-constructed instruction dataset for information extraction from computer science textbooks, with the F1 score serving as the primary evaluation metric. The experimental results are presented in Table 4.

The results in Table 4 show the contribution of each module to overall performance. Removing conceptual priors lowers the F1 score by 4.33 percentage points, with a particularly pronounced decline in recall, indicating that conceptual guidance is crucial for entity recall. Removing schema constraints lowers the F1 score by 5.86 percentage points, accompanied by a notable drop in precision, which underscores the pivotal role of structured constraints in mitigating output format errors and confusion among relation labels. Removing LoRA fine-tuning causes the sharpest decline in F1, corroborating the necessity of domain-specific parameter adaptation.

3.6. Case Study

To visually demonstrate the effectiveness of CDE, this section presents a case study using the definition of “process” from the textbook Operating Systems. The results of comparing the outputs of CDE and the baseline model are shown in Table 5.

The comparison shows that the baseline model captured only broad “is-a” relationships, whereas CDE, leveraging its concept-driven mechanism, was able to accurately identify core concepts, such as “resource allocation” and “scheduling,” and generated multiple fine-grained, well-structured triples within the constraints of the schema.

3.7. Knowledge Graph Construction Instance

To enable efficient storage, querying, and visual analysis of the structured knowledge, this study employs the Neo4j graph database for constructing the computer science textbook knowledge graph. The triple sets generated by CDE are consolidated and subsequently imported efficiently into the database using the bulk loading functionality provided by Neo4j. In this process, nodes in the graph represent the head and tail entities of the triples, while directed edges connecting the nodes denote the relations within the triples. Ultimately, a total of 17,708 entity nodes, 73 distinct relation types, and 31,926 triples are constructed, thereby completing the development of the computer science textbook knowledge graph. Figure 7 illustrates a partial visualization of the knowledge graph centered on the core entity Operating System.
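The mapping from triples to graph elements can be sketched by generating Cypher statements. This is an alternative illustration based on `MERGE`, not the bulk loading route used in the paper. The node label `Entity`, the property `name`, and the function names are assumptions, and the sketch only builds query strings and parameter dictionaries without connecting to a database.

```python
import re


def rel_type(name):
    """Turn a relation name into a Cypher relationship type such as IS_PART_OF."""
    cleaned = re.sub(r"\W+", "_", name.strip()).strip("_").upper()
    if not cleaned:
        raise ValueError(f"cannot derive a relationship type from {name!r}")
    return cleaned


def triple_to_cypher(head, relation, tail):
    """Build a parameterized MERGE query for one (head, relation, tail) triple."""
    # Relationship types cannot be supplied as query parameters in Cypher,
    # so the sanitized type is placed in the query text; entity names stay parameters.
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{rel_type(relation)}]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = triple_to_cypher("Operating System", "manages", "Process")
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` keeps each entity as a single node even when it appears in many triples, which is consistent with the graph containing fewer entity nodes than triples.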

4. Conclusions

To address the challenges of information noise, knowledge incompleteness, and model hallucination encountered during the automated construction of knowledge graphs from computer science textbooks, this paper proposes and implements a concept-driven joint extraction method termed CDE. The proposed approach seamlessly integrates large language models with knowledge graph techniques, leveraging conceptual prior guidance, structured schema constraints, and an efficient LoRA-based adaptation strategy to achieve high-quality knowledge extraction and graph construction tailored to the domain of computer science textbooks. This methodology thereby provides a viable pathway for knowledge-driven applications in computer science education.

However, the construction of conceptual priors relies heavily on structured textbook knowledge; for texts with loose chapter organization or those that are not chapter-based, the effectiveness of this method may decline. To address this limitation, future research will introduce a keyword extraction method based on TextRank and combine it with external subject-specific knowledge bases to automatically construct conceptual priors. This approach will reduce reliance on the table of contents and summaries, thereby enhancing the robustness of CDE across various types of textbook texts. Furthermore, we will explore domain-adaptive fine-tuning and zero-shot prompt learning strategies to enhance the model’s extraction capabilities across textbooks in other disciplines using a small number of domain-labeled samples.

Author Contributions

Conceptualization, A.Y. (Aizierguli Yusufu), H.S. and X.Z.; methodology, H.S.; formal analysis, A.A.; investigation, H.S.; data curation, A.Y. (Aizihaierjiang Yusufu); writing—original draft preparation, H.S.; writing—review and editing, J.L.; visualization, A.Y. (Aizihaierjiang Yusufu); supervision, A.Y. (Aizierguli Yusufu); funding acquisition, A.Y. (Aizierguli Yusufu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Xinjiang Uygur Autonomous Region Innovation Environment (Talent and Base) Construction Special Program, Natural Science Program (Special Training for Ethnic Minority Scientific and Technological Talents) (Grant No. 2025D03032); the Youth Top-notch Talent Support Program of Xinjiang Normal University (Grant No. XJNUQB2022-22); the Tender Project of the Engineering Research Center of Smart Education of Xinjiang Normal University (Grant No. XJNU-ZHJY202403); the National Natural Science Foundation of China (Grant No. 61662081); and the National Social Science Foundation of China (Grant No. 14AZD11).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available on request due to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

Liu, J.; Leng, F.; Wu, W.; Bao, Y. A method for constructing textbook knowledge graphs based on multimodality and knowledge distillation. J. Front. Comput. Sci. Technol. 2024, 18, 2901–2911.
Liu, Q.; Li, Y.; Duan, H.; Liu, Y.; Qin, Z. A Survey of Knowledge Graph Construction Techniques. J. Comput. Res. Dev. 2016, 53, 582–600.
Li, Z.; Zhou, D. Research on Conceptual Model and Construction Methods of Educational Knowledge Graph. e-Educ. Res. 2019, 40, 78–86.
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 207–212.
Chan, Y.S.; Roth, D. Exploiting background knowledge for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010); Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 152–160.
Chen, P.; Lu, Y.; Zheng, V.W.; Chen, X.; Yang, B. KnowEdu: A system to construct knowledge graph for education. IEEE Access 2018, 6, 31553–31563.
Zou, X.; Lin, H.; Wu, J.; Zheng, C.; Guan, Q. Constructing a knowledge graph for the database course group via deep learning. In Proceedings of the 2024 13th International Conference on Educational and Information Technology (ICEIT), Chengdu, China, 22–24 March 2024; pp. 334–339.
Li, N.; Shen, Q.; Song, R.; Chi, Y.; Xu, H. MEduKG: A deep-learning-based approach for multi-modal educational knowledge graph construction. Information 2022, 13, 91.
Miwa, M.; Bansal, M. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 1105–1116.
Wei, Z.; Su, J.; Wang, Y.; Tian, Y.; Chang, Y. A novel cascade binary tagging framework for relational triple extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1476–1488.
Wang, Y.; Yu, B.; Zhang, Y.; Liu, T.; Zhu, H.; Sun, L. TPLinker: Single-stage joint extraction of entities and relations through token pair linking. In Proceedings of the 28th International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1572–1582.
Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; Xu, B. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1227–1236.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
Son, J.; Kim, J.; Lim, J.; Lim, H.S. GRASP: Guiding model with RelAtional semantics using prompt for dialogue relation extraction. In Proceedings of the 29th International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 412–423.
Chen, X.; Zhang, N.; Xie, X.; Deng, S.; Yao, Y.; Tan, C.; Huang, F.; Si, L.; Chen, H. KnowPrompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In Proceedings of the ACM Web Conference 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2778–2788.
Lu, Y.; Liu, Q.; Dai, D.; Xiao, X.; Lin, H.; Han, X.; Sun, L.; Wu, H. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5755–5772.
Chia, Y.K.; Bing, L.; Poria, S.; Si, L. RelationPrompt: Leveraging prompts to generate synthetic data for zero-shot relation triplet extraction. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 45–57.
Zhang, B.; Soh, H. Extract, define, canonicalize: An LLM-based framework for knowledge graph construction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9820–9836.
Sun, J.; Qian, S.; Han, Z.; Li, W.; Qian, Z.; Yang, D.; Cao, J.; Xue, G. LKD-KGC: Domain-specific KG construction via LLM-driven knowledge dependency parsing. arXiv 2025, arXiv:2505.24163.
Lu, Y.; Wu, W.; Zhao, X.; Peng, R.; Wang, J. KARMA: Leveraging multi-agent LLMs for automated knowledge graph enrichment. arXiv 2025, arXiv:2502.06472.
Kuculo, T.; Abdollahi, S.; Gottschalk, S. Transformer-Based Architectures Versus Large Language Models in Semantic Event Extraction: Evaluating Strengths and Limitations. Semant. Web 2025, 16, 22104968251363759.
Popovic, N.; Kangen, A.; Schopf, T.; Färber, M. DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations. In Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 298–309.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022; pp. 1–3.
Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388.

Figure 1. Overall framework of CDE formatting.

Figure 2. Concept-driven attention re-weighting.

Figure 3. Schema injection and concept-enhanced structured instruction template.

Figure 4. Schematic illustration of the LoRA mechanism.

Figure 5. Example of the instruction dataset.

Figure 6. Training loss curve of LoRA fine-tuning.

Figure 7. Visualization of the knowledge graph.

Table 1. Excerpt of entity-type schema and relation-type schema.

Textbook Category	Entity Type	Relation Type
Operating Systems	OS Module, Interface, Mechanism, Algorithm, Process, Thread, IPC, and Security	PartOf, Implements, Supports, DependsOn, UsedFor, Allocates, Protects, Schedules, and DefinesAs
Computer Organization and Architecture	Computer System, Bus, Cache, ALU, CPU, Register, Instruction Set, I/O Interface, DMA, and Interrupt	PartOf, IsA, Provides, Implements, DependsOn, Handles, Optimizes, and UsedFor
Software Engineering	Process Model, Requirement, Project Artifact, UML Diagram, Design Principle, and Testing Technique	PartOf, IsA, DefinesAs, Produces, Implements, Verifies, Validates, Uses, Measures, Manages, and Mitigates
Data Structures	Linear List, Stack, Queue, Tree Structure, Graph Structure, Hash Structure, and Search Algorithm	PartOf, IsA, Implements, Supports, BasedOn, MapsTo, HasProperty, Optimizes, Traverses, Orders, and Solves

Table 2. Pre-annotation prompt template based on large language models.

You are a professional information extraction expert responsible for extracting entities and relations from textbooks. Based on the following example, use the predefined entity types and relation types to extract entities and relations within the “Operating Systems” domain from the provided text. The output must be a JSON list, where each JSON object contains the keys: “entity1”, “entity_type1”, “relation”, “entity2”, “entity_type2”.

Entity type set: {entity_schema}

Relation type set: {relation_schema}

Example: “A time-sharing system allows multiple users to interactively use the computer, and the system employs time-slice rotation to ensure timely response.”

Output: {“entity1”: “time-sharing system”, “entity_type1”: “OS Type”, “relation”: “Implements”, “entity2”: “time-slice rotation”, “entity_type2”: “Mechanism”}

Text: {text}
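
As a rough illustration of how a template like the one in Table 2 might be filled and its response checked, consider the sketch below. The `TEMPLATE` string is an abbreviated stand-in rather than the full prompt, and `fill_prompt`, `parse_annotations`, and `REQUIRED_KEYS` are hypothetical names introduced here; only the three placeholders and the five output keys come from the table.

```python
import json

# The five keys that Table 2 requires in every output object.
REQUIRED_KEYS = ("entity1", "entity_type1", "relation", "entity2", "entity_type2")

# Abbreviated stand-in for the Table 2 template (assumption: only the
# placeholders and one literal JSON example are reproduced).
TEMPLATE = (
    "Entity type set: {entity_schema}\n"
    "Relation type set: {relation_schema}\n"
    'Output: {"entity1": "...", "relation": "..."}\n'
    "Text: {text}"
)

def fill_prompt(template, entity_types, relation_types, text):
    """Substitute the three placeholders of the template."""
    # str.replace is used instead of str.format because the template
    # contains literal JSON braces in its example output.
    return (template
            .replace("{entity_schema}", ", ".join(entity_types))
            .replace("{relation_schema}", ", ".join(relation_types))
            .replace("{text}", text))

def parse_annotations(raw):
    """Parse a model response; keep only objects that carry all required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):  # tolerate a single object instead of a list
        data = [data]
    if not isinstance(data, list):
        return []
    return [d for d in data
            if isinstance(d, dict) and all(k in d for k in REQUIRED_KEYS)]

prompt = fill_prompt(TEMPLATE, ["Process", "Mechanism"], ["Implements"],
                     "A process is the basic unit of resource allocation.")
```

Dropping malformed objects rather than repairing them is one possible design choice for pre-annotation, since the records are intended for later manual review.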

Table 3. Comparative experiments of different models (mean ± standard deviation).

Model	Acc	Precision	Recall	F1
BERT-BiLSTM-CRF	0.6812 ± 0.011	0.7021 ± 0.015	0.6487 ± 0.019	0.6743 ± 0.012
CasRel	0.7583 ± 0.009	0.7721 ± 0.010	0.7456 ± 0.013	0.7586 ± 0.008
TPLinker	0.7718 ± 0.007	0.7846 ± 0.009	0.7621 ± 0.011	0.7732 ± 0.007
Qwen3-4B	0.5964 ± 0.021	0.6231 ± 0.023	0.5714 ± 0.028	0.5961 ± 0.019
UIE	0.7425 ± 0.012	0.7553 ± 0.014	0.7311 ± 0.016	0.7430 ± 0.011
EDC	0.7632 ± 0.009	0.7741 ± 0.009	0.7564 ± 0.012	0.7651 ± 0.010
LKD-KGC	0.7931 ± 0.008	0.8005 ± 0.011	0.7867 ± 0.012	0.7936 ± 0.009
CDE	0.8171 ± 0.006	0.8245 ± 0.007	0.8122 ± 0.009	0.8183 ± 0.006

Table 4. Ablation study.

Model	Precision	Recall	F1
CDE	0.8245	0.8122	0.8183
w/o Concept	0.7812	0.7689	0.7750
w/o Schema	0.7524	0.7671	0.7597
w/o LoRA	0.7031	0.6415	0.6709
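
The F1 column of Table 4 can be cross-checked against the precision and recall columns with the standard harmonic-mean formula. The sketch below is only a consistency check on the reported figures; the dictionary layout and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from Table 4.
ablation = {
    "CDE": (0.8245, 0.8122, 0.8183),
    "w/o Concept": (0.7812, 0.7689, 0.7750),
    "w/o Schema": (0.7524, 0.7671, 0.7597),
    "w/o LoRA": (0.7031, 0.6415, 0.6709),
}

full_f1 = ablation["CDE"][2]
report = {}
for name, (p, r, reported_f1) in ablation.items():
    recomputed = round(f1_score(p, r), 4)
    # Drop relative to the full model, in percentage points.
    drop_pp = round((full_f1 - reported_f1) * 100, 2)
    report[name] = (recomputed, drop_pp)
```

Each recomputed value agrees with the reported F1 to four decimal places. Relative to the full model, F1 falls by about 4.33 percentage points without conceptual priors, 5.86 without the schema, and 14.74 without LoRA.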

Table 5. Comparison of extraction results for excerpts from Operating Systems.

Input Text	Baseline Model Output	CDE Model Output
“A process is the basic unit of resource allocation and scheduling in an operating system.”	`{"entity1":"process", "relation":"is", "entity2":"basic unit"}`	`{"knowledge_units":[`
		`{"entity1":"process", "relation":"DefinesAs", "entity2":"the basic unit."},`
		`{"entity1":"Process", "relation":"UsedFor", "entity2":"Resource Allocation"},`
		`{"entity1":"Operating System", "relation":"Schedules", "entity2":"Process"}]}`
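
One way to read Table 5 is to check each output's relation against the Operating Systems relation set in Table 1. The sketch below is a post-hoc check written for illustration; it is not the schema-constrained decoding strategy of the paper, and `split_by_schema` is a hypothetical helper.

```python
import json

# Relation types for the Operating Systems category, from Table 1.
OS_RELATIONS = {
    "PartOf", "Implements", "Supports", "DependsOn", "UsedFor",
    "Allocates", "Protects", "Schedules", "DefinesAs",
}

def split_by_schema(units, allowed_relations):
    """Partition knowledge units by whether their relation is in the schema."""
    valid, invalid = [], []
    for unit in units:
        target = valid if unit.get("relation") in allowed_relations else invalid
        target.append(unit)
    return valid, invalid

# Outputs copied from Table 5.
baseline_units = [json.loads(
    '{"entity1":"process", "relation":"is", "entity2":"basic unit"}'
)]
cde_units = json.loads('''{"knowledge_units":[
  {"entity1":"process", "relation":"DefinesAs", "entity2":"the basic unit."},
  {"entity1":"Process", "relation":"UsedFor", "entity2":"Resource Allocation"},
  {"entity1":"Operating System", "relation":"Schedules", "entity2":"Process"}
]}''')["knowledge_units"]

baseline_valid, baseline_invalid = split_by_schema(baseline_units, OS_RELATIONS)
cde_valid, cde_invalid = split_by_schema(cde_units, OS_RELATIONS)
```

Under this check the baseline relation "is" falls outside the schema, while all three CDE relations (DefinesAs, UsedFor, Schedules) belong to it.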

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

