Article

A KeyBERT-Enhanced Pipeline for Electronic Information Curriculum Knowledge Graphs: Design, Evaluation, and Ontology Alignment

College of Electronic and Information Engineering, Shandong University of Science and Technology, Qingdao 266590, China
* Author to whom correspondence should be addressed.
Information 2025, 16(7), 580; https://doi.org/10.3390/info16070580
Submission received: 17 May 2025 / Revised: 2 July 2025 / Accepted: 4 July 2025 / Published: 6 July 2025

Abstract

This paper proposes a KeyBERT-enhanced method for constructing a knowledge graph of the electronic information curriculum system, aiming to improve the structured representation and relational analysis of educational content. Electronic Information Engineering curricula encompass diverse and rapidly evolving topics, yet existing knowledge graphs often overlook multi-word concepts and more nuanced semantic relationships. Utilizing teaching plans, syllabi, and approximately 500,000 words of course materials from 17 courses, we first extracted 500 knowledge points via the Term Frequency–Inverse Document Frequency (TF-IDF) algorithm to build a baseline course–knowledge matrix, processed it with a Graph Convolutional Network (GCN), and visualized the preliminary graph in Neo4j. We then applied KeyBERT to extract about 1000 knowledge points—approximately 65% of the extracted terms were multi-word phrases—and augmented the graph with co-occurrence and semantic-similarity edges. Comparative experiments demonstrate a ~20% increase in non-zero matrix coverage and a ~40% increase in edge count (from 5100 to 7100), significantly enhancing graph connectivity. A sensitivity analysis of the extraction thresholds (co-occurrence ≥ 5, similarity ≥ 0.7) shows that (5, 0.7) maximizes the F1-score at 0.83. A hyperparameter ablation over n-gram ranges [(1,1), (1,2), (1,3)] and top_n values [5, 10, 15] identifies (1,3) with top_n = 10 as optimal (Precision = 0.86, Recall = 0.81, F1 = 0.83). Finally, downstream GCN tests show that, despite higher sparsity (KeyBERT 64% vs. TF-IDF 40%), KeyBERT features achieve Accuracy = 0.78 and F1 = 0.75, outperforming TF-IDF’s 0.66/0.69. This approach offers a novel, rigorously evaluated solution for optimizing the electronic information curriculum system and can be extended through terminology standardization or larger-scale data integration.

1. Introduction

Knowledge graphs have become an effective tool for structuring course content and revealing latent relationships in education. However, their semantic fidelity depends critically on the precise extraction of knowledge points. While TF-IDF remains the standard method, its reliance on raw term frequency limits its ability to capture context: in specialized, rapidly evolving curricula such as Electronic Information Engineering, TF-IDF frequently misses multi-word concepts and deeper semantic associations, thereby constraining the pedagogical value of the resulting graphs [1]. Thus, automatically identifying multi-word concepts and strengthening semantic relationships remain critical challenges. Recent work has demonstrated automated approaches to building educational knowledge graphs. For example, Ain et al. [2] combined word embeddings with the SIFRank algorithm to automatically extract key concepts from course texts and assemble them into a knowledge graph, providing valuable guidance in the design of our automated pipeline.
Bidirectional Encoder Representations from Transformers (BERT) is built upon the Transformer encoder block, which employs multi-head self-attention and feed-forward layers to capture contextual information from both the left and right of each token. KeyBERT [3] leverages this encoder’s ability by embedding text with BERT and selecting the phrases whose embeddings are most semantically similar to the document representation, thereby retrieving multi-word terms—such as “digital signal processing”—that TF-IDF overlooks. Figure 1 illustrates the Transformer encoder block [4] underpinning BERT and powering KeyBERT’s semantic strength. By leveraging this architecture, our study utilizes KeyBERT to extract semantically rich, domain-specific phrases, thereby enhancing the educational relevance of the resulting knowledge points.
This paper utilizes teaching plans, syllabi, and approximately 500,000 words of textbook text from 17 courses to construct an enhanced knowledge graph. A TF-IDF baseline “course–knowledge point” graph was first constructed and visualized via GCN and Neo4j. We then applied KeyBERT to enhance knowledge-point extraction and augmented the graph with co-occurrence and semantic-similarity links. Comparative experiments show that the KeyBERT-enhanced graph captures more relevant educational concepts and exhibits higher edge connectivity, thereby providing more substantial support for curriculum optimization.
The remainder of this paper is structured as follows. Section 2 reviews related work. Section 3 outlines our methodology, which includes data preprocessing, the TF-IDF baseline, and the KeyBERT enhancement. Section 4 presents experiments and analysis. Section 5 discusses limitations and future work. Section 6 concludes the paper.

2. Related Work

Building an educational knowledge graph faces several core challenges, including poor keyword extraction accuracy, weak graph structural design, and sparse semantic relationships among knowledge points. In particular, many existing curriculum knowledge graphs rely on word-level keywords that lack semantic depth, resulting in an incomplete representation of course content. Additionally, traditional graph visualization and data integration mechanisms are often insufficient for fusing diverse educational resources into a coherent knowledge structure. Researchers have addressed these issues by refining keyword extraction techniques and improving graph connectivity, as discussed below.

2.1. Keyword (Knowledge-Point) Extraction

A major challenge in educational knowledge graph construction is accurately extracting meaningful knowledge points from text. Traditional methods such as TF-IDF, TextRank, and YAKE have been widely used for keyword extraction, but each has significant limitations. TF-IDF is purely frequency-based and ignores context; for example, Li et al. [5] built a multi-source college curriculum knowledge graph using TF-IDF-extracted keywords, illustrating the prevalence of this approach.
However, TF-IDF often misses multi-word terms and nuanced concepts, reducing the semantic richness of the graph. Graph-based ranking algorithms such as TextRank and feature-based methods like YAKE similarly focus on surface statistics (word frequency, co-occurrence, position, etc.) and often fail to capture deeper domain-specific meanings.
To address the limitations of these frequency-based techniques, researchers have turned to BERT-based models that incorporate contextual semantics. KeyBERT, built on BERT’s transformer architecture, can identify multi-word phrases and has demonstrated superior performance over traditional methods. For example, Oveh et al. [6] showed that KeyBERT achieved higher accuracy in extracting biomedical terms compared with TF-IDF and other baselines. The strong semantic representation power of BERT-derived models is well documented: Sentence-BERT [7] produces high-quality sentence embeddings that greatly improve semantic similarity evaluations; additionally, access to high-quality domain datasets can further boost keyword extraction performance [8], and domain-specific extraction techniques have achieved higher precision than generic methods in specialized texts [9].
These findings underscore the necessity of leveraging contextual language models like KeyBERT to improve the extraction of keywords (knowledge points) in educational knowledge graphs.

2.2. Graph Construction and Visualization

Beyond keyword extraction, constructing a well-structured knowledge graph and effectively visualizing it are critical challenges. Many curriculum knowledge graphs have a weak graph structure, partly due to the limited availability of relational data and difficulties in merging content from multiple sources. Efforts have been made to improve data integration—for example, Li et al. [10] proposed CourseKG, an educational knowledge graph built on multi-modal course information (syllabi, textbook metadata, slide decks) that visually represents lesson points and their sequential relationships to support precision teaching.
GCNs have emerged as a powerful tool to strengthen graph structure by learning rich node embeddings that capture hidden relationships [11,12]. In educational contexts, researchers have applied GCNs to curriculum graphs to enhance their connectivity and even dynamically update relationships; for instance, one study used a GCN to optimize an engineering education knowledge graph in real time [13]. Su et al. [14] focused on MOOC platforms and implemented concept recommendation based on a hypergraph GCN, further verifying the applicability of GCNs in educational scenarios.
Another challenge lies in graph visualization and user interaction. Traditional graph visualization mechanisms are often insufficient for exploring complex educational knowledge graphs, necessitating more robust solutions. Graph database platforms, such as Neo4j, have been widely adopted to address this issue.
Neo4j provides an interactive environment for storing, querying, and visualizing knowledge graphs, making it easier for educators and researchers to explore relationships. For example, Canal-Esteve and Gutiérrez [15] utilized Neo4j to convert educational materials into an interactive knowledge graph, thereby facilitating instructional design and data exploration. In addition, Niu et al. [16] evaluated an AI-aided educational platform that utilizes a knowledge graph to semantically organize teaching materials and support both teachers and students, providing empirical evidence for the platform application scenarios considered in our work.
The adoption of such visualization tools allows the integration of diverse data into a single graph and offers intuitive graph navigation, thereby overcoming limitations in traditional static graph representations.

2.3. Relationship Enrichment

The sparsity of semantic relationships among knowledge points is a further obstacle in educational knowledge graphs. Without sufficient interconnections between knowledge points, a graph can fail to capture the full pedagogical context and offer limited instructional value. Researchers have found that enriching a knowledge graph with additional relationships significantly improves its utility and coherence. For instance, Zhang et al. [17] combined curriculum knowledge graphs with GCN-based algorithms to recommend personalized learning paths, highlighting that richer connectivity yields better learning outcomes. A recent survey of educational knowledge graphs also emphasizes that the richness of semantic relationships significantly influences the quality and effectiveness of such graphs [18]. Moreover, Rezayi et al. [19] demonstrated that introducing external text-derived relationships (e.g., co-occurrence and textual semantic similarity links) can markedly enhance a graph’s cohesion and connectivity. Additionally, a series of systematic reviews has highlighted the need to supplement educational knowledge graphs with richer relationships during the construction phase. Abu-Salih and Alotaibi [20] conducted a systematic literature review on the construction of knowledge graphs in the field of education, emphasizing the key role of multi-source fusion and entity standardization in enhancing graph quality.
Following these insights, our work supplements the set of KeyBERT-derived knowledge points with additional co-occurrence and semantic similarity edges, ensuring a more densely connected graph structure that alleviates the problem of sparse relationships.

2.4. Transformer and GCN Applications in Educational Knowledge Graphs

In recent years, with the advancement of deep pre-trained models and graph neural networks, an increasing number of studies have incorporated Transformer architectures and Graph Convolutional Networks (GCNs) into the construction and inference of educational knowledge graphs.

2.4.1. Transformer-Driven Concept Extraction

In a SIGCSE 2023 study, Katyshev et al. [21] applied BERT variants to automatically extract computer science education concepts, significantly improving coverage. Zhao et al. [22] released EDUKG, a K-12 educational knowledge graph with 3.86 billion triples, using RoBERTa for fine-grained entity extraction from textbooks.

2.4.2. GCN-Driven Graph Enhancement and Downstream Tasks

ConceptGCN [23] fuses transformer-extracted key phrases with GCNs for MOOC concept recommendation, outperforming TF-IDF + GCN baselines. Li and Wang [24] model heterogeneous student–exercise interactions via GCNs for knowledge tracing, achieving higher prediction accuracy.

2.4.3. End-to-End Transformer + GCN Trends

The adversarial GCN based on degree information proposed by Wang et al. [25] and the graph embedding method with dual regularizer introduced by Wang et al. [26] provide new ideas for knowledge graph structure optimization, further supporting the Transformer + GCN collaborative framework.
These works demonstrate that feeding Transformer-based semantic representations into GCNs can jointly enhance KG structural quality and downstream task performance.
These advances directly motivate our multi-stage pipeline in Section 3: from KeyBERT-based concept extraction, through threshold sensitivity and error analysis, to IEEE Taxonomy alignment, and finally to GCN-based downstream validation.

2.5. Position of This Study

The novelty of this study lies in proposing a KeyBERT-driven, semantic-enhancement, and Graph Neural Network-supported framework for constructing an Electronic Information curriculum knowledge graph. By integrating KeyBERT-based multi-word keyword extraction, semantic relationship augmentation, GCN-based node embedding, and Neo4j-powered interactive visualization, our approach directly addresses the challenges of low extraction accuracy, weak graph structure, and sparse semantic links inherent in traditional TF-IDF–based pipelines. The resulting knowledge graph provides a richer semantic representation and stronger connectivity, offering a robust tool for curriculum analysis and optimization. A detailed implementation of each component is presented in the following Section 3.

3. Methodology

In contrast to traditional pipelines that rely purely on word frequency and lack deep semantic context, the proposed approach integrates contextual semantic extraction and relationship enrichment to construct a more informative knowledge graph. We outline two workflows: a baseline TF-IDF-based pipeline for comparison, and the proposed KeyBERT-enhanced pipeline supplemented with additional relationship extraction to improve graph connectivity. The following subsections detail each step of our methodology, including data preparation, knowledge-point extraction procedures, and the construction and visualization of the resulting knowledge graph.

3.1. Data Preparation

To ensure comprehensive coverage of the curriculum content, we aggregated textual data from multiple sources for each course. In total, 17 courses in the Electronic Information Engineering program (e.g., Signals and Systems, Digital Signal Processing) were collected from documents such as teaching plans, syllabi, and textbook PDFs. We used a PDF parsing tool to extract raw text from these materials, then saved the content of each course as a plain text file.
This data preparation step aims to standardize and clean the textual data for effective keyword extraction. We employed the open-source Jieba library for Chinese word segmentation and part-of-speech (POS) tagging; the name Jieba means “to stutter” in Chinese, and the tool is one of the most widely used and efficient tokenizers for Chinese text. The process involved tokenizing each sentence into words and annotating each token with a POS tag. Tokens that were not relevant for knowledge points—such as verbs, personal names, place names, numerals, and adverbs—were filtered out based on their POS tags. We also applied a manually curated stop word list to remove common words that carry little semantic information (e.g., frequent function words). Through this preprocessing pipeline, the raw multi-source text for each course was transformed into a clean, segmented corpus of terms, providing a consistent input for subsequent keyword extraction steps.
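For illustration, a minimal Python sketch of this preprocessing step is given below; the POS tag subset and stop word entries shown here are simplified assumptions rather than the exact lists used in our pipeline.

import jieba.posseg as pseg

# POS tags to discard (verbs, person names, place names, numerals, adverbs); illustrative subset
EXCLUDED_POS = {"v", "nr", "ns", "m", "d"}
STOP_WORDS = {"的", "了", "以及", "and", "the"}  # placeholder for the curated stop word list

def preprocess(raw_text):
    """Tokenize course text with Jieba and filter tokens by POS tag and stop words."""
    tokens = []
    for word, pos in pseg.cut(raw_text):
        word = word.strip()
        if not word or word in STOP_WORDS:
            continue
        if any(pos.startswith(tag) for tag in EXCLUDED_POS):
            continue
        tokens.append(word)
    return tokens

# Example: build a clean, segmented corpus per course (course_files is an assumed mapping of name -> path)
# clean_corpus = {name: preprocess(open(path, encoding="utf-8").read()) for name, path in course_files.items()}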

3.2. TF-IDF-Based Construction

This section describes the process of extracting knowledge points from the preprocessed texts of 17 electronic information courses using the TF-IDF method to construct a preliminary knowledge graph, serving as a baseline for subsequent improvements. The objective of the TF-IDF baseline step is to construct an initial knowledge graph using a straightforward, frequency-based method, which will serve as a point of comparison for our enhanced approach.
Initially, knowledge points are extracted from the preprocessed course texts using Scikit-learn’s TfidfVectorizer to compute the TF-IDF value of each term. To ensure the representativeness and computational efficiency of the knowledge points, the parameters are set as max_features = 500 (limiting the maximum number of keywords to 500) and min_df = 2 (requiring a term to appear in at least two documents). High-scoring terms are selected as knowledge points based on their rankings. Subsequently, a “course-knowledge points” matrix is constructed, with rows corresponding to the 17 courses, columns representing the extracted knowledge points, and matrix elements being the TF-IDF scores (set to 0 if a knowledge point does not appear in a course). The matrix is saved as a CSV file for further use.
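A sketch of this step is shown below, assuming course_texts holds the 17 preprocessed course documents as space-separated token strings and course_names holds the corresponding course titles (both variable names are illustrative).

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=500, min_df=2)
tfidf_matrix = vectorizer.fit_transform(course_texts)            # shape: (17, 500)

# Course-knowledge point matrix: rows = courses, columns = knowledge points, values = TF-IDF scores
kp_matrix = pd.DataFrame(tfidf_matrix.toarray(),
                         index=course_names,
                         columns=vectorizer.get_feature_names_out())
kp_matrix.to_csv("tfidf_course_knowledge_matrix.csv", encoding="utf-8")  # file name is illustrative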
Based on this matrix, a simplified GCN processing pipeline is designed. GCN, a neural network operating on graph structures, learns node embeddings through convolution operations involving an adjacency matrix and a feature matrix (see Appendix A for details).
In this study, an adjacency matrix (dimension: 517 × 517) and a feature matrix (dimension: 517 × 500) are constructed to generate node embeddings. Finally, the matrix data is imported into the Neo4j database using py2neo, with courses and knowledge points defined as Course and Knowledge nodes, respectively, and the CONTAINS relationship weighted by TF-IDF scores, enabling visualization of the preliminary graph. This baseline graph provides a reference point, though due to the limitations of TF-IDF it contains mostly isolated course-to-term links and lacks rich inter-knowledge relationships.

3.3. Enhanced Extraction with KeyBERT

While the TF-IDF baseline provides an initial set of knowledge points, it lacks contextual understanding and multi-word expressions. This step aims to overcome those limitations by extracting more semantically rich and domain-relevant knowledge points. We introduce KeyBERT, a BERT-based keyword extraction model, to enhance the semantic depth of the knowledge points.
KeyBERT leverages pre-trained language models to capture semantic information, enabling the extraction of more educationally meaningful phrases (e.g., “digital signal processing”) that better align with the characteristics of electronic information courses compared with TF-IDF’s single-word extraction.
KeyBERT’s core steps involve BERT embedding generation and cosine similarity computation. BERT, utilizing the Transformer model, generates contextually relevant word embeddings through multi-head self-attention mechanisms.
KeyBERT uses BERT to generate embeddings for documents and candidate keywords (Ed, Ek) and computes their relevance via cosine similarity, where Ed is the document vector and Ek is the candidate phrase vector (see Appendix B for details).
To further refine keyword selection, KeyBERT employs Maximal Marginal Relevance (MMR), which strikes a balance between relevance and diversity, ensuring that the selected keywords are not only highly pertinent to the content but also minimally redundant with one another. This approach is similar to the unsupervised extraction method using PLM + part-of-speech patterns proposed by PatternRank [27]. MMR is defined as
\mathrm{MMR} = \arg\max_{k_i \in R \setminus S} \left[ \lambda \, \mathrm{Sim}(k_i, d) - (1 - \lambda) \max_{k_j \in S} \mathrm{Sim}(k_i, k_j) \right]
where R is the set of candidate keywords, S is the set of already selected keywords, d is the document, Sim(·,·) denotes cosine similarity, and λ is the diversity parameter.
In practice, we applied KeyBERT to the same course texts to extract an expanded set of knowledge points. First, to accommodate the model’s input length constraints, the preprocessed text of each course was segmented into 500-word chunks [28] to ensure semantic integrity. Then, KeyBERT (based on the ‘all-MiniLM-L6-v2’ model) extracted keywords from each segment, with n-gram_range = (1, 3) to capture 1- to 3-word phrases, top_n = 10 to select the top 10 highest-scoring phrases per segment, and MMR (diversity parameter 0.5), which directs KeyBERT to select a mix of terms that are relevant yet not too similar to one another, improving coverage of different subtopics in the course. After processing all segments of a course, the extracted candidates were aggregated.
This process yielded a large pool of candidate knowledge points; we then selected approximately the top 1000 unique knowledge points across all courses based on their relevance scores. Notably, about 65% of these selected knowledge points were multi-word phrases, reflecting a substantial increase in semantic granularity compared with the single-word bias of TF-IDF.
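A condensed sketch of this extraction loop is given below; the chunking helper and the max-score aggregation are simplified assumptions intended to mirror the configuration described above.

from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

def extract_course_keyphrases(course_text, chunk_size=500):
    """Split a course text into ~500-word chunks and run KeyBERT with MMR on each chunk."""
    words = course_text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    candidates = {}
    for chunk in chunks:
        for phrase, score in kw_model.extract_keywords(
                chunk,
                keyphrase_ngram_range=(1, 3),   # 1- to 3-word phrases
                top_n=10,                       # top 10 phrases per segment
                use_mmr=True, diversity=0.5):   # MMR with diversity parameter 0.5
            candidates[phrase] = max(candidates.get(phrase, 0.0), score)
    return candidates  # phrase -> best similarity score for this course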

3.4. Enhancement of Knowledge Points Relationships

Even with an improved set of knowledge points, a knowledge graph can remain fragmentary if those points are not well connected to each other. This step aims to enrich the connectivity among knowledge points by extracting additional relationships from the data. In particular, we designed two complementary methods to establish edges between knowledge point nodes: one based on co-occurrence in course documents and another based on the semantic similarity of their content. By adding these relationships, we address the sparse semantic links issue and create a more cohesive graph structure.
The first method, co-occurrence analysis, computes the co-occurrence frequency of knowledge points across courses via matrix transposition and multiplication. Given the “course-knowledge points” matrix M, the co-occurrence matrix is
\mathrm{CoOccur} = M^{T} \times M
where the entry CoOccur[i,j] represents the co-occurrence frequency of knowledge points ki and kj.
If CoOccur[i,j] is greater than a chosen threshold (we set this threshold to 5, meaning the pair appears together in more than five courses), we consider the association significant. This co-occurrence analysis captures explicit associations in the curriculum content.
The second method, semantic similarity analysis, uncovers pairs of knowledge points that are conceptually related even if they do not frequently appear together in the same course: the SentenceTransformer model (based on ‘all-MiniLM-L6-v2’) generates a semantic embedding E_{k_i} for each knowledge point, and the cosine similarity between embedding pairs is computed.
If the similarity exceeds a threshold of 0.7, a connection is established with the weight set as the similarity value, capturing the semantic relevance of knowledge points.
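The two relationship extractors can be sketched as follows; counting co-occurrences on a binarized copy of the course-knowledge point score matrix is our interpretation of the frequency computation, and matrix and kp_names are assumed to hold the score matrix and its column labels.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

binary = (matrix > 0).astype(int)
co_occur = binary.T @ binary                       # CoOccur = M^T M on the 0/1 matrix

edges = []
n = len(kp_names)
for i in range(n):
    for j in range(i + 1, n):
        if co_occur[i, j] > 5:                     # co-occurrence threshold
            edges.append((kp_names[i], kp_names[j], int(co_occur[i, j]), "co_occurrence"))

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(kp_names)                # one embedding per knowledge point
sims = cosine_similarity(embeddings)
for i in range(n):
    for j in range(i + 1, n):
        if sims[i, j] >= 0.7:                      # semantic similarity threshold
            edges.append((kp_names[i], kp_names[j], float(sims[i, j]), "similarity"))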
All extracted relationships from both methods were recorded and labelled for integration into the knowledge graph. In summary, the co-occurrence analysis captures explicit links between knowledge points that frequently appear together in course content, while the semantic similarity analysis adds implicit links between conceptually related knowledge points. By combining these, we greatly increase the density of the graph’s edges. Knowledge points that were previously isolated or only connected via a common course are now often directly interlinked by either a co-occurrence or similarity relationship (or both). This relationship enrichment step produces a list of additional edges that will be added to the graph in the next phase of construction, thereby transforming the knowledge graph into a much more connected network of concepts.

3.5. Graph Construction and Visualization

This section outlines the process of constructing the final knowledge graph based on the enhanced matrix and relationship data (Figure 2), visualized using Neo4j. The graph construction centers on courses and knowledge points as core nodes, with data imported into the Neo4j database using py2neo. The process involves two main steps: First, based on the KeyBERT-generated “course-knowledge point” matrix, course nodes (labelled Course) and knowledge point nodes (labelled Knowledge) are defined as primary entities.
The CONTAINS relationship connects courses to knowledge points, with the edge attribute score recording KeyBERT’s similarity scores. If a relationship reoccurs, the scores are accumulated to preserve complete information. Second, based on the computed co-occurrence and semantic similarity relationships, RELATES_TO relationships are established between knowledge points, with the edge attributes weight (representing co-occurrence frequency or similarity value) and type (distinguishing “co-occurrence” and “similarity”), ensuring the graph structure reflects the intrinsic relationships within the educational content.
During the import process, the graph.merge method is used to merge nodes with identical names, avoiding duplicates—for example, by ensuring a knowledge point appearing in multiple courses is represented by a single node. The graph is initialized by clearing the database (graph.delete_all), followed by the sequential import of course–knowledge point and knowledge point–knowledge point relationships. The final graph is visualized through the Neo4j browser interface, enabling users to query course-knowledge point connections (e.g., MATCH (c:Course)-[r:CONTAINS]->(k:Knowledge)) and inter-knowledge point relationships (e.g., MATCH (k1:Knowledge)-[r:RELATES_TO]->(k2:Knowledge)).
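A minimal py2neo sketch of this import logic follows; for brevity the merging is expressed as Cypher MERGE statements rather than the graph.merge calls described above, the connection credentials are placeholders, and course_kp_scores and edges are assumed iterables produced in the previous steps.

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))   # placeholder credentials
graph.delete_all()                                                    # re-initialize the database

# Course -> Knowledge edges; scores for repeated pairs are accumulated
for course, kp, score in course_kp_scores:
    graph.run(
        "MERGE (c:Course {name: $course}) "
        "MERGE (k:Knowledge {name: $kp}) "
        "MERGE (c)-[r:CONTAINS]->(k) "
        "SET r.score = coalesce(r.score, 0) + $score",
        course=course, kp=kp, score=score)

# Knowledge -> Knowledge edges from the co-occurrence / semantic similarity analysis
for kp1, kp2, weight, rel_type in edges:
    graph.run(
        "MERGE (k1:Knowledge {name: $kp1}) "
        "MERGE (k2:Knowledge {name: $kp2}) "
        "MERGE (k1)-[r:RELATES_TO]->(k2) "
        "SET r.weight = $weight, r.type = $rel_type",
        kp1=kp1, kp2=kp2, weight=weight, rel_type=rel_type)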
Figure 3 summarizes the end-to-end pipeline, including TF-IDF baseline extraction, KeyBERT-based enhancement, co-occurrence and semantic relation mining, and final graph visualization.

4. Experiments

4.1. Experimental Setup

The experimental setup is shown in Table 1.

4.2. Experimental Procedure

4.2.1. Preliminary Graph Construction with TF-IDF

The experiment begins by using the TF-IDF method to extract knowledge points from the preprocessed texts of 17 electronic information courses, generating a preliminary “course-knowledge point” matrix. Using Scikit-learn’s TfidfVectorizer, the maximum feature count is set to 500 and the minimum document frequency to 2, extracting high-scoring terms as knowledge points per course, with the matrix saved as a CSV file. Subsequently, a simplified GCN model processes this matrix, constructing an adjacency matrix (dimension: 517 × 517) that includes 17 courses and 500 knowledge points and generating node embeddings (dimension: 517 × 500).
Finally, the matrix data is imported into the Neo4j database using py2neo, with courses and knowledge points represented as Course and Knowledge nodes, respectively, and the CONTAINS relationship weighted by TF-IDF scores, completing the preliminary graph visualization.

4.2.2. Enhanced Knowledge Points Extraction with KeyBERT

To enhance the semantic depth of knowledge points extraction, the experiment introduces the BERT-based KeyBERT model as a replacement for TF-IDF. Each course text is segmented into 500-word chunks, and KeyBERT (based on the ‘all-MiniLM-L6-v2’ model) extracts 1- to 3-word key phrases, selecting the top 10 keywords per segment and optimizing diversity using the MMR algorithm (diversity parameter 0.5). The top 1000 high-scoring knowledge points are selected to construct an enhanced “course-knowledge point” matrix, with values filled by KeyBERT’s similarity scores, taking the maximum score for phrases appearing multiple times in the same course. The matrix is saved as a separate CSV file, providing data for subsequent relationship computation and graph updates.

4.2.3. Enhancement of Knowledge Points Relationships and Graph Update

Based on the enhanced matrix, the experiment computes co-occurrence and semantic similarity relationships between knowledge points to improve graph connectivity. Co-occurrence relationships are calculated via matrix transposition and multiplication (np.dot(matrix.T, matrix)), establishing a connection if the co-occurrence frequency exceeds a threshold of 5, with the weight set as the frequency. Semantic similarity relationships are computed using the SentenceTransformer model to generate knowledge points embeddings, calculating cosine similarity, and establishing a connection if the similarity exceeds 0.7, with the weight set as the similarity value. These relationships are saved to a file and imported into Neo4j along with the enhanced matrix, adding RELATES_TO relationships (distinguished as “co_occurrence” or “similarity”) to update the graph structure.

4.2.4. Threshold Sensitivity Analysis Procedure

To verify the robustness of our default extraction thresholds, we conducted a systematic grid search on the 600-sample gold set. Specifically, we varied the co-occurrence threshold over {3, 5, 7, 10} and the semantic similarity threshold over {0.6, 0.7, 0.8, 0.9}. For each threshold pair (tco, tsim), we applied the same prediction rule as in Section 4.3 (classifying a course–KP pair as positive if co_occurrence ≥ tco and similarity ≥ tsim) and then computed Precision, Recall, and F1-score against the gold labels. This exhaustive scan allows us to identify not only the optimal thresholds but also the stability of performance across neighboring settings.
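The grid search can be implemented as a simple loop of the following form, assuming the gold set is stored as a table with co_occurrence, similarity, and label columns (the file and column names are assumptions).

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

gold = pd.read_csv("gold_600.csv")                       # hypothetical file name
results = []
for t_co in [3, 5, 7, 10]:
    for t_sim in [0.6, 0.7, 0.8, 0.9]:
        pred = ((gold["co_occurrence"] >= t_co) & (gold["similarity"] >= t_sim)).astype(int)
        p, r, f1, _ = precision_recall_fscore_support(
            gold["label"], pred, average="binary", zero_division=0)
        results.append({"t_co": t_co, "t_sim": t_sim, "precision": p, "recall": r, "f1": f1})
print(pd.DataFrame(results).sort_values("f1", ascending=False).head())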

4.2.5. Error Analysis Procedure

To gain deeper insight into the types and sources of extraction errors, we performed a detailed error analysis on the 600-sample gold set using our default thresholds (co_occurrence ≥ 5, similarity ≥ 0.7). Specifically, we
  • Applied the same prediction rule to label each course–KP pair as positive or negative.
  • Computed the confusion matrix components—True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).
  • Sampled five representative FP cases and five FN cases for manual inspection.
  • Classified each sampled error by its likely cause (e.g., high-frequency noise, multi-word clipping, semantic ambiguity).
This procedure enables us to pinpoint systematic failure modes and guide future refinements.
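A compact sketch of the labelling, confusion matrix, and sampling steps above, with file and column names assumed:

import pandas as pd
from sklearn.metrics import confusion_matrix

gold = pd.read_csv("gold_600.csv")                                   # hypothetical file name
gold["pred"] = ((gold["co_occurrence"] >= 5) & (gold["similarity"] >= 0.7)).astype(int)

tn, fp, fn, tp = confusion_matrix(gold["label"], gold["pred"]).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")

# Sample five false positives and five false negatives for manual inspection
fp_cases = gold[(gold["pred"] == 1) & (gold["label"] == 0)].sample(5, random_state=0)
fn_cases = gold[(gold["pred"] == 0) & (gold["label"] == 1)].sample(5, random_state=0)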

4.2.6. Ontology Alignment Procedure

To further reduce false positives caused by ambiguous or high-frequency terms, we aligned our extractor’s outputs against the IEEE Taxonomy as a controlled vocabulary. An excerpt of the cleaned IEEE Taxonomy term list is shown in Table 2. Specifically, we
  • Merged all positive predictions (co_occurrence ≥ 5 and similarity ≥ 0.7) with their original phrase labels.
  • Loaded the cleaned IEEE Taxonomy (2025) term list.
  • For each predicted phrase, applied a fuzzy-matching step (Levenshtein distance ≤ 2 or cosine similarity ≥ 0.85) to map it to the nearest taxonomy term.
  • Replaced mapped phrases with their standard taxonomy IDs; unmatched phrases were marked “Unknown” and discarded.
  • Recomputed the prediction flag (‘aligned_pred = 1’ if successfully mapped, else 0) and compared it against the 600-sample gold set.
This alignment step addresses systemic FP sources by enforcing a domain-standard vocabulary and sets the stage for the post-alignment evaluation in Section 4.3.7.
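A sketch of the fuzzy-matching step is shown below; choosing the candidate taxonomy term by embedding similarity and then applying either matching criterion is a simplification, and taxonomy_terms and predicted_phrases are assumed input lists.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

model = SentenceTransformer("all-MiniLM-L6-v2")
tax_emb = model.encode(taxonomy_terms)                     # cleaned IEEE Taxonomy term list

def align(phrase):
    """Map a predicted phrase to the nearest taxonomy term, or 'Unknown' if no match."""
    sims = cosine_similarity(model.encode([phrase]), tax_emb)[0]
    best = int(np.argmax(sims))
    if sims[best] >= 0.85 or levenshtein(phrase.lower(), taxonomy_terms[best].lower()) <= 2:
        return taxonomy_terms[best]
    return "Unknown"

aligned = [align(p) for p in predicted_phrases]
aligned_pred = [0 if t == "Unknown" else 1 for t in aligned]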

4.2.7. Hyperparameter Ablation Procedure

To validate our choice of keyphrase extraction parameters, we performed an ablation study on the 17-course corpus by varying ngram_range ∈ [(1,1), (1,2), (1,3)] and top_n ∈ [5, 10, 15].
For each (ngram_range, top_n) combination, we
  • Reran KeyBERT extraction on the full course corpus.
  • Built the corresponding co_occurrence and similarity scores.
  • Applied our default thresholds (co_occurrence ≥ 5, similarity ≥ 0.7) to generate predictions.
  • Computed Precision, Recall, and F1 against the gold labels.
This procedure identifies the parameter settings that maximize extraction performance and justifies our final configuration.
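The ablation can be driven by a loop of the following form; course_chunks (all 500-word chunks across the 17 courses) and the evaluate() helper, which rebuilds the co-occurrence and similarity scores and applies the (5, 0.7) thresholds, are assumed placeholders.

from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    for top_n in [5, 10, 15]:
        candidates = {}
        for chunk in course_chunks:
            for phrase, score in kw_model.extract_keywords(
                    chunk, keyphrase_ngram_range=ngram_range,
                    top_n=top_n, use_mmr=True, diversity=0.5):
                candidates[phrase] = max(candidates.get(phrase, 0.0), score)
        precision, recall, f1 = evaluate(candidates)       # assumed evaluation helper
        print(ngram_range, top_n, round(precision, 2), round(recall, 2), round(f1, 2))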

4.2.8. GCN Sparsity Impact Procedure

To assess whether the increased sparsity of KeyBERT features adversely affects downstream models, we conducted a GCN link-prediction experiment using both TF-IDF and KeyBERT feature matrices:
1. Feature Sparsity Measurement: Compute the non-zero element ratio of each matrix:
\mathrm{Sparsity} = 1 - \frac{\#\mathrm{nonzero}}{\mathrm{total\ elements}}
where “#nonzero” is the number of entries in the matrix whose value is non-zero, and “total elements” is the product of its dimensions (i.e., number of rows × number of columns). A higher sparsity value (closer to 1) indicates that more entries are zero, meaning the feature representation is sparser; conversely, a lower value (closer to 0) indicates a denser matrix with fewer zero entries.
2. Graph Construction: Use the same bipartite course–KP edge set from Section 4.3. Node features are taken from either the TF-IDF matrix or the KeyBERT matrix.
3. GCN Training: Employ a two-layer GCN to perform link prediction. Train on all positive edges plus an equal number of random negative samples for 20 epochs.
4. Evaluation: After training, compute Accuracy and F1-score over a fresh set of negative samples of the same size as the positives.
This experiment quantifies the trade-off between feature sparsity and GCN performance, directly addressing the concern that higher sparsity may degrade graph-based learning.
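As an illustration of steps 1 through 4, the following PyTorch Geometric sketch implements the sparsity measurement and a two-layer GCN link predictor; the hidden dimension, learning rate, and the variables features, node_features, and edge_index are assumptions for illustration rather than the exact configuration used in our experiments.

import numpy as np
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.utils import negative_sampling
from sklearn.metrics import accuracy_score, f1_score

# Step 1: sparsity of a course-knowledge point feature matrix (numpy array)
sparsity = 1.0 - np.count_nonzero(features) / features.size

# Step 2: node features and positive course-KP edges (courses and KPs share one index space)
x = torch.tensor(node_features, dtype=torch.float)           # (num_nodes, feature_dim)
pos_edges = torch.tensor(edge_index, dtype=torch.long)       # shape (2, num_positive_edges)

class LinkGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden=64):                   # hidden size is an assumption
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)

    def encode(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

    def score(self, z, pairs):
        return (z[pairs[0]] * z[pairs[1]]).sum(dim=-1)        # dot-product edge score

model = LinkGCN(x.size(1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Step 3: train on positive edges plus an equal number of random negatives for 20 epochs
for epoch in range(20):
    model.train()
    optimizer.zero_grad()
    z = model.encode(x, pos_edges)
    neg_edges = negative_sampling(pos_edges, num_nodes=x.size(0),
                                  num_neg_samples=pos_edges.size(1))
    logits = torch.cat([model.score(z, pos_edges), model.score(z, neg_edges)])
    labels = torch.cat([torch.ones(pos_edges.size(1)), torch.zeros(neg_edges.size(1))])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    optimizer.step()

# Step 4: evaluate on a fresh negative sample of the same size as the positives
model.eval()
with torch.no_grad():
    z = model.encode(x, pos_edges)
    test_neg = negative_sampling(pos_edges, num_nodes=x.size(0),
                                 num_neg_samples=pos_edges.size(1))
    scores = torch.sigmoid(torch.cat([model.score(z, pos_edges), model.score(z, test_neg)]))
    preds = (scores > 0.5).long()
    truth = torch.cat([torch.ones(pos_edges.size(1)), torch.zeros(test_neg.size(1))]).long()
    print("Accuracy:", accuracy_score(truth, preds), "F1:", f1_score(truth, preds))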

4.3. Results and Analysis

4.3.1. Knowledge Points Extraction

The experiment compares the performance of TF-IDF and KeyBERT in knowledge points extraction. TF-IDF extracts 500 knowledge points from the 17 course texts, constrained by the max_features = 500 parameter, primarily yielding single-word terms such as “signal” and “processing,” which lack the semantic expressiveness of multi-word phrases.
In contrast, KeyBERT, through segmented extraction (500 words per segment) and a maximum limit of 1000 knowledge points, generates approximately 1000 knowledge points, with multi-word phrases (e.g., “digital signal processing,” “electromagnetic field theory”) accounting for about 65%. These phrases better align with the educational content of electronic information courses—for instance, “Fourier transform” more accurately reflects core concepts than the standalone “Fourier.” KeyBERT’s enhanced semantic depth is attributed to BERT’s contextual understanding capabilities, making it more adept at extracting technical terms and educationally relevant phrases, as shown in Figure 4.

4.3.2. Matrix Density and Sparsity Comparison

To evaluate the structural differences between the TF-IDF and KeyBERT-based knowledge-point matrices, we analyzed both the number of non-zero entries and the overall sparsity of each matrix.
As shown in Table 3, the TF-IDF matrix consists of 500 knowledge points and 17 course documents, yielding a total of 5100 non-zero entries, which corresponds to a sparsity of approximately 40.0% (i.e., roughly 40% of matrix positions are unoccupied).
In contrast, the KeyBERT-enhanced matrix expands to 1000 knowledge points while maintaining the same number of courses, and contains 6120 non-zero entries, resulting in a higher sparsity of approximately 64.1% due to the larger matrix size.
Despite the increased sparsity, KeyBERT still generates more absolute non-zero values. This behavior is consistent with findings on distilled Transformer variants such as DistilBERT [29]. For example, in courses such as Digital Signal Processing and Signals and Linear Systems, KeyBERT successfully extracts compound phrases like “spectrum analysis” and “time-domain analysis,” whereas TF-IDF returns only single-word terms such as “spectrum” and “time-domain.”
This improvement in matrix density underscores KeyBERT’s superior capacity for capturing context-rich knowledge points, thereby providing a more comprehensive data foundation for subsequent relationship computation and graph construction.

4.3.3. Graph Connectivity

By incorporating co-occurrence and semantic similarity relationships, the connectivity of the KeyBERT graph is significantly enhanced. The TF-IDF baseline graph contains only CONTAINS relationships between courses and knowledge points, totaling 5100 edges (17 courses × 500 knowledge points with non-zero connections). The KeyBERT graph adds RELATES_TO relationships, with co-occurrence relationships (threshold 5) contributing approximately 800 edges and semantic similarity relationships (threshold 0.7) adding about 1200 edges, increasing the total edge count to approximately 7100—a 39% improvement over the baseline. For instance, “signal processing” and “filter” are connected due to a high co-occurrence frequency (weight approximately 10), while “digital signal” and “Fourier transform” are linked by high similarity (weight 0.85). These additional edges enrich the graph structure, highlighting the logical relationships within the course content.

4.3.4. Visualization Results

The visualization results in Neo4j further validate KeyBERT’s improvements. The TF-IDF graph exhibits unidirectional connections between courses and knowledge points, resulting in a sparse structure that fails to reflect inter-knowledge point relationships. In contrast, the KeyBERT graph forms a multidimensional network by incorporating additional relationships, making the connections between courses and knowledge points, as well as among knowledge points, more explicit.
For example, in Principles of Communication, the knowledge point “modulation” is not only connected to the course node but also linked to “demodulation” via a similarity relationship and to “carrier” through a co-occurrence relationship, forming a local subgraph (as shown in Figure 5).
This structure enhances the graph’s interpretability, allowing users to intuitively analyze the hierarchy and relationships of course content through Neo4j browser queries (e.g., MATCH (c:Course)-[r:CONTAINS]->(k:Knowledge) and MATCH (k1:Knowledge)-[r:RELATES_TO]->(k2:Knowledge)), providing more effective support for instructional design.

4.3.5. Threshold Sensitivity Results

The complete results of the grid search are presented in Table 4. It can be seen that while many threshold combinations yield comparable trade-offs, the setting co-occurrence ≥ 5 and similarity ≥ 0.7 achieves the highest F1 = 0.83 (Precision = 0.86, Recall = 0.81). Moreover, performance around this point remains within ±0.02 F1, indicating a broad “high-plateau” and thus low sensitivity to small threshold perturbations.
To intuitively show the impact of threshold selection on F1, we plotted an F1 heat map of co-occurrence and semantic similarity thresholds (as shown in Figure 6), where the optimal point (5, 0.7) is marked with a cross.
This analysis confirms that our chosen thresholds lie within a stable high-performance region, and small adjustments around (5, 0.7) have minimal impact on extraction quality.

4.3.6. Error Analysis Results

The overall error counts and rates under our default thresholds are summarized in Table 5. We observe that FP occurrences account for only 2.1% of all pairs, while FN account for 4.2%, indicating a conservative bias of our extractor.
Next, we present five typical examples of FP and FN in Table 6 and Table 7, respectively, along with error type and probable cause.
This error analysis reveals that most remaining FPs are general or contextual terms, while FNs often result from threshold clipping or multi-word segmentation issues. These insights will inform our ontology alignment and hyperparameter tuning in subsequent sections.

4.3.7. Ontology Alignment Results

Table 8 compares Precision, Recall, and F1-score before and after applying IEEE Taxonomy alignment. Alignment reduces spurious positives by filtering out non-standard terms, yielding a modest gain in overall extraction quality.
As shown, ontology alignment increases Precision from 0.86 to 0.88 while slightly decreasing Recall from 0.81 to 0.79, with F1 remaining at 0.83. This indicates that mapping to a standard term list effectively filters high-frequency false positives with minimal impact on true positives.

4.3.8. Hyperparameter Ablation Results

The results of our ablation study are summarized in Table 9. We observe that the combination ngram_range = (1,3) and top_n = 10 yields the highest F1 = 0.83 (Precision = 0.86, Recall = 0.81), confirming its optimality.
These findings validate our selection of (1,3)-grams with top_n = 10 for the KeyBERT extraction stage, balancing phrase coverage and noise reduction.

4.3.9. GCN Sparsity Impact Results

The results are shown in Table 10. Although the KeyBERT matrix is substantially sparser (64%) than the TF-IDF matrix (40%), it yields higher link-prediction performance, demonstrating that its richer semantic content more than compensates for sparsity.
This confirms that incorporating semantic richness via KeyBERT enhances GCN performance despite increased sparsity, validating the robustness of our end-to-end pipeline.

5. Discussion

Our multi-stage evaluation of KeyBERT-enhanced knowledge graph construction yields several important insights and directions:

5.1. Semantic Depth vs. Computational Cost

KeyBERT’s contextual embeddings enable the extraction of 65% multi-word concepts—terms like “digital signal processing” and “electromagnetic field theory” that TF-IDF misses—boosting non-zero coverage by 20% and total edges by 39%. This richer semantic depth comes at a cost: segmenting 500-word windows and generating BERT embeddings require GPU acceleration and optimized pipelines. Future work should explore lighter contextualizers or model distillation (e.g., DistilBERT [29], TinyBERT [30]) to reduce runtime without sacrificing extraction quality.

5.2. Graph Enrichment and Curriculum Insight

Augmenting the graph with co-occurrence (≥5) and semantic-similarity (≥0.7) edges transforms it from a flat bipartite structure into a multidimensional network. For example, the RELATES_TO link between “signal processing” and “analog signal” mirrors course syllabi logic, while the co-occurrence link between “filter” and “frequency domain analysis” highlights thematic cohesion. Such enriched connectivity not only supports downstream GCN tasks (Section 4.3.9) but also provides educators with a powerful tool for curriculum mapping and gap analysis.

5.3. Threshold Robustness and Error Patterns

Our sensitivity scan (Section 4.3.5) shows a broad “high-plateau” around (5, 0.7), with F1 varying by only ±0.02 for small threshold shifts—evidence that end users can adopt these defaults across similar curricula without extensive retuning. Error analysis (Section 4.3.6) reveals two dominant failure modes: generic, high-frequency terms cause most false positives, while multi-word phrase clipping and marginal similarity scores drive false negatives. These insights suggest future refinements such as adaptive windowing or learned classifiers to further reduce errors.

5.4. Hyperparameter Validation and Generalizability

The ablation study (Section 4.3.8) confirms that ngram_range = (1,3) with top_n = 10 maximizes F1 = 0.83, balancing concept diversity against noise. The consistency of this result across different course texts indicates our settings will generalize to other STEM domains with minimal adjustment.

5.5. Downstream Graph Learning with Sparse Features

Contrary to the assumption that high sparsity degrades performance, our GCN experiments (Section 4.3.9) show the KeyBERT matrix—though 64% sparse vs. TF-IDF’s 40%—yields higher link-prediction Accuracy (0.78 vs. 0.66) and F1 (0.75 vs. 0.69). This demonstrates that semantic richness can outweigh sparsity, and modern GCNs (e.g., Graph Attention Networks [31]) are robust to sparse yet informative node features.

5.6. Limitations

  • Compute and Scalability. GPU-dependent embedding limits real-time or large-scale deployment.
  • Corpus Scope. Seventeen courses (~500 k words) may omit low-frequency but pedagogically critical terms.
  • Edge Semantics. Current relations (co-occurrence, cosine similarity) capture only two link types; richer relation taxonomies remain unexplored.
  • Interpretability. GCN link scores do not explain why a connection is predicted; attention-based explanations are a future direction.

5.7. Future Directions

  • Interactive, User-in-the-Loop Refinement: Integrate teacher/student feedback to iteratively correct FP/FN and expand or prune taxonomy concepts [32].
  • LLM-Assisted Augmentation: Use large language models to suggest new concept relations and fill sparse regions of the KG [33].
  • Dynamic KG Maintenance: Develop incremental update mechanisms that ingest new course materials (syllabi, slides, transcripts) without full reprocessing [34,35].
  • Cross-Disciplinary Generalization: Apply the pipeline to adjacent STEM domains (e.g., control systems, power electronics) to test transferability.
  • Multimodal Fusion: Incorporate video transcripts, slide images, and code snippets to enrich the KG with non-textual evidence [36].
  • Explainable GCN: Embed interpretability modules (e.g., graph attention) to highlight which concepts and edges drive each prediction.

6. Conclusions

This paper proposes a KeyBERT-based pipeline for constructing an educational knowledge graph of the Electronic Information Engineering curriculum, leveraging teaching plans, syllabi, and 500 k words of course texts. By replacing TF-IDF extraction with KeyBERT, we can capture multi-word phrases and enrich graph structure via co-occurrence and semantic similarity, resulting in ~1000 knowledge points (vs. 500) and a 20–40% increase in connectivity. Comprehensive evaluations—including sensitivity analysis, error diagnostics, ontology alignment, hyperparameter ablation, and GCN downstream tests—demonstrate the robustness and generalizability of our approach.
Despite these advancements, further improvements remain. Future work could map extracted points to standard ontologies (e.g., IEEE taxonomy) to enhance interoperability [37], integrate LLM-assisted graph completion [33], and implement dynamic update mechanisms for incremental maintenance [34,35]. Expanding the dataset (e.g., >50 courses or interdisciplinary texts) and fusing multimodal inputs will further enrich the KG’s coverage. Our method and findings offer both theoretical and practical insights into the application of knowledge graphs in education, laying the groundwork for intelligence-driven curriculum design and personalized pedagogy.

Author Contributions

Methodology, G.Z.; Software, G.Z.; Validation, G.Z.; Writing—original draft, G.Z.; Writing—review & editing, X.L.; Supervision, X.L.; Funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Distinguished Teachers Training Plan Program of Shandong University of Science and Technology (MS20241002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Definitions and Formulas of TF, IDF, and TF-IDF

TF-IDF is a statistical method based on information retrieval to measure the importance of a word in a document. TF-IDF is the product of TF and IDF. TF is the word frequency, which indicates the frequency of occurrence of a word in a document, and its value is calculated by Equation (A1).
\mathrm{TF}(t, d) = \frac{n_{t,d}}{\sum_{w \in d} n_{w,d}}
where TF is denoted as TF(t,d), t and d denote a word and a document, respectively, n_{t,d} denotes the number of times the word t appears in the document d, and \sum_{w \in d} n_{w,d} denotes the total number of words in the document d.
The IDF reflects how often a word occurs in all documents, and is used to filter out the many occurrences that are meaningless. When a word occurs in many documents, its corresponding IDF value should be low. For example, if some conjunctions occur in all documents, although the TF value is high, the IDF value is low, so the product of the two will not be too large. The value of IDF is calculated by Equation (A2).
\mathrm{IDF}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}| + 1}
where IDF is denoted as IDF(t,D), t denotes the word, D denotes the document set, |D| denotes the total number of documents in the document set D, and |\{d \in D : t \in d\}| denotes the number of documents that contain the word t. The addition of 1 avoids a zero denominator.
Finally, the TF-IDF is denoted as TF-IDF(t,d,D), and the TF-IDF value of word t is calculated by Equation (A3).
\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)
For the GCN used in Section 3.2, the core layer-wise propagation rule is as follows:
H^{(l+1)} = \sigma\left(\tilde{A} H^{(l)} W^{(l)}\right)
where \tilde{A} is the normalized adjacency matrix (covering the 17 courses and 500 knowledge points), H^{(l)} is the feature matrix at layer l, W^{(l)} is the layer’s trainable weight matrix, and H^{(l+1)} contains the resulting node embeddings.
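As an illustration, a single propagation step can be computed in numpy as follows; the symmetric renormalization of the adjacency matrix with added self-loops follows Kipf and Welling [11], and the dimensions are toy values.

import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H_next = ReLU(A_norm @ H @ W), with symmetric normalization of A + I."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt          # normalized adjacency matrix
    return np.maximum(A_norm @ H @ W, 0.0)            # sigma chosen as ReLU here

# Toy example: 5 nodes, 4-dimensional features, 3-dimensional embeddings
A = np.random.randint(0, 2, (5, 5))
A = np.triu(A, 1)
A = A + A.T
H = np.random.rand(5, 4)
W = np.random.rand(4, 3)
print(gcn_layer(A, H, W).shape)                       # (5, 3)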

Appendix B. Definitions and Formulas of Attention and Cosine_Similarity

Attention is defined as follows:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
where Q, K, V are the query, key, and value matrices, respectively, and d_k is the dimension of the keys.
Given document vector Ed and candidate phrase vector Ek, their cosine_similarity is defined as follows:
\mathrm{cosine\_similarity}(E_d, E_k) = \frac{E_d \cdot E_k}{\lVert E_d \rVert \, \lVert E_k \rVert}
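For illustration, both quantities can be computed directly in numpy:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))   # numerically stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def cosine_similarity(e_d, e_k):
    """cosine_similarity(E_d, E_k) = (E_d . E_k) / (||E_d|| ||E_k||)."""
    return float(np.dot(e_d, e_k) / (np.linalg.norm(e_d) * np.linalg.norm(e_k)))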

References

  1. Giarelis, N.; Karacapilidis, N. Deep learning and embeddings-based approaches for keyphrase extraction: A literature review. Knowl. Inf. Syst. 2024, 66, 6493–6526. [Google Scholar] [CrossRef]
  2. Ain, Q.U.; Chatti, M.A.; Bakar, K.G.C.; Joarder, S.; Alatrash, R. Automatic Construction of Educational Knowledge Graphs: A Word Embedding-Based Approach. Information 2023, 14, 526. [Google Scholar] [CrossRef]
  3. Issa, B.; Jasser, M.B.; Chua, H.N.; Hamzah, M. A comparative study on embedding models for keyword extraction using keybert method. In Proceedings of the 2023 IEEE 13th International Conference on System Engineering and Technology (ICSET), Shah Alam, Malaysia, 2 October 2023; pp. 40–45. [Google Scholar]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017)), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  5. Li, Z.; Cheng, L.-X.; Zhang, C.-X.; Zhu, X.; Zhao, H.-J. Multi-Source Education Knowledge Graph Construction and Fusion for College Curricula. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10–13 July 2023; pp. 359–363. [Google Scholar]
  6. Oveh, R.O.; Adewunmi, M.; Solomon, A.O.; Christopher, K.Y.; Ezeobi, P.N. Heterogenous analysis of KeyBERT, BERTopic, PyCaret and LDAs methods: P53 in ovarian cancer use case. Microelectron. J. 2024, 10, 100182. [Google Scholar] [CrossRef]
  7. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
  8. Amur, Z.H.; Hooi, Y.K.; Soomro, G.M.; Bhanbhro, H.; Karyem, S.; Sohu, N. Unlocking the Potential of Keyword Extraction: The Need for Access to High-Quality Datasets. Appl. Sci. 2023, 13, 7228. [Google Scholar] [CrossRef]
  9. Meisenbacher, S.; Schopf, T.; Yan, W.; Holl, P.; Matthes, F. An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry. arXiv 2024, arXiv:2407.14085. [Google Scholar] [CrossRef]
  10. Li, Y.; Liang, Y.; Yang, R.; Qiu, J.; Zhang, C.; Zhang, X. CourseKG: An Educational Knowledge Graph Based on Course Information for Precision Teaching. Appl. Sci. 2024, 14, 2710. [Google Scholar] [CrossRef]
  11. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
  12. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  13. Hou, Y.; Liu, B.; Fan, Q.; Zhou, J. Research on the application mode of knowledge graph in education. In Proceedings of the 2023 6th International Conference on Educational Technology Management, Guangzhou, China, 3–5 November 2024; pp. 215–220.
  14. Su, Z.; Li, Y.; Li, Q.; Yan, Z.; Zhao, L.; Liu, Z.; Sun, J.; Liu, S. Hypergraph Convolutional Networks for course recommendation in MOOCs. IEEE Trans. Knowl. Data Eng. 2025, early access.
  15. Canal-Esteve, M.; Gutierrez, Y. Educational Material to Knowledge Graph Conversion: A Methodology to Enhance Digital Education. In Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024), Bangkok, Thailand, 15 August 2024; pp. 85–91.
  16. Niu, S.J.; Luo, J.; Niemi, H.; Li, X.; Lu, Y. Teachers’ and Students’ Views of Using an AI-Aided Educational Platform for Supporting Teaching and Learning at Chinese Schools. Educ. Sci. 2022, 12, 858.
  17. Zhang, X.; Liu, S.; Wang, H. Personalized Learning Path Recommendation for E-Learning Based on Knowledge Graph and Graph Convolutional Network. Int. J. Softw. Eng. Knowl. Eng. 2023, 33, 109–131.
  18. Qu, K.; Li, K.C.; Wong, B.T.M.; Wu, M.M.F.; Liu, M. A Survey of Knowledge Graph Approaches and Applications in Education. Electronics 2024, 13, 2537.
  19. Rezayi, S.; Zhao, H.; Kim, S.; Rossi, R.A.; Li, S. Edge: Enriching Knowledge Graph Embeddings with External Text. arXiv 2021, arXiv:2104.04909.
  20. Abu-Salih, B.; Alotaibi, S. A systematic literature review of knowledge graph construction and application in education. Heliyon 2024, 10, e25383.
  21. Katyshev, A.; Anikin, A.; Sychev, O. Using Transformer Models for Knowledge Graph Construction in Computer Science Education. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 2, Toronto, ON, Canada, 15–18 March 2023; p. 1421.
  22. Zhao, B.; Sun, J.; Xu, B.; Lu, X.; Li, Y.; Yu, J.; Liu, M.; Zhang, T.; Chen, Q.; Li, H. EDUKG: A heterogeneous sustainable k-12 educational knowledge graph. arXiv 2022, arXiv:2210.12228.
  23. Alatrash, R.; Chatti, M.A.; Ain, Q.U.; Fang, Y.; Joarder, S.; Siepmann, C. ConceptGCN: Knowledge concept recommendation in MOOCs based on knowledge graph convolutional networks and SBERT. Comput. Educ. Artif. Intell. 2024, 6, 100193.
  24. Li, L.; Wang, Z. Knowledge relation rank enhanced heterogeneous learning interaction modeling for neural graph forgetting knowledge tracing. PLoS ONE 2023, 18, e0295808.
  25. Wang, H.; Wang, Y.; Li, J.; Luo, T. Degree aware based adversarial graph convolutional networks for entity alignment in heterogeneous knowledge graph. Neurocomputing 2022, 487, 99–109.
  26. Wang, J.; Zhang, Z.; Shi, Z.; Cai, J.; Ji, S.; Wu, F. Duality-induced regularizer for semantic matching knowledge graph embeddings. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1652–1667.
  27. Schopf, T.; Klimek, S.; Matthes, F. PatternRank: Leveraging pretrained language models and part of speech for unsupervised keyphrase extraction. arXiv 2022, arXiv:2210.05245.
  28. Wu, D.; Ahmad, W.U.; Chang, K.-W. Pre-trained language models for keyphrase generation: A thorough empirical study. arXiv 2022, arXiv:2212.10233.
  29. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
  30. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. arXiv 2019, arXiv:1909.10351.
  31. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
  32. Linxen, A.; Endel, F.; Opel, S.; Beecks, C. Knowledge Graphs for Competency-Based Education. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2942–2945.
  33. Abu-Rasheed, H.; Jumbo, C.; Amin, R.A.; Weber, C.; Wiese, V.; Obermaisser, R.; Fathi, M. LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations. arXiv 2025, arXiv:2501.12300.
  34. Dong, C.; Yuan, Y.; Chen, K.; Cheng, S.; Wen, C. How to Build an Adaptive AI Tutor for Any Course Using Knowledge Graph-Enhanced Retrieval-Augmented Generation (KG-RAG). In Proceedings of the 2025 14th International Conference on Educational and Information Technology (ICEIT), Online, 14–16 March 2025; pp. 152–157.
  35. Jain, M.; Kaur, H.; Gupta, B.; Gera, J.; Kalra, V. Incremental learning algorithm for dynamic evolution of domain specific vocabulary with its stability and plasticity analysis. Sci. Rep. 2025, 15, 272.
  36. Li, L.; Wang, Z. Knowledge Graph-Enhanced Intelligent Tutoring System Based on Exercise Representativeness and Informativeness. Int. J. Intell. Syst. 2023, 2023, 2578286.
  37. Hu, S.; Wang, X. FOKE: A Personalized and Explainable Education Framework Integrating Foundation Models, Knowledge Graphs, and Prompt Engineering. In Proceedings of the China National Conference on Big Data and Social Computing, Harbin, China, 8–10 August 2024; Springer Nature: Singapore, 2024.
Figure 1. Transformer Encoder Block Architecture. This diagram illustrates the core components of the BERT encoder—multi-head self-attention (Q/K/V projection and scaled dot-product attention) followed by position-wise feed-forward layers—and shows how parallel attention heads capture diverse contextual features.
Figure 2. Graph Construction and Visualization.
Figure 3. Overall workflow of the knowledge graph construction process.
Figure 4. Comparison of TF-IDF vs. KeyBERT outputs on a sample course sentence. Lower: TF-IDF selects high-frequency single tokens (e.g., “signals”, “systems”), missing multi-word concepts. Upper: KeyBERT identifies semantically cohesive phrases such as “linear time-invariant signals” and “signal analysis”.
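To make the contrast in Figure 4 concrete, the sketch below runs both extractors on two illustrative sentences. The sentences and the `all-MiniLM-L6-v2` embedding model are assumptions chosen for demonstration, not taken from the course corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from keybert import KeyBERT

# Illustrative sentences (not taken from the course corpus).
docs = [
    "Linear time-invariant signals and systems are studied through signal analysis.",
    "Digital circuits implement logic functions with gates and flip-flops.",
]

# TF-IDF ranks single tokens, so multi-word concepts are split apart.
vec = TfidfVectorizer(stop_words="english")
weights = vec.fit_transform(docs).toarray()[0]
print(sorted(zip(vec.get_feature_names_out(), weights), key=lambda t: -t[1])[:4])

# KeyBERT scores whole candidate phrases against the sentence embedding,
# so cohesive multi-word concepts survive as single keyphrases.
kw = KeyBERT(model="all-MiniLM-L6-v2")
print(kw.extract_keywords(docs[0], keyphrase_ngram_range=(1, 3),
                          stop_words="english", top_n=4))
```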
Figure 5. Neo4j Graph Visualization of Course-Knowledge Point Relationships.
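For readers reproducing the Figure 5 view, a minimal ingestion sketch with the official `neo4j` Python driver is shown below. The connection details, node labels (`Course`, `KnowledgePoint`), and the `COVERS` relationship type are illustrative assumptions rather than the paper's exact schema.

```python
from neo4j import GraphDatabase

# Connection details are placeholders; point them at your own Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical (course, knowledge point, weight) triples from the extraction step.
triples = [
    ("Signals and Systems", "fourier transform", 0.82),
    ("Digital Signal Processing", "digital filter design", 0.79),
]

cypher = (
    "MERGE (c:Course {name: $course}) "
    "MERGE (k:KnowledgePoint {name: $kp}) "
    "MERGE (c)-[r:COVERS]->(k) SET r.weight = $weight"
)

with driver.session() as session:
    for course, kp, weight in triples:
        # MERGE keeps nodes unique across repeated ingestion runs.
        session.run(cypher, course=course, kp=kp, weight=weight)

driver.close()
```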
Figure 6. F1 Sensitivity Analysis. Heatmap of F1-scores over co-occurrence thresholds {3, 5, 7, 10} and semantic similarity thresholds {0.6, 0.7, 0.8, 0.9}. The optimal setting (5, 0.7) is marked with an “×”.
Table 1. Experimental Setup.
Element/Parameter | Configuration
Python Version | Python 3.8
CPU | Intel Core i7-12700
GPU | NVIDIA GeForce RTX 3060 (12 GB VRAM)
KeyBERT | Version 0.7.0
SentenceTransformer | Version 2.2.2
Neo4j Version | 4.4.9 Community Edition
KeyBERT Parameters | n-gram_range = (1,3), top_n = 10, use_mmr = True, diversity = 0.5
TF-IDF Parameters | max_features = 500, min_df = 2
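The parameter rows of Table 1 translate directly into extractor configuration. The following is a minimal sketch, assuming per-course text strings are already available; the `course_texts` list and the `all-MiniLM-L6-v2` checkpoint are illustrative placeholders, since Table 1 pins only library versions and parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Hypothetical input: one cleaned text string per course (17 courses in the paper).
course_texts = ["...course 1 text...", "...course 2 text..."]

# TF-IDF baseline configuration from Table 1.
tfidf = TfidfVectorizer(max_features=500, min_df=2)
course_kp_matrix = tfidf.fit_transform(course_texts)  # rows: courses, columns: terms

# KeyBERT configuration from Table 1; the specific sentence-transformer checkpoint
# is an assumption, as Table 1 pins only the library versions.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=encoder)
keyphrases_per_course = [
    kw_model.extract_keywords(
        text, keyphrase_ngram_range=(1, 3), top_n=10, use_mmr=True, diversity=0.5
    )
    for text in course_texts
]
```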
Table 2. Excerpt from cleaned_ieee_taxonomy-2025.txt (Electronics and Information Domain).
Level | Taxonomy Term
1 | Aerospace and electronic systems
2 | Aerospace electronics
2 | Electronic warfare
1 | Antennas and propagation
2 | Antennas
2 | Electromagnetic propagation
1 | Circuits and systems
2 | Analog circuits
2 | Digital circuits
1 | Communications technology
2 | Digital communication
2 | Wireless communication (5G/6G)
1 | Signal processing
2 | Digital signal processing
2 | Image processing
Table 3. Matrix Density Comparison between TF-IDF and KeyBERT.
Matrix Type | Knowledge Points | Non-Zero Entries | Sparsity (%)
TF-IDF Baseline | 500 | 5100 | 40.0
KeyBERT Enhanced | 1000 | 6120 | 64.1
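The sparsity values in Table 3 follow from the matrix dimensions: assuming the 17 courses stated in the Abstract, the TF-IDF matrix has 17 × 500 = 8500 cells and the KeyBERT matrix 17 × 1000 = 17,000, with sparsity = 1 − non-zero/total. A quick check:

```python
def sparsity(non_zero: int, courses: int, knowledge_points: int) -> float:
    """Percentage of zero entries in the course-knowledge matrix."""
    total = courses * knowledge_points
    return 100.0 * (1 - non_zero / total)

print(sparsity(5100, 17, 500))   # TF-IDF baseline  -> 40.0
print(sparsity(6120, 17, 1000))  # KeyBERT enhanced -> 64.0, close to the reported 64.1
```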
Table 4. Sensitivity Analysis of Extraction Thresholds.
Co-occurrence Threshold | Similarity Threshold | Precision | Recall | F1
3 | 0.6 | 0.77 | 0.74 | 0.75
3 | 0.7 | 0.82 | 0.69 | 0.75
3 | 0.8 | 0.71 | 0.74 | 0.72
3 | 0.9 | 0.67 | 0.52 | 0.59
5 | 0.6 | 0.83 | 0.76 | 0.79
5 | 0.7 | 0.86 | 0.81 | 0.83
5 | 0.8 | 0.72 | 0.76 | 0.74
5 | 0.9 | 0.69 | 0.66 | 0.67
7 | 0.6 | 0.84 | 0.72 | 0.78
7 | 0.7 | 0.79 | 0.74 | 0.76
7 | 0.8 | 0.71 | 0.56 | 0.63
7 | 0.9 | 0.68 | 0.51 | 0.58
10 | 0.6 | 0.66 | 0.44 | 0.53
10 | 0.7 | 0.71 | 0.67 | 0.69
10 | 0.8 | 0.62 | 0.51 | 0.56
10 | 0.9 | 0.59 | 0.49 | 0.54
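The grid in Table 4 can be reproduced by sweeping both thresholds over the candidate edge set and scoring the retained edges against a gold standard. A minimal sketch, with hypothetical `candidate_edges` and `gold_edges` standing in for the paper's data:

```python
from itertools import product

def f1_for_thresholds(candidate_edges, gold_edges, min_cooc, min_sim):
    """Keep edges meeting both thresholds, then score against the gold edge set."""
    kept = {(a, b) for a, b, cooc, sim in candidate_edges
            if cooc >= min_cooc and sim >= min_sim}
    tp = len(kept & gold_edges)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(gold_edges) if gold_edges else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical toy data; the paper sweeps {3, 5, 7, 10} x {0.6, 0.7, 0.8, 0.9}.
candidate_edges = [("dsp", "fourier transform", 6, 0.83),
                   ("analog circuits", "resistance", 8, 0.68)]
gold_edges = {("dsp", "fourier transform")}
for cooc, sim in product([3, 5, 7, 10], [0.6, 0.7, 0.8, 0.9]):
    print(cooc, sim, round(f1_for_thresholds(candidate_edges, gold_edges, cooc, sim), 2))
```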
Table 5. Error Counts and Rates.
Category | Count | Rate (%)
TP | 490 | 6.9
FP | 60 | 0.8
FN | 110 | 1.5
TN | 6480 | 90.8
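For reference, the error categories in Table 5 relate to the reported precision, recall, and F1 through the standard definitions:

$$
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
$$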
Table 6. Typical False Positive (FP) Examples.
Course | Extracted Knowledge Point | Total Number of Occurrences | Similarity | Error Type | Possible Cause
Digital Signal Processing | Filter | 6 | 0.72 | FP | Although "filter" is a high-frequency term, the course focuses on filtering at the signal-processing algorithm level, not general filter principles.
Analog Circuits | Resistance | 8 | 0.68 | FP | "Resistance" is a basic concept that appears frequently, but the course emphasizes amplifier design and has no dedicated chapter on resistor details.
Communication Principles | Network | 7 | 0.75 | FP | "Network" has high semantic similarity, but the course does not cover network-layer protocols, making the concept too broad.
Microprocessor Principles | Cache | 5 | 0.70 | FP | "Cache" appears in examples, but the course does not delve into cache architecture, so the term acts as a contextual distractor.
Digital Image Processing | Color Space | 9 | 0.65 | FP | "Color space" is mentioned in the textbook but is not a core knowledge point of the course, so it is prone to being selected by mistake.
Table 7. Typical False Negative (FN) Examples.
Course | Missed Knowledge Point | Total Number of Occurrences | Similarity | Error Type | Possible Cause
Digital Signal Processing | Fourier Transform | 4 | 0.83 | FN | Co-occurrence < 5, so the term is filtered by the threshold; the concept is important but occurs too infrequently in the segments.
Analog Circuits | Operational Amplifier | 3 | 0.65 | FN | Multi-word expressions are truncated; "operational amplifier" falls outside the n-gram window after segmentation.
Communication Principles | Modulation | 8 | 0.68 | FN | Similarity 0.68 is slightly below the 0.7 threshold; the contextual semantic boundaries of the term are unclear.
Microprocessor Principles | Discrete Cosine Transform | 10 | 0.60 | FN | Although the co-occurrence frequency is high, similarity 0.60 is too low; matching long and short phrases is challenging.
Digital Image Processing | Interrupt Service Routine | 7 | 0.75 | FN | "Interrupt Service Routine" is a phrase of three or more words; MMR retains only the first two parts of the phrase.
Table 8. Ontology Alignment Results.
Setting | Precision | Recall | F1
Before alignment | 0.86 | 0.81 | 0.83
After alignment | 0.88 | 0.79 | 0.83
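A minimal sketch of the alignment step summarized in Table 8 matches extracted keyphrases to IEEE taxonomy terms (Table 2) by embedding similarity. The `all-MiniLM-L6-v2` checkpoint, the example strings, and the 0.7 acceptance threshold are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical inputs: extracted keyphrases and taxonomy terms as in Table 2.
keyphrases = ["digital filter design", "wireless channel coding"]
taxonomy_terms = ["Digital signal processing", "Wireless communication (5G/6G)",
                  "Analog circuits"]

kp_emb = encoder.encode(keyphrases, convert_to_tensor=True)
tax_emb = encoder.encode(taxonomy_terms, convert_to_tensor=True)
scores = util.cos_sim(kp_emb, tax_emb)  # shape: keyphrases x taxonomy terms

for i, kp in enumerate(keyphrases):
    best = int(scores[i].argmax())
    best_score = float(scores[i][best])
    if best_score >= 0.7:  # acceptance threshold (assumed)
        print(f"{kp} -> {taxonomy_terms[best]} ({best_score:.2f})")
    else:
        print(f"{kp} -> no alignment above threshold ({best_score:.2f})")
```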
Table 9. Hyperparameter Ablation Results.
n-gram Range | top_n | Precision | Recall | F1
(1,1) | 5 | 0.78 | 0.69 | 0.73
(1,1) | 10 | 0.80 | 0.71 | 0.75
(1,1) | 15 | 0.79 | 0.75 | 0.77
(1,2) | 5 | 0.81 | 0.72 | 0.76
(1,2) | 10 | 0.84 | 0.76 | 0.80
(1,2) | 15 | 0.83 | 0.73 | 0.78
(1,3) | 5 | 0.79 | 0.74 | 0.76
(1,3) | 10 | 0.86 | 0.81 | 0.83
(1,3) | 15 | 0.81 | 0.78 | 0.79
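The ablation in Table 9 amounts to a nested sweep over `keyphrase_ngram_range` and `top_n`, re-running extraction and re-scoring each configuration. A skeleton of that loop, where `evaluate` is a hypothetical stand-in for the paper's precision/recall/F1 scoring:

```python
from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")
course_texts = ["...course text..."]  # hypothetical corpus

def evaluate(extracted):
    """Placeholder: compare extracted keyphrases against a gold annotation."""
    return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

results = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    for top_n in [5, 10, 15]:
        extracted = [
            kw_model.extract_keywords(text, keyphrase_ngram_range=ngram_range,
                                      top_n=top_n, use_mmr=True, diversity=0.5)
            for text in course_texts
        ]
        results[(ngram_range, top_n)] = evaluate(extracted)
```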
Table 10. GCN Sparsity Impact Results.
Feature | Sparsity (%) | GCN Accuracy | GCN F1-Score
TF-IDF | 40 | 0.66 | 0.69
KeyBERT | 64 | 0.78 | 0.75
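The downstream test in Table 10 feeds either feature matrix into a GCN classifier. A compact two-layer GCN sketch in plain PyTorch is given below; the symmetric adjacency normalization, hidden size, and toy graph/labels are assumptions, as the paper does not publish its exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adjacency(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by standard GCNs."""
    a_hat = adj + torch.eye(adj.size(0))
    deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor, a_norm: torch.Tensor) -> torch.Tensor:
        h = F.relu(a_norm @ self.w1(x))  # first graph convolution
        return a_norm @ self.w2(h)       # logits per node

# Toy example: 20 nodes, 500-dimensional TF-IDF-style features, 4 classes (all hypothetical).
adj = (torch.rand(20, 20) > 0.8).float()
adj = ((adj + adj.T) > 0).float()
a_norm = normalize_adjacency(adj)
x = torch.rand(20, 500)
y = torch.randint(0, 4, (20,))

model = TwoLayerGCN(500, 64, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(50):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x, a_norm), y)
    loss.backward()
    optimizer.step()
```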
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
