Knowledge graph construction covers both the schema layer and the data layer. The schema layer makes the relationships between knowledge concepts more logical and organizes them into a relatively complete system. Therefore, the intelligent construction of the schema layer based on domain knowledge is the first step in knowledge graph construction. The data layer is the core component of the knowledge graph: it stores the specific information of all entities, relations, and attributes. The quality of the data layer construction directly affects the completeness and accuracy of the knowledge graph. Constructing the data layer requires extracting structured knowledge from a large amount of textual data. In this paper, this process includes named entity recognition and relation extraction. Next, knowledge fusion is applied to remove redundant knowledge. Finally, the knowledge is stored in a graph database.
4.2. Named Entity Recognition
In this work, the BiLSTM-CRF model is extended to identify entities within the subject domain. Residual blocks from the ResNet network [38] are introduced to improve BiLSTM-CRF entity recognition, because a residual block can strengthen the representational capacity of sequence features. A lightweight residual block is adopted with two 3 × 1 1D convolutional layers (kernel size
, padding
, stride
), where the number of filters is consistent with the BiLSTM hidden layer dimension. Unlike simply stacking more BiLSTM layers, which often leads to vanishing gradients and overfitting in deep networks, the residual block introduces shortcut connections to preserve gradient flow, enabling more efficient capture of local contextual patterns between adjacent words. The text data of high school mathematics are mapped into fixed-dimensional word vectors. BiLSTM and ResNet serve as encoders to extract features. Finally, the CRF layer decodes each sentence by sequence labeling. The improved BiLSTM-CRF model with an added residual block is displayed in Figure 2a, and an example illustrating the model is shown in Figure 2b.
To explain the structure of the improved BiLSTM-CRF model for named entity recognition, the five main parts of Figure 2a are elaborated as follows:
(1) Word embedding
Word embedding converts text into vectors and is the foundational step of named entity recognition. As shown in Figure 2a, the word embedding layer converts the input text into vector form. Denote the sentence text by $x = (x_1, x_2, \ldots, x_T)$ and the vector by $e = (e_1, e_2, \ldots, e_T)$, where $x_T$ and $e_T$ represent the T-th character in the sentence and its vector, respectively. The word embedding is then used to capture features from the domain knowledge text.
(2) BiLSTM layer
Usually, BiLSTM is used to capture contextual information when extracting features of domain knowledge text. Each LSTM unit regulates the text through three types of gates: the forget gate, the input gate, and the output gate. Suppose the words are translated into vectors. At time step t, denote the input word vector by $e_t$, the cell state by $c_t$, the hidden layer state by $h_t$, and the forget gate, input gate, and output gate by $f_t$, $i_t$, and $o_t$, respectively. The internal diagram of the LSTM is shown in Figure 3a.
Then the value of the forget gate is
$$f_t = \sigma(W_f [h_{t-1}, e_t] + b_f).$$
The degree of forgetting information is controlled by the sigmoid function $\sigma$ in the forget gate, whose output $f_t$ lies between 0 and 1. The value of the input gate is
$$i_t = \sigma(W_i [h_{t-1}, e_t] + b_i),$$
and the value of the temporary cell state is
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, e_t] + b_c).$$
The value of the current cell state can then be obtained as
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
where the input values of the hidden state and cell state at the previous moment $t-1$ are $h_{t-1}$ and $c_{t-1}$, respectively, and $W_f$, $W_i$, $W_c$ are the weight matrices of the forget gate, the input gate, and the current cell state, respectively. By the above formulas and the previous method in reference [27], the value of the output gate is
$$o_t = \sigma(W_o [h_{t-1}, e_t] + b_o).$$
Taking the values $o_t$ and $c_t$ as inputs, a sequence of hidden layer states $(h_1, h_2, \ldots, h_T)$ on the given sentence is obtained by the formula $h_t = o_t \odot \tanh(c_t)$. That is, the BiLSTM layer outputs the hidden states of the given sentence.
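As a rough illustration of the gate computations described above, the following minimal sketch performs one LSTM time step in plain Python. The standard concatenated-input form is assumed, and the names `lstm_step`, `affine`, and the `params` dictionary keys are hypothetical and chosen only for this example; the all-zero weights are toy values, not trained parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(e_t, h_prev, c_prev, params):
    """One LSTM time step on the concatenated input [h_prev, e_t] (assumed form)."""
    z = h_prev + e_t  # list concatenation, not addition
    f = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]          # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]          # input gate
    c_tilde = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]  # temporary cell state
    o = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]          # output gate
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Toy setup (illustrative only): hidden size 2, embedding size 2, zero weights.
hidden, embed = 2, 2
zeros = [[0.0] * (hidden + embed) for _ in range(hidden)]
params = {name: zeros for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})
h1, c1 = lstm_step([1.0, -1.0], [0.0, 0.0], [1.0, 2.0], params)
```

With zero weights every gate evaluates to 0.5 and the temporary cell state to 0, so the new cell state is simply half of the previous one, which makes the role of the forget gate easy to check by hand.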
The word vectors from the embedding layer are input to the BiLSTM layer, which extracts the contextual information of each character. Since the BiLSTM model consists of two LSTMs running in opposite directions, denote the forward and backward state vectors at the t-th time step by $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, respectively, where $\overrightarrow{h}_t$ is computed from the input word vector and $\overrightarrow{h}_{t-1}$, and $\overleftarrow{h}_t$ is computed from the input word vector and $\overleftarrow{h}_{t+1}$. Then, the final output hidden states of the BiLSTM layer are the concatenations $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
Although the BiLSTM layer effectively captures long-range contextual dependencies in sequence data, it has limitations in capturing fine-grained local information and in training deep networks. In subject-domain texts, a large number of professional terms consist of consecutive adjacent words, and the accurate recognition of such entities relies heavily on semantic features within local windows, which BiLSTM models insufficiently. To alleviate this limitation, this paper introduces a lightweight residual block from the ResNet network, which also eases training by learning residuals, to enhance local feature extraction. In this design, its core role is not to solve the vanishing gradient problem of deep networks through residual connections, but to capture local semantic correlations between adjacent words via convolution operations and to extract structural features of continuous word combinations, adapting to the entity expression patterns of subject terminology. Meanwhile, batch normalization and shortcut connections in the residual module stabilize the training of local convolutional features, help avoid overfitting under small-sample domain data, and improve the generalization performance of entity recognition.
Before formally analyzing the complementary relationship between BiLSTM and ResNet, we first clarify the connotation of “local contextual information” in subject domain named entity recognition and the inherent limitations of alternative local feature extraction methods. In domain-specific NER tasks, local contextual information specifically refers to the semantic association and boundary discriminative features between consecutive words within multi-word technical terms. For example, in high school mathematics texts, terms such as “quadratic equation with one unknown”, “eccentricity of an ellipse”, and “probability mass function of binomial distribution” are all composed of 3–5 consecutive words. The accurate recognition of these entities does not mainly depend on the global context of the entire sentence, but on the tight semantic combination relationship between adjacent words in the local window.
TextCNN, as a traditional local feature extraction method, uses fixed-size convolution kernels and lacks a built-in training stabilization mechanism, making it prone to overfitting on small-scale, highly specific domain datasets. The results of the ablation study in Section 5.2 support this limitation: adding only convolutional layers to the baseline BiLSTM-CRF model resulted in little improvement in F1 score, indicating that pure convolutional operations cannot effectively extract discriminative local features under small-sample conditions. Dilated convolutions can expand the receptive field without increasing computational complexity, but they do so by skipping intermediate words, which makes the local features sparse. This characteristic is particularly harmful to the recognition of densely packed short domain terms, as it may lose the key semantic connection between adjacent words that defines the entity boundary. In contrast, the lightweight ResNet block integrates convolutional layers, batch normalization, and residual connections. The convolutional layers extract hierarchical local n-gram features, batch normalization reduces internal covariate shift and thereby helps limit overfitting, and the residual connection preserves the original feature information to stabilize training, making the block a well-suited local feature extractor for subject domain NER tasks.
The BiLSTM and ResNet are connected serially with residual shortcut connections. The BiLSTM first captures global long-distance contextual features, and its output hidden states are then fed into the ResNet residual block to extract local fine-grained features. The residual shortcut preserves global information while enhancing local feature learning, forming an integrated feature extractor before the final CRF layer. This post-positioned design of ResNet after BiLSTM is tailored to subject domain text traits. Our design lets BiLSTM encode global contextual semantics to resolve the polysemy of domain terms, e.g., “axis” has distinct meanings in geometry and function scenarios. The ResNet block then mines fine-grained local features on context-aware hidden states, which sharply boosts the recognition of multi-word technical terms like “quadratic equation with one unknown” and “eccentricity of an ellipse”.
(3) Residual block
The ResNet is composed of residual blocks. Each block consists of two convolutional layers with batch normalization and preserves the input information through a shortcut connection. The core structure of the residual block is the equation $H(h) = F(h) + h$, where $F(h)$ is defined as the residual mapping, and $h$ is the output of the previous layer. In this structure, the input h is carried to the output of the block through the shortcut connection. If $H(h)$ is the optimal mapping function to be learned, the residual relative to the input is $F(h) = H(h) - h$.
The convolution operation within the residual block can effectively capture local patterns and fine-grained features. It applies a sliding window to a local area of the input feature matrix, which captures the local contextual relationships between words, extracts features between adjacent words, and helps identify common structures in consecutive words. By stacking convolutional layers, ResNet can capture local information at different scales, thereby improving its capability to express details. Meanwhile, batch normalization makes network training more stable and accelerates convergence. The shortcut connection in the residual block lets gradients flow from the output to the input without hindrance, thus ensuring effective gradient propagation in deep networks.
The BiLSTM-ResNet outputs a feature matrix M, which is fed into a linear layer and transformed into the output matrix P required by the CRF. Assuming there are k output labels, the output dimension of the linear layer is k. Thus, the score matrix P of the linear layer has dimension $T \times k$, where T is the sentence length.
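The residual computation can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions rather than the configuration used in the experiments: kernel size 3 with zero padding 1 and stride 1 is assumed so that the sequence length is preserved for the shortcut addition, batch normalization is omitted for brevity, and a final ReLU follows the usual ResNet convention. The function names `conv1d_same` and `residual_block` are hypothetical.

```python
def conv1d_same(seq, weight, bias):
    """1D convolution over time: kernel size 3, stride 1, zero padding 1 (assumed).

    seq:    T x C_in list of feature vectors
    weight: C_out x C_in x 3 kernel
    bias:   C_out offsets
    """
    c_in = len(seq[0])
    padded = [[0.0] * c_in] + seq + [[0.0] * c_in]
    out = []
    for t in range(len(seq)):
        row = []
        for o, kernel in enumerate(weight):
            acc = bias[o]
            for c in range(c_in):
                for k in range(3):
                    acc += kernel[c][k] * padded[t + k][c]
            row.append(acc)
        out.append(row)
    return out

def relu(seq):
    return [[max(0.0, v) for v in row] for row in seq]

def residual_block(h, w1, b1, w2, b2):
    """Compute relu(F(h) + h) with F = conv -> relu -> conv (batch norm omitted)."""
    f = conv1d_same(relu(conv1d_same(h, w1, b1)), w2, b2)
    return relu([[a + b for a, b in zip(f_row, h_row)] for f_row, h_row in zip(f, h)])

# Toy input: 3 time steps, 2 channels; zero kernels make F(h) = 0.
h = [[1.0, 2.0], [0.5, 0.0], [3.0, 1.0]]
zero_w = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]
zero_b = [0.0, 0.0]
out = residual_block(h, zero_w, zero_b, zero_w, zero_b)
```

When the residual mapping contributes nothing, the shortcut alone carries the input through the block unchanged, which is the information-preserving behavior the text attributes to the shortcut connection.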
(4) CRF layer
A CRF layer is added after the BiLSTM-ResNet part to correct the identified label sequence using the extracted text features. Generally, CRF is used to model the dependencies between labels and perform sequence labeling in a global scope. CRF defines a transition matrix to represent the transition scores between labels and uses the Viterbi algorithm to find the globally optimal sequence. Denote the label sequence by $y = (y_1, y_2, \ldots, y_T)$. The final score function s of the CRF layer is determined by the input sequence e and the label sequence y,
$$s(e, y) = \sum_{t=1}^{T} P_{t, y_t} + \sum_{t=1}^{T-1} A_{y_t, y_{t+1}},$$
where $P_{t, y_t}$ is the score for the t-th token mapped to the label $y_t$, and $A_{y_t, y_{t+1}}$ represents the transition score from $y_t$ to $y_{t+1}$. The CRF layer obtains the final scoring function s by combining the output matrix P from the previous layer with the transition matrix A, and outputs the optimal label sequence. Thus, in the CRF layer, the score function s is used to determine the conditional probability of a specific label sequence given an input sequence. Based on these probabilities, a log-likelihood function is obtained for training. Finally, the Viterbi algorithm is applied to decode the output sequence.
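To make the scoring and decoding steps concrete, the sketch below computes the sequence score from an emission matrix and a transition matrix and decodes with the Viterbi algorithm. It assumes the score is the sum of per-token emission scores and adjacent-label transition scores, without special start or end labels. The toy matrices and the label order (0 = O, 1 = B, 2 = I) are illustrative values, not model outputs.

```python
from itertools import product

def sequence_score(P, A, y):
    """Score of label sequence y: emissions P[t][y_t] plus transitions A[y_t][y_{t+1}]."""
    emit = sum(P[t][y_t] for t, y_t in enumerate(y))
    trans = sum(A[a][b] for a, b in zip(y, y[1:]))
    return emit + trans

def viterbi(P, A):
    """Return the highest-scoring label sequence under sequence_score."""
    k = len(P[0])
    best = list(P[0])  # best[j]: best score of any path ending in label j
    back = []          # back[t-1][j]: previous label on that best path
    for t in range(1, len(P)):
        prev = [max(range(k), key=lambda i: best[i] + A[i][j]) for j in range(k)]
        best = [best[prev[j]] + A[prev[j]][j] + P[t][j] for j in range(k)]
        back.append(prev)
    y = [max(range(k), key=lambda j: best[j])]
    for prev in reversed(back):
        y.append(prev[y[-1]])
    return y[::-1]

# Toy scores: 3 tokens, 3 labels; the transition O -> I is heavily penalized.
P = [[1.0, 2.0, 0.0], [0.5, 0.2, 1.5], [2.0, 0.1, 0.3]]
A = [[0.0, 0.0, -5.0], [0.0, -1.0, 1.0], [0.0, 0.0, 0.5]]
path = viterbi(P, A)
brute_force = max(product(range(3), repeat=3), key=lambda y: sequence_score(P, A, y))
```

Comparing `path` with the exhaustive `brute_force` search is a cheap sanity check on small inputs; the dynamic program reaches the same sequence without enumerating every labeling.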
(5) Example for obtaining named entity recognition
A specific example is given to illustrate the processes in Figure 2a and Figure 3a. First, the input sequence is set, and its characters are represented in vector form. Then, the weight matrices $W_f$, $W_i$, $W_c$, $W_o$ and bias vectors $b_f$, $b_i$, $b_c$, $b_o$ are defined. Taking a fixed hidden state dimension and label size as an example, these parameters and the formulas given above are used to calculate the forward hidden state vectors $\overrightarrow{h}_1$, $\overrightarrow{h}_2$, $\overrightarrow{h}_3$. The calculation example of $\overrightarrow{h}_1$ is shown in Figure 3b. The weight matrices used for forward and backward calculations are different, but the same calculation method can also be used to calculate the backward hidden state vectors $\overleftarrow{h}_1$, $\overleftarrow{h}_2$, $\overleftarrow{h}_3$, ultimately obtaining $h_1$, $h_2$, $h_3$. (The specific calculation example is not reiterated here.) Finally, the output h of the BiLSTM is processed through the residual block and the linear layer to obtain the score matrix P.
In more detail, taking the input word vector $e_1$, the initial hidden state $h_0$, and the initial cell state $c_0$ as inputs, we calculate the forget gate $f_1$, input gate $i_1$, output gate $o_1$, temporary cell state $\tilde{c}_1$, final cell state $c_1$, and hidden state $h_1$ using the LSTM gate formulas. This calculated hidden state $h_1$ (along with the corresponding backward LSTM hidden state) forms the output of the BiLSTM layer, which is then fed into the subsequent residual block. The residual block performs convolution operations on this sequence of hidden states to capture the local contextual relationships between adjacent tokens. This numerical example illustrates how the BiLSTM layer first extracts global contextual features, which are then refined by the residual block to capture fine-grained local patterns. This is the core design principle that enables our BiLSTM-ResNet-CRF model to outperform the traditional BiLSTM-CRF model in recognizing multi-word mathematical entities. As Figure 2b shows, the output results may violate naming conventions. To address this issue, the CRF layer is inserted before the output so that label sequences violating naming conventions are corrected.
The errors in the high school mathematics test sets mainly include three types: entity boundary detection errors, entity type classification errors, and mixed boundary-type errors. First, boundary errors arise from the prevalence of compound terms formed by combining multiple basic mathematical concepts with ambiguous semantic boundaries between them. For example, one may identify “quadratic equation with one unknown” as a complete entity while omitting “root-finding formula” in the phrase “root-finding formula for quadratic equations with one unknown”, or incorrectly split “eccentricity of an ellipse” into two separate entities when it should form a single geometric attribute entity. Second, type errors are caused by the semantic overlap of certain terms across different knowledge modules and insufficient annotated samples for low-frequency entities. For example, “permutation and combination” may be misclassified as a “numbers and algebra” entity instead of the correct “probability and statistics” category, and “mapping” may be misclassified as “preparatory knowledge” instead of “numbers and algebra”. Third, mixed errors result from the combined effects of these two factors. For example, for the phrase “probability mass function of a binomial distribution”, one may not only incorrectly split it into two entities but also misclassify “probability mass function” as “Preparatory Knowledge”.
4.3. Theoretical Rationale for BiLSTM-ResNet-CRF Integration
The superior performance of our proposed BiLSTM-ResNet-CRF model in subject domain NER stems from the complementary strengths of BiLSTM and ResNet components, which together form a multi-scale feature extraction framework that addresses the limitations of standalone BiLSTM-CRF models. A formal theoretical analysis of this complementary relationship is provided below.
(1) Gradient flow preservation via residual connections
A fundamental limitation of deep neural networks is the vanishing gradient problem, which becomes particularly severe when training on small-scale domain datasets. The residual block introduces an identity shortcut connection that allows gradients to flow directly from the output layer to earlier layers during backpropagation. Formally, writing the block output as $H(h) = h + F(h)$, the gradient of the loss function L with respect to the input h of a residual block is
$$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial H}\left(1 + \frac{\partial F(h)}{\partial h}\right),$$
where $F(h)$ is the residual mapping. Even when $\partial F(h) / \partial h$ approaches 0 (vanishing gradient), the factor in parentheses stays close to 1 because of the identity term, so the upstream gradient is passed on rather than vanishing. This property ensures that our model can learn discriminative features from limited subject domain data without performance degradation, which is critical for domains where high-quality labeled data are scarce.
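The effect of the identity term can be checked numerically on a scalar toy case. The sketch below assumes a hypothetical residual mapping F(h) = w · tanh(h), chosen only because its derivative is easy to write down; it compares the analytic derivative of the block output with a finite-difference estimate as w shrinks toward zero.

```python
import math

def residual_output(h, w):
    """H(h) = h + F(h) with a toy residual mapping F(h) = w * tanh(h)."""
    return h + w * math.tanh(h)

def residual_grad(h, w):
    """Analytic dH/dh = 1 + dF/dh, where dF/dh = w * (1 - tanh(h)^2)."""
    return 1.0 + w * (1.0 - math.tanh(h) ** 2)

def numeric_grad(f, h, eps=1e-6):
    """Central finite difference, used only to cross-check the analytic form."""
    return (f(h + eps) - f(h - eps)) / (2 * eps)

h = 0.7
for w in (0.5, 1e-3, 0.0):
    analytic = residual_grad(h, w)
    numeric = numeric_grad(lambda v: residual_output(v, w), h)
    print(f"w={w}: dH/dh analytic={analytic:.6f}, numeric={numeric:.6f}")
```

As w goes to zero the derivative of the residual mapping vanishes, yet the derivative of the whole block settles at 1, so whatever gradient arrives at the block output is handed back to its input.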
(2) Multi-scale local feature extraction via convolutional layers
BiLSTM models excel at capturing long-range sequential dependencies but are inefficient at extracting fine-grained local contextual features. The convolutional layers in ResNet blocks apply sliding windows of size I to the input feature matrix, enabling the model to explicitly model adjacent word dependencies at different granularities. For a sequence of length T, the output feature map of a convolutional layer with kernel size I, padding P, and stride L has length
$$T_{\mathrm{out}} = \left\lfloor \frac{T + 2P - I}{L} \right\rfloor + 1.$$
By stacking multiple convolutional layers with different kernel sizes ( and are used in the implementation), this model captures hierarchical local patterns ranging from bigram and trigram features to phrase-level structures. This is particularly valuable for recognizing multi-word technical terms in subject domains, where accurate boundary detection depends on identifying local semantic patterns between consecutive words.
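The standard output-length relation for a 1D convolution can be evaluated directly. The helper below is a small sketch using the symbols T, I, P, and L from the text; the function name and the specific argument values in the examples are illustrative assumptions.

```python
def conv_output_length(T, I, P, L):
    """Length of a 1D convolution output: floor((T + 2P - I) / L) + 1."""
    if T + 2 * P < I:
        raise ValueError("kernel is longer than the padded input")
    return (T + 2 * P - I) // L + 1

same_length = conv_output_length(T=10, I=3, P=1, L=1)   # kernel 3, padding 1, stride 1
no_padding = conv_output_length(T=10, I=3, P=0, L=1)    # sequence shrinks by I - 1
strided = conv_output_length(T=10, I=3, P=1, L=2)       # stride 2 roughly halves the length
```

A length-preserving setting matters for a residual block, because the shortcut addition requires the convolution output and the block input to have the same number of time steps.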
(3) Training stabilization via batch normalization
The batch normalization (BN) operation included in each residual block normalizes the activations of the previous layer to have zero mean and unit variance:
$$\hat{z} = \frac{z - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$
where $z$ is an activation, $\mu_B$ and $\sigma_B^2$ are the mean and variance of the mini-batch B, and $\epsilon$ is a small constant for numerical stability. BN reduces internal covariate shift, accelerates convergence, and helps reduce overfitting through the small amount of noise that mini-batch statistics add to the activations. In our experiments, we observed that BN reduced the training time by 23% and improved the model’s generalization ability on the small-scale high school mathematics dataset.
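The normalization step itself is short enough to write out. The following sketch normalizes one feature over a mini-batch using the batch mean and the (biased) batch variance; the learnable scale and shift parameters that usually follow are omitted, and the value of `eps` is an assumed default rather than a setting reported in the paper.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize a list of activations to roughly zero mean and unit variance."""
    mean = sum(batch) / len(batch)
    var = sum((z - mean) ** 2 for z in batch) / len(batch)
    return [(z - mean) / math.sqrt(var + eps) for z in batch]

activations = [1.0, 2.0, 3.0, 4.0]  # toy mini-batch of one feature
normalized = batch_norm(activations)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((z - norm_mean) ** 2 for z in normalized) / len(normalized)
```

Because the statistics are recomputed for every mini-batch, the same activation is normalized slightly differently from batch to batch, which is the source of the mild noise mentioned above.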
BiLSTM provides global contextual understanding of the entire sentence, while ResNet blocks extract fine-grained local features critical for multi-word term recognition. Their combination creates a unified framework that simultaneously models both global semantic coherence and local term structure, which is essential for accurate subject domain NER.
4.5. Knowledge Fusion
Because the information comes from different sources, knowledge extracted from the unstructured data inevitably contains a certain amount of redundancy. As a result, entity alignment is essential to optimize the knowledge graph. The cosine similarity is used to eliminate a large number of duplicate or conflicting entities, and a two-layer entity alignment method is adopted for knowledge fusion; the pseudocode of the knowledge fusion algorithm is shown in Algorithm 1. It should be emphasized that the proposed two-layer entity alignment strategy has sequential and complementary logic without functional overlap. The first-layer alignment based on Word2Vec focuses on coarse-grained duplicate removal by calculating the similarity of entity surface names, which can quickly filter out obviously repeated entities and reduce the computational cost of subsequent processing. The second-layer alignment based on Doc2Vec aims at fine-grained semantic disambiguation by using entity description information, so as to resolve entity ambiguity and homonymy that cannot be distinguished by literal features alone. The two layers cooperate in a progressive manner to ensure the accuracy and efficiency of knowledge fusion. As shown in Algorithm 1, the algorithm has three main parts, which are described as follows.
Algorithm 1: Knowledge Fusion
In part one, the set E of all entities is obtained from the set of triples S. Next, the word vector model is used to obtain the vector representations of the entities, giving the set of entity vectors V. Then, the cosine similarity between entity pairs is calculated. Based on the cosine similarity matrix, entity pairs with similarity greater than the threshold are merged, which performs the first entity alignment. The selection of the similarity threshold refers to the research of Ijebu et al., who set a similarity threshold of 0.6 to determine whether problem pairs are repeated [40]. Based on this empirical threshold, the similarity threshold is set to 0.6, and the new set of processed entities is obtained. However, directly applying a threshold from the general duplicate problem pair detection domain without targeted validation on our educational dataset is a limitation. High school mathematics entities have distinct characteristics compared with general natural language entities: (1) high standardization of terminology, with most concepts having unified formal definitions across textbooks; (2) limited and well-defined formal aliases, with rare semantic ambiguity; (3) clear hierarchical relationships between concepts, which reduces the complexity of entity disambiguation. To rigorously validate the appropriateness of the 0.6 threshold for our specific dataset, we conducted a systematic threshold sensitivity analysis for both layers of the entity alignment method. We varied the threshold from 0.4 to 0.8 in steps of 0.1 (0.4, 0.5, 0.6, 0.7, 0.8) and performed independent threshold tests for the first-layer Word2Vec-based coarse-grained alignment and the second-layer Doc2Vec-based fine-grained alignment. For each threshold combination, we evaluated four core metrics that comprehensively reflect the quality of the knowledge graph: (1) entity redundancy rate: the proportion of remaining duplicate or synonymous entities after alignment; (2) knowledge graph completeness: the proportion of valid unique entities retained without over-merging; (3) triple accuracy: the semantic correctness rate of knowledge triples after entity merging; and (4) downstream QA F1-score: the F1-score of a rule-based knowledge question answering system built on the constructed knowledge graph. See Section 5.3 for details.
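The threshold-based merging step can be illustrated as follows. This is a minimal sketch rather than Algorithm 1 itself: the two-dimensional vectors are made-up stand-ins for Word2Vec embeddings, the function name `align` is hypothetical, and merging transitively through a small union-find structure is an assumption of this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def align(vectors, threshold=0.6):
    """Map each entity name to a canonical name; pairs above the threshold are merged."""
    names = list(vectors)
    canonical = {name: name for name in names}

    def find(name):
        while canonical[name] != name:
            name = canonical[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) > threshold:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    canonical[root_b] = root_a  # keep the earlier entity as canonical
    return {name: find(name) for name in names}

# Toy vectors (illustrative only, not real embeddings).
vectors = {
    "ellipse": [1.0, 0.1],
    "the ellipse": [0.9, 0.2],
    "parabola": [0.1, 1.0],
}
mapping = align(vectors, threshold=0.6)
```

The same routine could serve the second layer by passing description vectors instead of name vectors, and a sensitivity analysis amounts to calling it with each threshold in the tested range and recomputing the evaluation metrics.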
In part two, the processed entities are looked up in a knowledge base, which must contain both entities and entity descriptions. In this paper, Baidu Encyclopedia is selected as the knowledge base. If an entity is found in the knowledge base, its detailed information is extracted to supplement the original data as the entity description. These descriptions form the entity description set obtained from Baidu Encyclopedia. If an entity is not found in the knowledge base, the ERNIE Bot large language model (https://yiyan.baidu.com/) is used to generate an entity description for reference. It should be emphasized that ERNIE Bot is only used to generate descriptions for a small proportion of entities that cannot be linked to the knowledge base. All such generated descriptions are strictly checked manually by our research team members to ensure 100% accuracy of factual content and conceptual definitions, so as to avoid any possible hallucination or unreliable information. Since entity description generation is a standardized factual task rather than innovative creation, the probability of hallucination in this scenario is expected to be very low. The generated descriptions form a second entity description set. Combining the two sets gives the full entity description set D.
In part three, the Doc2Vec model is used to represent entity descriptions as vectors, and the generated vectors are used as entity vectors for the second entity alignment. The cosine similarity is again used to remove redundant entities and obtain the final entity set.
In the merging process, if the similar entities $e_i$ and $e_j$ are originally related, the relation between these two entities can be ignored. Suppose $e_j$ is replaced with $e_i$ and $e_j$ is related to another entity $e_k$. If only $e_j$ is related to $e_k$, this relationship is preserved as a relationship between $e_i$ and $e_k$; if both $e_i$ and $e_j$ are related to $e_k$, the original relation between $e_i$ and $e_k$ is kept, and the relation between the original $e_j$ and $e_k$ is also kept. After that, the final triples can be obtained.
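The relation-handling rules for a merge can be sketched as a small function over triples. The function name `merge_entity`, the relation labels, and the example triples are hypothetical and serve only to illustrate the rules stated above; exact duplicate triples produced by the redirection are dropped, which is an assumption of this sketch.

```python
def merge_entity(triples, keep, drop):
    """Replace entity `drop` with `keep` and return the de-duplicated triples."""
    merged = []
    for head, relation, tail in triples:
        if {head, tail} == {keep, drop}:
            continue  # a relation between the two merged entities is ignored
        head = keep if head == drop else head
        tail = keep if tail == drop else tail
        if (head, relation, tail) not in merged:
            merged.append((head, relation, tail))
    return merged

# Toy triples (illustrative only).
triples = [
    ("ellipse", "same_as", "ellipse (curve)"),
    ("ellipse (curve)", "has_attribute", "eccentricity"),
    ("ellipse", "belongs_to", "conic sections"),
    ("ellipse (curve)", "studied_in", "conic sections"),
]
final_triples = merge_entity(triples, keep="ellipse", drop="ellipse (curve)")
```

In this toy case the link between the two merged entities disappears, the relation held only by the dropped entity is redirected to the kept one, and both distinct relations to the shared neighbor survive.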
4.6. Integration of Core Knowledge Graph Construction Modules
The seamless integration among the entity recognition, relationship extraction, and knowledge fusion modules is essential for the system’s overall performance and reliability. This subsection elaborates on the sequential dependencies, data format specifications, and error handling mechanisms that govern inter-module communication.
The entity recognition module outputs a structured JSON object for each input sentence, containing the following fields: (1) entity text, (2) entity type, e.g., “preparatory knowledge”, “numbers and algebra”, “geometry”, (3) start and end character offsets in the original text, (4) model-generated confidence score (ranging from 0 to 1), and (5) a 768-dimensional semantic embedding vector generated by the final hidden layer of the BERT model. This structured output is passed directly to the relationship extraction module without further transformation, ensuring that all contextual and semantic information is preserved.
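A record with the five listed fields might look like the sketch below. The field names (`text`, `type`, `start`, `end`, `confidence`, `embedding`), the exclusive end offset, and the example values are all assumptions for illustration; the paper specifies the content of the fields but not their exact keys.

```python
import json

def make_entity_record(text, entity_type, start, end, confidence, embedding):
    """Build one entity record; key names are hypothetical, not the paper's schema."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if end - start != len(text):
        raise ValueError("offsets do not match the entity text length")
    return {
        "text": text,
        "type": entity_type,
        "start": start,          # character offset of the first character
        "end": end,              # exclusive end offset (assumed convention)
        "confidence": confidence,
        "embedding": list(embedding),
    }

sentence = "The eccentricity of an ellipse is less than 1."
entity = "eccentricity of an ellipse"
start = sentence.index(entity)
record = make_entity_record(entity, "geometry", start, start + len(entity),
                            0.93, [0.0] * 768)  # placeholder score and vector
payload = json.dumps(record)
```

Validating offsets and score ranges when the record is created keeps malformed entries from reaching the relation extraction stage.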
The relationship extraction module takes as input the complete sentence embedding and all pairs of recognized entities within the same sentence. For each entity pair, it generates a set of candidate relations with associated confidence scores. The module outputs a list of candidate triples, each consisting of the head entity embedding, tail entity embedding, relation type, and combined confidence score, calculated as the product of the entity recognition confidence scores and the relation classification confidence score. Only triples with a combined confidence score above 0.5 are retained for further processing, a threshold chosen to eliminate obviously spurious results while minimizing the loss of potentially valuable information.
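The confidence filter can be written compactly. The sketch below reads the combined score as the product of the head entity score, the tail entity score, and the relation classification score, which is one plausible interpretation of the description; the dictionary keys and candidate values are hypothetical.

```python
def combined_confidence(head_conf, tail_conf, relation_conf):
    """Product of the two entity scores and the relation score (assumed reading)."""
    return head_conf * tail_conf * relation_conf

def filter_triples(candidates, threshold=0.5):
    """Keep candidates whose combined confidence is above the threshold."""
    kept = []
    for cand in candidates:
        score = combined_confidence(cand["head_conf"], cand["tail_conf"],
                                    cand["relation_conf"])
        if score > threshold:
            kept.append({**cand, "confidence": score})
    return kept

# Toy candidates (illustrative scores only).
candidates = [
    {"head": "ellipse", "relation": "has_attribute", "tail": "eccentricity",
     "head_conf": 0.9, "tail_conf": 0.9, "relation_conf": 0.8},
    {"head": "ellipse", "relation": "subtype_of", "tail": "function",
     "head_conf": 0.9, "tail_conf": 0.6, "relation_conf": 0.7},
]
kept = filter_triples(candidates)
```

Because the score is a product of values in [0, 1], a single weak component is enough to pull a candidate below the threshold, so the filter is stricter than any of the individual scores suggests.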
The knowledge fusion module receives the filtered candidate triples and processes them in two stages. First, entity alignment is performed to map equivalent entities from different sources to a single canonical representation. This step uses the cosine similarity between entity embeddings and semantic descriptions retrieved from external knowledge bases. Second, aligned entities are merged, and their associated relations are consolidated to eliminate duplicates and resolve contradictions. The final fused triples are then stored in the graph database, with all original confidence scores retained for traceability.
Error handling is implemented at each module boundary. If a module fails to process an input, e.g., due to malformed data or out-of-vocabulary entities, the input is logged and skipped rather than causing the entire pipeline to fail. Additionally, a post-processing step identifies and removes triples that violate domain-specific logical constraints, e.g., a geometric figure cannot be a subtype of an algebraic concept, further improving the quality of the final KG.
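The log-and-skip policy and the constraint check can be sketched as follows. The helper names (`run_stage`, `violates_constraints`, `postprocess`), the logger name, and the single forbidden pattern are hypothetical; the actual constraint set used in the system is not listed in the text.

```python
import logging

logger = logging.getLogger("kg_pipeline")

def run_stage(stage, inputs):
    """Apply `stage` to each input; log and skip inputs that raise."""
    outputs = []
    for item in inputs:
        try:
            outputs.append(stage(item))
        except Exception as exc:  # broad on purpose: one bad input must not stop the run
            logger.warning("skipping %r: %s", item, exc)
    return outputs

def violates_constraints(triple, entity_types, forbidden):
    """True if (head type, relation, tail type) matches a forbidden pattern."""
    head, relation, tail = triple
    return (entity_types.get(head), relation, entity_types.get(tail)) in forbidden

def postprocess(triples, entity_types, forbidden):
    return [t for t in triples if not violates_constraints(t, entity_types, forbidden)]

# Toy data: the second input is malformed and is skipped instead of aborting the run.
cleaned = run_stage(lambda s: s["text"].strip(),
                    [{"text": " ellipse "}, {"txt": "malformed"}, {"text": "parabola"}])

entity_types = {"triangle": "geometry", "polygon": "geometry",
                "polynomial": "numbers and algebra"}
forbidden = {("geometry", "subtype_of", "numbers and algebra")}
valid = postprocess([("triangle", "subtype_of", "polynomial"),
                     ("triangle", "subtype_of", "polygon")],
                    entity_types, forbidden)
```

Expressing the constraints as type-level patterns keeps the check independent of individual entities, so new rules can be added without touching the pipeline code.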