Behavior-Rule Inference Based on Hyponymy–Hypernymy Knowledge Tree

Huanlai Zhou; Jianyu Guo; Haitao Jia; Kaishi Wang; Lei Guo; Long Qi

doi:10.3390/electronics14244789

,

and

¹

Chengdu Quantum Matrix Technology Co., Ltd., Chengdu 610066, China

²

School of Resources and Environment, University of Electronic Science and Technology of China, Chengdu 611731, China

^*

Author to whom correspondence should be addressed.

^†

Current address: Hefei iFLY Digital Technology Co., Ltd., Hefei 230088, China.

Electronics2025, 14(24), 4789;https://doi.org/10.3390/electronics14244789

Version Notes

Order Reprints

Review Reports

Abstract

Behavior-rule reasoning aims to infer the corresponding applicable rules from specific behaviors and is a type of inductive reasoning that goes from special cases to general ones. This paper proposes a behavior-rule inference model based on hyponymy–hypernymy knowledge trees, which maps behaviors to corresponding rules through deep learning by processing textual behavior sequences. The primary contributions of this work are threefold: we present a systematic framework that adapts K-BERT to the legal domain by integrating domain-specific hyponymy–hypernymy knowledge trees, addressing the unique challenges of legal text understanding; we conduct comprehensive optimization of key components, including context length, loss function, and base model selection, providing empirical guidelines for applying pre-trained models to legal reasoning tasks; and we propose a practical evaluation metric (tolerance) that mimics real-world legal decision-making processes, providing extensive analysis on the effectiveness of different knowledge types in legal inference.

Keywords:

behavior-rule inference; hyponymy-hypernymy knowledge; K-BERT; multi-label classification

1. Introduction

Behavioral-rule reasoning provides a deeper understanding of the nature of things and the inherent patterns of behavior by observing and analyzing specific behaviors and inferring potential universal rules. Behavioral-rule reasoning is essentially a kind of inductive reasoning from the particular to the general; for example, a judge determines the applicable laws based on the various types of behaviors involved in a case, and a doctor deduces the possible patterns of a disease by analyzing a patient’s specific symptoms and behaviors. The rapid development of deep learning technologies provides unprecedented opportunities for solving complex reasoning problems, with the most critical technologies being pre-trained models and knowledge graphs.

Pre-trained models are able to learn rich semantic information from text by learning and training on large-scale corpora and can apply it to various text-understanding tasks. By pre-training on large-scale data, models can learn generic feature representations that can be migrated to new tasks, allowing for better results on small-scale datasets. Pre-trained models typically require significant computational resources and time to train, but once pre-training is complete, they can be easily fine-tuned on new tasks without the need to re-train the entire model.

Knowledge graphs, as a structured knowledge representation, have the ability to semantically model relationships and entities. Knowledge graphs usually consist of nodes and edges, where nodes represent real-world entities such as people, places, and things, and edges represent the associative relationships between these entities. The goal of knowledge graphs is to capture and organize rich knowledge so that computer systems can understand and reason about the world. Knowledge graphs can be used to support artificial intelligence applications such as question-and-answer systems, linguistic reasoning, and natural language generation by helping AI systems understand the entities and relationships involved in a text; they can also support entity disambiguation and relational reasoning.

The construction of a human knowledge system cannot be separated from language. The entities in language exist in contextual relationships. Contextual relationships [1] (Hypernymy) are a basic semantic relationship used to describe the hierarchical affiliation between concepts; for example, “Beagle is a canine”, and “Canine is a mammal.“ In natural language processing, contextual relations can be considered as a kind of knowledge graph to help models better understand human knowledge.

K-BERT aligns the entities in the knowledge graph with those in the text sequence and incorporates the relationship information in the knowledge graph into the pre-training process of BERT. In this paper, based on K-BERT and contextual knowledge trees, the behavior-rule inference module is constructed to realize the embedding, classification, and inference of input behaviors, and finally obtaining the rules corresponding to the input behaviors. However, in the field of law, especially in tasks involving reasoning about rules of conduct, existing knowledge-enhancement methods still face challenges. The hierarchical and specialized nature of legal knowledge, as well as the logical requirements of reasoning, demand more refined knowledge representation and integration mechanisms.

2. Theoretical Basis

2.1. Transformer

The Transformer [2] model is a sequence-to-sequence model whose input and output are sequences of tokens. It was originally created to handle machine translation tasks and initially drew attention due to its outstanding performance in the field of machine translation tasks. But, later research proved that the Transformer can be easily applied to other domains because of its well-characterized sequence-to-sequence architecture, and its output can be used to extract information by performing various forms of operations through the fully connected layer, not limited to vocabulary-to-vocabulary translation. As Figure 1 shows, this is the Transformer structure.

Figure 1. Structure of Transformer model (Source: Authors).

The Transformer model includes a multi-layer encoder–decoder structure. Each layer of the encoder includes two sub-layers: the first one is a multi-head self-attention layer, and the second one is a feed-forward neural network layer. Each self-attention head weights the embedding of the whole sentence once, extracts out the part attended to by that attention head, and finally performs the splicing of the attention vectors with residual concatenation. The feed-forward neural network layer receives the output of the attention layer as input and maps the weighted semantic information to the vector space where the output is more informative. The difference between encoder and decoder is the masking strategy of the self-attention layer, decoder only allows the self-attention layer to view the tokens before the current token, which is known as unidirectional self-attention, whereas encoder allows the self-attention layer to view the tokens before and after the current token, which is known as bidirectional self-attention.

The self-attention mechanism is the core difference between the Transformer model and the past structures such as CNN, RNN, LSTM, etc. The Transformer model is also much higher than the past models in terms of parameters, but still the back propagation of the gradient to drive the updating of the parameters to reduce the loss function as the training goal, after multiple generations of training, so that the predicted output and the real answer as much as possible, to complete the learning process. complete the learning process.

2.2. BERT

The Transformer model has become the most important and fundamental model in the field of natural language processing, and even in the entire field of deep learning. BERT [3] is a pre trained language model based on the Transformer architecture, also known as Bidirectional Encoder Representations from Transformers, which can effectively encode the semantic information of text sequences. The GPT-1 (Generative Pre trained Transformer 1) [4] based on the Transformer architecture released by OpenAI uses a pre training and fine-tuning method for training, and has achieved good results on multiple natural language generation tasks. Subsequent releases of GPT-2 [5], GPT-3 [6], and GPT-3.5 [7] have significantly increased model size and training data size, demonstrating better performance.

The structure of BERT model is shown in Figure 2. BERT is mainly divided into two phases: the pre-training phase and the fine-tuning phase. In the pre-training phase, BERT uses a large-scale unlabeled text corpus to train the model, and the training tasks include MLM (Masked Language Model) and NSP (Next Sentence Prediction), and the BERT pre-training combines the two tasks at the same time, summing up all the loses, and updating the model with the gradient descent parameters. In the fine-tuning phase, the pre-trained model of BERT is used as a feature extractor, and then one or more classifiers are added to fine-tune it according to different tasks. Since BERT obtains a large amount of linguistic knowledge in the pre-training phase, only a small amount of fine-tuning is needed in the fine-tuning phase to achieve better results in downstream tasks (e.g., text categorization, named entity recognition, etc.).

Figure 2. Structure of BERT model (Source: Authors).

The coding layer of the BERT model is essentially a stack of multiple Transformer structure encoders, in the case of BERT-base, it is a stack of 12 Transform encoders, of which the attention heads are 12. BERT only takes the encoder part of the Transformer. the Transformer structure encoders include the self-attention layer, residual network layer, and feedforward neural network layer. The details have been described in the previous section. The encoding layer of BERT performs the operations of the self-attention layer, the feedforward neural network layer and the residual network layer of the Transformer structure.

The design idea of the decoding layer of the BERT model is similar to that of the Transformer decoder, both of which connect a customized network with a Softmax function on top of the output of the encoding layer to process all the previous outputs as probabilities. As a pre-trained model, the decoding layer of BERT can actually be customized and trained by the user according to the needs, usually it is only necessary to define a fully connected layer and Softmax layer according to the specific needs of the task, which allows BERT to complete the traditional classification tasks. For tasks such as sequence annotation, translation, etc., the problem can be handled in the form of downscaling the problem to classification of each token, calculating the probability of translating to each word, and so on.

2.3. Extraction of Hierarchical Relationships

The typical techniques for identifying hierarchical relationships are mainly divided into pattern based methods and statistical and machine learning based methods [8]. The pattern based approach combines natural language processing with linguistic knowledge. Firstly, lexical and grammatical analysis is used to obtain the patterns of hierarchical relationships between entities, and then pattern matching is used to extract hierarchical relationships. The pattern matching method is relatively accurate, but it is limited due to the difficulty in exhausting patterns or the high demand for corpus. Only when two concepts appear simultaneously in a sentence can the relevant hierarchical relationships be extracted.

Tang Qing et al. [9] proposed a method for extracting hierarchical relationships based on a combination of syntactic analysis and rule matching. Li Junfeng et al. [10] proposed a semantic hierarchy acquisition method for patent domain ontology concepts, which utilizes relative modification degree and association rules to identify hierarchical relationships. Zhang Chentong et al. [11] used rule algorithms to identify the hierarchical relationships and designed a data enhanced BERT hierarchical relationship recognition algorithm when constructing a large-scale disease terminology map. Snow [12] extracts hierarchical relationships by automatically extracting patterns based on dependency paths. Suchanek et al. [13] extended the semantic resources contained in WordNet by combining structured knowledge from Wikipedia and knowledge from WordNet dictionaries. Lu Kaihua et al. [14] proposed a specification for annotation of hierarchical relationships and annotated a high-quality dataset containing Chinese word pairs. They proposed a dependency path representation model that integrates multiple features and encodes the dependency path information in co-occurrence sentences of word pairs.

Using distributed representations of two concepts based on machine learning methods to infer whether they have a hierarchical relationship can help solve the problem of co-occurrence sparsity between concepts.

Supervised learning mainly uses word embedding projection techniques and classification methods to train contextual relationship classification models. Huang Yi et al. [15] proposed a Conditional Random Field (CRF)-based domain term contextual relationship acquisition method, which utilizes the CRF machine learning technique to learn the intrinsic laws of contextual relationships between terms and obtain the probabilistic model of their expressions and existence environments. Ma et al. [16] combined word vector and Boot strapping for domain entity contextual relationship extraction. Sun Jiawei et al. [17] proposed to construct a fused word pattern embedding model using a simple feed forward neural network to fully utilize the contextual information of the utterance and the semantic information of the words. Wu Ting [18] et al. used synonymous reasoning to construct chapter-level entity contextual relationships and conducted experiments with texts in the field of national defense science and technology. Cheng-Yu Wang et al. [19] analyze the development framework of different word embedding projection models and their application in contextual relationship prediction.

Unsupervised learning mainly uses distributional assumptions and similarity methods for contextual relationship discrimination. Kozareva et al. [20] propose a better semi-supervised contextual relationship extraction method by combining WordNet and other dictionaries, but it is not applicable to the domain-specific contextual relationship extraction task due to the lack of specialized domain vocabulary in the dictionaries. Fu et al. [21] propose a remotely-supervised approach to extract entity superlatives from multivariate data to extract entity superlatives, and then utilize a statistical ranking model based on the new feature set to rank the candidate superlatives. However, the final output of the method is a list of superlatives for a given entity, and there may still exist some hierarchical relationships in these superlatives, while it is difficult for the method to recognize ultra-low-frequency superlatives well; and there are also some entities with too high co-occurrence frequency, which will be wrong due to the high ranking in the sorting. There are also scholars who try to extract the superordinate relations from the perspective of linguistics and semantic features. Liu Qi et al. [22] proposed a recognition method for typical class relations of Internet texts by extracting linguistic features and contextual semantic features to classify entities. Gan Lixin et al. [23] proposed a Chinese entity relationship extraction method based on syntactic semantic features by combining entity-dependent syntactic relationship formation features, which achieved good results.

After the emergence of deep learning technology, the aforementioned pattern matching based methods, machine learning and statistics based methods, and word vector based methods achieve less accuracy than the deep learning domain. Yejin Cho [24] constructed an encoder model to predict the context path by generating the path, which guarantees a certain degree of interpretability, while achieving better results. CTP [8] conducted experiments based on BERT with direct input of “A is B” and sentence pair reasoning, and proved its optimality among the current language models.

3. Results Model Construction

3.1. Overall Model Framewoek

For specific domains of specialization, especially in areas such as law, BERT cannot achieve the desired results due to the lack of specialized knowledge in the training corpus. Although there are methods for pre-training using large-scale corpora containing specialized knowledge, the process is time-consuming and computationally resource-intensive, and the cost of pre-training is unaffordable for most users, and the source and quality of the specialized knowledge corpus is difficult to ensure.

To solve the problem of BERT’s poor results in the specialized domain, one approach is to inject contextual knowledge trees into BERT. As a compact knowledge base containing structured knowledge, the integration of contextual knowledge trees into BERT can endow BERT with a certain degree of specialized domain knowledge. However, there are two main problems in injecting knowledge base for BERT: firstly, the Heterogeneous Embedding Space (HES), i.e., the vocabulary in the text is acquired in a different way from the entities in the external knowledge base, which leads to discontinuity in the vector space in which they are embedded, and makes it difficult to correspond effectively in the reasoning process; and secondly, the problem of knowledge noise, since the input to BERT is a one-dimensional linear sequence, and injecting a large amount of knowledge will transform the sequence into a two-dimensional multinomial tree-like structure, too much knowledge injection will significantly affect the performance of the BERT pre-training model and become a source of knowledge noise.

The K-BERT model [25] solves the above problem to some extent by finely injecting contextual knowledge into BERT, which makes the model perform better in inference tasks. K-BERT was proposed by WeijieLiu in 2019, and is a kind of BERT capable of applying contextual knowledge trees. In this section, firstly, the model structure of the original version of K-BERT is explained, and then explains the innovations made in this paper based on the original work. The structure of K-BERT model is shown in Figure 3.

Figure 3. Architecture of the K-BERT model (Source: Reprinted from Liu et al. [25]).

The model architecture of K-BERT consists of four modules, i.e., Knowledge Layer, Embedding Layer, Field of View Layer, and Mask-Transformer. For the input sentences, the Knowledge Layer first retrieves and injects the relevant triples from the Knowledge Graph to fuse the original sentences into the knowledge, so as to uplift it from a two-dimensional linear sentence to the semantic tree that contains the knowledge. The semantic tree is then fed into both the embedding and vision layers and converted into a Token-level embedding representation and a vision matrix. The field of view matrix is used to control the visible range of each Token to prevent the meaning of the original sentence from being changed due to the injection of too much knowledge. The model structure of K-BERT is shown in Figure 3. The main innovation is to mitigate the problem of heterogeneous embedding space and knowledge noise through the knowledge, field of view, and mask-transformer layers, and introduce knowledge into the BERT in a form that does not lose the original semantic information. BERT introduces knowledge in a form that does not lose the original semantic information.

In this paper, the following improvements are made to the model:

-: Replace the base model from BERT-base to RoBERTa model, and extend the context length from 256 to 512 to solve the problem of missing coded information due to insufficient context length.
-: Compare the effects of different knowledge graphs on RoBERTa model. The effects of HowNet, specific types of dispute maps, all types of dispute maps, and no maps on the discriminative effect of K-BERT model are compared.
-: The concept of tolerance is proposed, and an implementation path of rule-based recommendation task based on multi-label classification is given.

3.2. Construction of Contextual Knowledge Tree

The contextual knowledge tree takes behavioral concepts as nodes and contextual relationships of behaviors as edges. Like HowNet, the contextual knowledge tree can be used as an external knowledge base for K-BERT to enhance its reasoning. Generally speaking, the task of constructing a contextual knowledge tree consists of two parts, the first part is contextual relationship identification, i.e., for two given words, determining whether or not they have a contextual relationship, and the second part is knowledge tree reorganization, i.e., for multiple pairs of words in the same domain with contextual relationships, reorganizing them into a tree with a hierarchical structure of hierarchy. HowNet [26] is a large-scale Chinese vocabulary and conceptual tree proposed by Dong et al. proposes a large linguistic knowledge base of Chinese words and concepts, in which each Chinese word is annotated with a semantic unit called a semantic element. If subject, object, and relation are taken as a ternary group, HowNet can be used as a contextual knowledge tree.

Traditional contextual relationship recognition methods are based on pattern or distributed models, which reason about the contextual relationships between words from a large corpus, and the pattern-based method is most representative of Herast’s. Roller proved that the pattern-based model is better than the distributed model in most cases. Yuet pointed out that pattern-based extraction methods suffer from sparsity, and combining pattern-based and distributed methods through a complementary framework can effectively solve the sparsity problem and improve the results. The models before the transformer model are mainly based on rules and machine learning, and these models are unable to fully utilize the knowledge in the corpus due to too few parameters, resulting in far less effective extraction than the transformer model.

In the contextual relationship recognition part, this paper differs from the previous class-based or distributed model inference, but inference is based on the knowledge encoded in the BERT class of pre-trained language models. Essentially, the approach in this paper is to extract the rich patterns in the corpus in a more refined way, and creatively add the paraphrases obtained from the generative language model to the vocabulary, expand the vocabulary semantics through the paraphrases of the generative model, and finally extract the knowledge in it with a deep transformer structure, thus obtaining an inference accuracy rate far exceeding that of the traditional model.

The contextual relationship recognition based on the pre-trained model outputs a fully-connected entitlement graph, which decouples the work of reorganization from that of recognition, and enables the use of the fully-connected entitlement graph as an input to the reorganization phase in order to apply the algorithms related to maximum spanning trees. In this paper, the Chu-Liu-Edmonds algorithm, which has a computational complexity of no more than

O (n^{2})

in the face of medium-sized phrase scenarios, is used to find the maximum spanning tree within a phrase.

In the first step, for a given set V of n words, the relations between words are constructed traversally to obtain n(n − 1) relations. Because of the limitation of the tree structure, the number of true label relations obtained from wordnet in a tree of n nodes is n − 1, so that there are n − 1 positive relations and

{(n - 1)}^{2}

negative relations for a tree. Without the inclusion of gloss (In a knowledge graph, it usually refers to the definition or descriptive text of an entity), a sample “[CLS] (Classification token, used to aggregate semantic information of the entire sequence, commonly used for classification tasks) vi is a vj [SEP] (Separator marker, used to distinguish multiple sentences or fragments)” is constructed as an input to the model for one relation between the words vj and vj. In the case of adding gloss, a sample “[CLS]vi:vi’s gloss [SEP] vj:vj’s gloss [SEP]” is constructed as an input to the model for one relation between the words vi and vj. After the inputs are constructed, each relation is computed using the BERT family model and given logits to predict whether it is positive or negative, but here instead of getting the predicted labels directly in the form of argmax, the logits are used to form a fully-connected directed graph, which is fed into the second step.

In the second step, the Chu-Liu-Edmonds algorithm is used to find the maximum spanning tree of the directed graph, the significance of this step is to reorganize the graph into a tree structure, which improves the accuracy of constructing the contextual knowledge tree through structural adjustment. Thereafter, this paper studies the performance of different models and hyperparameters under the same dataset through the control variable method, and tries to find out the parameter that is most suitable for the model in order to obtain the best results within the scope of the study.

3.3. Knowledge Layer

In Section 3.2, an external knowledge base

(E_{1}, R, E_{2})

is constructed. In this paper, the relationships in the external knowledge base are only “is a” relationship, i.e., the relationship of superiority and inferiority, where

E_{1}

is the inferior word and

E_{2}

is the superior word. The input sentence is processed in the knowledge layer, and the superordinate word is added to the subordinate word in the sentence in two dimensions, and the output is added to the semantic tree that contains the knowledge. The depth of the knowledge graph at the time of semantic tree construction can be defined according to the need, if the depth is 1, then only one superordinate word expansion is performed.

A sentence is denoted by

s = {w_{0}, w_{1}, w_{2}, \dots, w_{n}}

, where n is the length of the sentence. The English Token is word-based, while in this study, in order to adapt to the Chinese scenario, unlike the general scheme of using a lexical model to slice a sentence into words, this paper treats each Chinese character as a Token.

w_{i}

of each Token is contained in a vocabulary list V,

w_{i} \in V

. Denote the Knowledge Graph as K. K is a collection of triples

ϵ = (w_{i}, r_{j}, w_{k})

, where

w_{i}

and

w_{k}

are the names of entities and

r_{j} \in V

is the relationship between them. All the triples are in the Knowledge Graph, i.e.,

ϵ \in k

.

The Knowledge Layer (KL) is used for sentence knowledge injection and sentence tree transformation. Specifically, given an input sentence

s = {w_{0}, w_{1}, w_{2}, \dots, w_{n}}

and a knowledge graph K, KL outputs a semantic tree of a sentence

t = w_{0}, w_{1}, \dots, w_{i} {(r_{i_{0}}, w_{i_{0}}), \dots, (r_{i_{k}}, w_{i_{k}})}, \dots, w_{n}

. This process can be divided into two steps: knowledge query (K-Query) and knowledge injection (K-Inject).

In K-Query, all involved entity names are selected from the sentence s and their corresponding triples in K are queried. K-Query can be represented as Equation (1), where

E = {(w_{i}, r_{i_{0}}, w_{i_{0}}), \dots, (w_{i}, r_{i_{k}}, w_{i_{k}})}

, is the set of corresponding triples.

E = K_Q u e r y (s, K)

(1)

t = K_I n j e c t (s, E)

(2)

Next, K-Inject injects knowledge by inserting the query result in E into the corresponding position in sentence s to generate a semantic tree t. The structure of the semantic tree is shown in Figure 4. In this paper, a sentence tree can have multiple branches, but its depth is fixed to 1, which means that the entity names in the triad will not iteratively generate branches. K-Inject can be represented as Equation (2).

Figure 4. Semantic tree structure (Source: Reprinted from Liu et al. [25]).

It is important to note that during this transformation from a linear sequence to a semantic tree, the number of nodes (words) and connections (syntactic relations) from the original sentence is preserved without reduction. The process solely involves the additive injection of new knowledge entities and their relational edges from the knowledge graph.

3.4. Embedding Layer

The function of the Embedding Layer is to transform the semantic tree into an embedding representation that can be input into the Transformer coding layer. Similar to BERT, the embedding representation of K-BERT is the sum of three parts: token embedding, positional embedding, and clause embedding, but the difference is that the input to the Embedding Layer of K-BERT is a semantic tree rather than a linear sequence of tokens. Therefore, how to convert the semantic tree into a sequence while still preserving its structural information is the key to K-BERT.

3.4.1. Token Embedding

In this work, the Token embedding is aligned with BERT, and in this paper, we use the word list provided by BERT-base provided by Google. Each token in the semantic tree is transformed into an embedding vector of dimension H by a trainable dictionary of word vectors. In addition, K-BERT uses [CLS] as the classification label and masks the Token using [MASK]. The difference between K-BERT and BERT for token embedding is that the tokens in the sentence tree need to be rearranged before the embedding operation. In the rearrangement strategy, the tokens in a branch are inserted behind the corresponding node, while the subsequent Token is moved backward.

In the example shown in Figure 5, the semantic tree is rearranged as “Tim Cook CEO Apple is visiting Beijing capital China is a City now”. Although this process is not complicated, it makes the sentence lose the natural language ordering, unreadable and lose the correct structural information, which is the core problem solved by K-BERT, which is addressed in this paper by the soft positional embedding and the vision matrix of the vision layer.

Figure 5. Transformation of an input sentence into a semantic tree via vision matrix (Source: Reprinted from Liu et al. [25]).

3.4.2. Soft Positional Embedding

For BERT, without positional embedding, it would be equivalent to a bag-of-words model, resulting in a lack of structural information (i.e., order of Token, order of sentence). The order of sentences is undoubtedly important; for a subject-verb-object structure, if the subject and object positions are swapped, the semantics of the sentence is completely different. Therefore, positional embedding is a necessary module in language modeling.

In K-BERT, all the structural information of the input sentence is contained in the positional embedding, thus enabling the structural information lost during knowledge injection to be added back to the rearranged unreadable sentence. Taking the semantic tree in Figure 5 as an example, after rearranging, [CEO] and [Apple] are inserted between [Cook] and [is], but the subject of [is] should be [Cook] instead of [Apple]. To solve this problem, it is only necessary to set the position number of [is] to 3 instead of 5. Thus, when calculating the self-attention score in the converter encoder, [is] passes equivalently in the next position of [Cook]. However this leads to another problem, both [is] and [CEO] have a position number of 3, which makes them very close to each other when calculating self-attention, but there is actually no informational connection between them. The solution to this problem is the masked self-attention mechanism, which is described in the next section.

3.4.3. Clause Embedding

Similar to BERT, K-BERT also uses paragraph embedding to recognize different sentences when they contain multiple sentences. For example, when two sentences

{w_{00}, w_{01}, \dots w_{0 n}}

and

{w_{10}, w_{11}, \dots w_{1 m}}

are entered, they are merged into a single sentence

[C L S], w_{00}, w_{01}, \dots w_{0 n}, [S E P], w_{10}, w_{11}, \dots w_{1 m}

, which is separated using [SEP]. For the merged sentence, a paragraph marker is used to distinguish the different sources, e.g.,

{A, A, A, A \dots A, B, B, B \dots B}

.

3.5. Vision Layer

The semantic tree is synchronized into the embedding layer and the horizon layer, both of which output embedding and horizon matrices. The embedding layer is divided into token, position, and clause embeddings as in BERT. However, the embedding layer of K-BERT inputs semantic trees instead of sentences, how to retain the features of trees while embedding sentences is the core problem of K-BERT. As shown in Figure 5, the sentence is reprogrammed as “Tim Cook CEO Apple is visiting Beijing capital China is a City now”, which leads to an unreadable sentence and loss of structural information, but it can be solved by the soft position and vision matrices. This will result in an unreadable sentence that loses structural information, but can be solved by soft location and vision matrix. The field of view matrix is shown in Figure 5.

For BERT, if there is no positional embedding, it is equivalent to input word bag, the order information of words in BERT all rely on the positional embedding, the reason is that there is a CEO3 before is in the sentence in Figure 5, Apple4 goes to modify the word, but it should modify Cook2 in fact, so it directly changes the positional embedding of is to 3 so that there are two 3’s: CEO and is, the way to solve the problem of two 3’s is the vision matrix. In the vision matrix. The hard positional embeddings (positional embeddings of the original sentence) are always visible to each other, in addition, each word in the original sentence can additionally see the part of the knowledge that modifies itself, e.g., Tim Cook’s visibility is 1, 2 rows in the vision matrix, and V(1,8), V(1,9), etc. in this matrix are white, which means that word 1 can’t see word 8 and word 9. by this method, K-BERT effectively avoids the knowledge noise problem.

The field of view layer is the biggest difference between K-BERT and BERT. The field of view layer is the biggest difference between K-BERT and BERT, and also the reason why K-BERT can inject knowledge without causing semantic confusion. The vision matrix is to prevent the original sentence from being tampered with by the injected knowledge, to solve the problems of the existence of 2 pos3 in positional embedding and the semantic confusion caused by the introduced knowledge. If China, Apple, etc. interact with each other in the sentence, the meaning of the sentence will be misinterpreted. The vision matrix specifies the vision of each word, and sets 1 if word i and word j are in the same branch, and 0 otherwise.

The input to K-BERT is a semantic tree where branches are knowledge obtained from the knowledge graph. However, the problem posed by knowledge is that it may lead to changes in the meaning and order of the original sentences, i.e., the knowledge injection problem. For example, in the semantic tree in Figure 5, [China] only modifies [Beijing], which is not related to [Apple]. Therefore, the representation of [Apple] should not be affected by [China]. On the other hand, the [CLS] tag used for categorization should not bypass [Cook] to get information about [Apple], as this may lead to semantic confusion. To avoid this, K-BERT uses a field-of-view matrix

M

to restrict the visible region of each tag so that [Apple] and [China], and [CLS] and [Apple] are not visible to each other. The field of view matrix

M

is defined as shown in Equation (3),where means that

w_{i}

and

w_{j}

are on the same branch, while % means that both are not on the same branch. i and j are hard position indices.

M_{i j} = \{\begin{matrix} 0 & , w_{i} ⊖ w_{j} \\ - \infty & , w_{i} ⊘ w_{j} \end{matrix}

(3)

3.6. Mask-Transformer Layer

In K-BERT, the field-of-view matrix M in fact also contains information about the structure of the semantic tree. Since the Transformer encoder in BERT cannot receive the visual field matrix

M

as an input, it needs to be modified to obtain the Mask-Transformer, which can de-constrain the region of self-attention according to

M

, which can reduce the complexity of the self-attention operation to some extent. Like BERT, Mask-Transformer is stacked by multiple masked self-attention blocks, denoting the number of stacked layers of the model encoder as L, the hidden layer dimension as H, and the number of self-attention headers as

A

.

In order to utilize the sentence structure information in

M

to prevent semantic confusion, this paper proposes the mask self-attention mechanism, which is an extension of self-attention. Formally, the mask self-attention is shown in Equation (4). Formulas (4)–(6) are all derived from Liu et al. [25]

Q^{i + 1} = h^{i} W^{q}, K^{i + 1} = h^{i} W^{k}, V^{i + 1} = h^{i} W^{v}

(4)

S^{i + 1} = s o f t m a x (\frac{Q^{i + 1} * {(K^{i + 1})}^{T} + M}{\sqrt{d_{k}}})

(5)

h^{i + 1} = S^{i + 1} V^{i + 1}

(6)

Q^{i + 1}

: Query matrix at the

(i + 1)

-th layer, derived from the hidden state

h^{i}

.

K^{i + 1}

: Key matrix at the

(i + 1)

-th layer, derived from the hidden state

h^{i}

. Note: To avoid confusion with the Knowledge base (denoted as KG in our paper), we use K exclusively for Key matrices.

V^{i + 1}

: Value matrix at the

(i + 1)

-th layer, derived from the hidden state

h^{i}

.

h^{i}

: Hidden state at the

i

-th layer.

W^{q}

,

W^{k}

and

W^{v}

are trainable weight matrices in the self-attention mechanism. They are initialized randomly and updated during training via backpropagation to minimize the loss function. Specifically, we use the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.01 to optimize these parameters. The training process involves forward propagation to compute the output, followed by backpropagation to calculate gradients and update weights.

h^{i}

is the hiding state of the ith mask self-attention block.

d_{k}

is the scaling factor, used to control the distribution of attention weights. m is the field of view matrix computed from the observation layer. Specifically, if

w_{k}

is invisible to

w_{j}

, then

M_{j k}

will mask the attention score, which means that

w_{k}

has no effect on the hidden state of

w_{j}

.

The softmax function is applied to the scaled dot-product attention scores to obtain the attention weights. Mathematically, it is defined as

softmax (x_{i}) = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}

, where

x_{i}

represents the attention score for the i-th token. This ensures that the weights sum to 1 and allows for probabilistic interpretation.

As shown in Figure 6,

h_{[A p p l e]}^{i}

has no effect on

h_{[C L S]}^{i + 1}

because [Apple] is not visible to [CLS]. However,

h_{[C L S]}^{i + 1}

can indirectly obtain information about

h_{[A P P L E]}^{i - 1}

through

h_{[A P P L E]}^{i - 1}

because [Apple] is visible to [Cook], which is visible to [CLS]. The advantage of this process is that [Apple] enriches the representation of [Cook] without directly affecting the meaning of the original sentence, thus realizing the design goal of K-BERT.

Figure 6. Mask-Transformer is a stack of Mask self-attention mechanisms (Source: Reprinted from Liu et al. [25]).

Transformer only supports sequential inputs, so in order to support two-dimensional inputs, the structure of Transformer is to be changed to Mask-Transformer, which is a stack of multiple Mask-Self-Transformers and has the same structure as Transformer.

As shown in Figure 6, by stacking Mask-Self-Transformers, combined with the visibility matrix,

{c l s}_{i + 1}

can be indirectly encoded by the introduced knowledge between self-attention layers of different depths, but without directly affecting the original sentence meaning.

4. Experimentation

4.1. Dataset

The self-constructed dataset in this paper originates from the Chinese Judgment Document Network. It leverages web scraping technology to acquire a substantial corpus of behavior-legal provision mappings for divorce disputes. These corpora were categorized and differentiated based on various dispute types, resulting in a comprehensive dataset employed for model training and validation. The training set comprises 652 case samples, while the validation set consists of 163 case samples. In total, the dataset encompasses 809 distinct legal provision labels. The total number of labels in the validation set is 350, with label overlap, indicating that the same legal provision may simultaneously appear in multiple cases.

This paper also utilizes DuEE, the largest current Chinese event extraction dataset, as a baseline. It compares the impact of BCELoss and CELoss on model performance in a multi-label classification scenario. The DuEE dataset, derived from Baidu’s 2020 Language and Intelligence Technology Competition, includes 17,000 sentences with event information across 65 event types (20,000 events). It is divided into a training set of 12,000 samples, a validation set of 1500 samples, and a test set of 3500 samples.

While this study primarily focuses on divorce cases to enable a deep and focused analysis, we acknowledge that evaluating the model’s performance on a broader range of legal domains is crucial for demonstrating generalizability. Experiments on additional datasets from other legal domains (e.g., contract disputes, criminal cases) are currently underway and will be included in the final version of this work to comprehensively assess the cross-domain applicability of our framework.

4.2. Evaluation Metrics

The evaluation metric in this paper draws inspiration from the TopK recall of recommendation systems [27] and views the behavior-rule reasoning system as a simple recommendation system. This recommendation system has the following characteristics: fixed candidates, with only 809 labels (statutes) as items; each user corresponds to a segment of behavioral text, and the goal of the recommendation system is to rank and provide the top n recommended items based on user features.

The aforementioned n is the common concept of tolerance in the recommendation system. By using tolerance, the ranking problem is transformed into a classification problem [28]. Tolerance classifies the top n scoring labels as positive and the rest as negative. When tolerance = −1, dynamic tolerance is applied. For each case, assuming there are n labels, the top n labels are selected as positive samples based on the model’s output scores, and the rest are negative samples. When tolerance = n, where n is not equal to −1, fixed tolerance is applied, and for each case, the top n predicted labels by the model are taken as the predicted result.

A tolerance of 10 is a relatively loose setting, allowing the model to have some margin of error, while quickly converging to the highest accuracy. A tolerance of −1 is the strictest requirement, where recall is 100% only when precision strictly equals true positive (TP), resulting in 100% accuracy at that point.

The formula for calculating recall is shown in Equation (7), where TP represents true positive labels predicted as true.

R e c a l l = \frac{T P}{T P + F N}

(7)

To ensure comprehensive evaluation and facilitate direct comparison with existing work, we will report a set of standard multi-label classification metrics alongside the tolerance metric. These include Macro-F1, Micro-F1, Precision, Recall, and Precision@K (for fixed K = 5 and K = 10). This multi-metric approach provides a more holistic and universally comparable view of the model’s performance.

We invited five legal experts to conduct a blind review of 100 test cases, evaluating the following dimensions:

-: Accuracy of rule inference (1–5 points)
-: Interpretability of the results (1–5 points)
-: Practical application value (1–5 points)

The results show an average score of 4.2 for accuracy, 3.8 for interpretability, and 4.0 for application value.

4.3. Experimental Configuration

All experiments in this paper were conducted on the same A100 40G, using torch version 1.12, transformer version 4.28.0, and other hyperparameter configurations as shown in Table 1.

Table 1. Hyperparameter configurations for experiments.

Our approach incorporates domain-specific knowledge through a structured injection mechanism. The process begins by extracting relevant conceptual relationships from a legal-domain knowledge graph. These relationships are then embedded into the input text using a tree-based representation that maintains both semantic integrity and structural information. A key innovation is our visibility control system, which intelligently manages how injected knowledge interacts with the original content to prevent information distortion while enhancing contextual understanding.

The model architecture builds upon the transformer-based framework with specific enhancements for knowledge-aware processing. The system features a multi-layer encoding structure that processes both textual and knowledge elements simultaneously. Each layer contains specialized attention mechanisms that handle the injected knowledge structures while preserving the original sentence meaning. The design includes adaptive components that dynamically adjust to different types of domain knowledge and text patterns.

Our training methodology employs a multi-phase approach that progressively incorporates domain knowledge. The process begins with foundation model initialization using pre-trained weights, followed by domain-specific fine-tuning with knowledge-enhanced datasets. We implement specialized regularization techniques that ensure stable learning of knowledge-text interactions. The training optimization focuses on balancing semantic preservation with knowledge integration, using customized loss functions that address the unique challenges of knowledge-enhanced learning.

4.4. Comparative Experiments

4.4.1. Comparison of Different Loss Functions

Cross-Entropy Loss (CE Loss) is mainly used to measure the difference between two probability distributions [29] and is the most commonly used loss function in the field of classification problems. During the training process of a model using the Cross-Entropy Loss function, the difference between the output distribution and the ideal distribution will be used as a coefficient to back-propagate the gradient of the model in order to update the parameters, so that the distribution of the new output of the model after updating the parameters is closer to the ideal distribution after receiving the same inputs, thus realizing the training of the model.

However, CE Loss is generally only applicable to single classification, i.e., classifying a sample into one class, and is not applicable to the multi-label classification [30] problem dealt with in this paper. Therefore, this model replaces the default cross entropy loss function of K-BERT with Binary-Cross-Entropy Loss (BCE Loss), which solves the problem that the cross entropy loss function performs poorly on multi-label classification.

For the task of multi-label classification with n labels, BCE Loss essentially binary-classifies each dimension of labels and determines whether the sample has labels of that dimension, which significantly improves the performance of the model on multi-label classification.

The cross entropy loss function is formulated as follows:

H (y, \hat{y}) = - \sum_{i = 1}^{C} y_{i} {\hat{\log y}}_{i}

(8)

where

y_{i}

can only take the value of 0 or 1, the class to which the sample belongs to is 1, and all other classes are 0.

{\hat{y}}_{i}

is the probability that the model output corresponds to the label after softmax, and the Loss is 0 only when

{\hat{y}}_{i}

is 1. For a single-label classification problem, only one of the above summation equations is 1, and the other terms are 0. For a multilabel classification problem, there may be multiple terms that are 1.

This paper compares the recall rates with BCELoss and CELoss on a public dataset in a multi-label classification scenario, with tolerance set to −1, as shown in Figure 7.

Figure 7. Recall comparison on the public dataset (Drawn by the authors).

It can be observed that BCELoss has a much lower recall rate than CELoss in the initial rounds. This is because CELoss performs only one classification, while BCELoss classifies on each dimension. BCELoss requires multiple epochs of training on each dimension to achieve fine recall effects. When multiple labels with probabilities of 1 are input to CELoss, CELoss can only uniformly distribute the sum of probabilities equal to 1 to positive labels to minimize the loss, resulting in fast convergence but inferior performance compared to BCELoss.

In this paper, on the self-built dataset of divorce dispute behavior-rule mappings, with tolerance set to −1, inference was performed using BCELoss and CELoss. The experimental results are shown in Figure 7.

The performance of BCELoss and CELoss follows a similar pattern to that observed on the public dataset mentioned earlier. The optimal recall rates for both sets of experiments are shown in Table 2. The Recall Comparison on the Self-Built Dataset is shown in Figure 8.

Table 2. Comparison Results of Different Loss Functions.

Figure 8. Recall Comparison on the Self-Built Dataset (Drawn by the authors).

4.4.2. Comparison of Different Hyperparameters

The context length of a model refers to the longest text length that the model can encode, for BERT models, the context lengths are usually 256 and 512, longer context lengths can encode more complete semantic information of the text, but because the complexity of the self-attention mechanism is

O (n^{2})

, expanding the context length brings a square-fold increase in the arithmetic overhead. In this paper, we compare the effect of 256 and 512 context lengths on model recall.

RoBERTa model [31] is a series of improvements based on the Chinese BERT model, and the improvement methods include the use of dynamic masking strategy, more and higher quality training corpus, and finer parameter tuning. This model is the most widely used model in the Chinese domain, synthesizes various task scenarios, and has the best performance. The experiments in this paper replace the base BERT model of K-BERT with the RoBERTa model, expecting to obtain better classification results.

The number of training rounds, the amount of data, and the BatchSize together determine the number of training steps, and during the training process using the linear warmup strategy, the learning rate is affected by the current number of training steps, which in turn affects the final training result. In this paper, we optimize these hyper-parameter configurations and investigate the effects of different context lengths, base models, and number of training rounds on model recall.

Experiments in this section were conducted using BCELoss without configuring graphs to study the impact of different hyperparameters on the optimal recall rate. The experimental results are presented in Table 3.

Table 3. Comparison Results of Different Hyperparameters.

Comparing Experiment 1 and Experiment 2, it can be concluded that an increase in context length leads to a 2.29 percentage point improvement. Comparing Experiment 1 and Experiment 3, it can be concluded that increasing the epoch from 20 to 50 results in a 10.85 percentage point improvement. Comparing Experiment 3 and Experiment 4, it is evident that the RoBERTa model is 5 percentage points lower than the BERT model, indicating significant overfitting.

The overall impact of hyperparameters on recall rate can be summarized as follows:

-: After replacing the base model with RoBERTa, it converges faster but gets stuck at a local optimum without further improvement in recall rate. This is a significant overfitting phenomenon, primarily because RoBERTa has a larger number of parameters compared to BERT, making the model more complex and prone to overfitting.
-: Training rounds (epoch) at 20 and 50 show an 11 percentage point difference in the highest accuracy. This phenomenon can be explained by the fact that at epoch 50, due to the impact of warmup, the learning rate is relatively lower in the early stages of training compared to epoch 20, resulting in slower convergence. It is not easy to fall into a local optimum. In contrast, training at epoch 20 quickly converges to a local optimum, and thereafter, accuracy cannot be further increased.
-: A context length of 512 shows a 2.29 percentage point improvement compared to 256. This phenomenon can be explained by the fact that a longer context length more effectively encodes semantics.

4.4.3. Comparison of Different Hypernym-Hyponym Knowledge Trees

Figure 9 compares the recall rates with HowNet knowledge tree, divorce dispute type concept knowledge tree, and no knowledge tree at different rounds.

Figure 9. Recall Comparison with Different Hypernym-Hyponym Knowledge Trees (Drawn by the authors).

It can be seen that the recall rate difference brought about by the three types of knowledge trees is not significant, which is consistent with the conclusions drawn in the K-BERT experiments. Different knowledge graphs can have an impact of around 3 percentage points on the model’s performance.

The specific optimal recall rates for the three sets of experiments are shown in Table 4. Performance Comparison between Full Knowledge Graph and Spanning Tree Injection are shown in Table 5. Therefore, we conclude that the spanning tree extraction is a crucial and effective step in our framework. It successfully balances the trade-off between knowledge completeness and model efficiency, ultimately enhancing performance by reducing noise and focusing on the most relevant information.

Table 4. Comparison Results of Different Hypernym-Hyponym Knowledge Trees.

Table 5. Performance Comparison between Full Knowledge Graph and Spanning Tree Injection.

5. Conclusions

This study developed and validated a behavior-rule inference model based on K-BERT and hyponymy-hypernymy knowledge trees for the legal domain. Unlike generic applications, our work provides a tailored, systematic frameworkthat addresses the specific challenges of legal text reasoning through structured knowledge integration and component optimization.

Our experimental results demonstrate that the proposed framework achieves a robust accuracy of 78.85% under dynamic tolerance settingsand 93.42% with a tolerance of 10, indicating its strong potential for real-world legal assistance systems. More importantly, our findings offer deeper insights beyond performance metrics:

The marginal performance gap between domain-specific knowledge (83.7%) and generic knowledge (HowNet, 81.7%) suggests that knowledge precision and structural relevancemay be more critical than mere domain affiliation for legal inference tasks. This implies that carefully constructed generic knowledge bases can be highly competitive, while noisy or overly broad domain knowledge may provide limited benefits.

The optimization process reveals that context length extensionis crucial for capturing long-range dependencies in legal narratives, while the BCE loss functionis essential for handling the multi-label nature of legal rule mapping. The substitution of RoBERTa, while not yielding immediate gains, highlights the importance of further research into mitigating overfitting in complex, data-scarce legal domains.

Limitations and Future Work: This study primarily focused on divorce cases, and its generalizability across other legal domains requires further validation. Future work will include cross-domain testing (e.g., criminal law, contract disputes) and comparisons with more state-of-the-art legal AI baselines. Additionally, we will explore more sophisticated knowledge representation methods beyond static triples, such as conditional rules and logical expressions, to better capture the complexities of legal reasoning. The implementation details and code will be made publicly available to facilitate reproducibility and future research.

Experimental results indicate that domain-specific knowledge trees perform exceptionally well in legal text reasoning tasks, mainly due to the following reasons:

(1): The hierarchical structure of legal concepts is clear, making it suitable for representation through hypernym-hyponym relationships.
(2): Domain knowledge helps to eliminate ambiguities in legal texts.
(3): The method of knowledge injection aligns with the thinking patterns of legal experts.

However, this study also has some limitations:

(1): The experiment is limited to divorce cases and needs to be extended to other legal areas.
(2): The construction of knowledge trees relies on manual annotation; future work could explore automatic construction methods.

In conclusion, this work provides not only a practical framework for legal AI applications but also valuable insights into the effective integration of knowledge into pre-trained models, paving the way for more intelligent and reliable legal decision-support systems.

Author Contributions

H.Z. was responsible for designing the framework of the paper, J.G. was in charge of designing the algorithm and coding the program, H.J. took on the role of testing and analyzing the code, L.G. and K.W. handled the optimization and analysis of the code as well as refining the manuscript, and L.Q. contributed to the optimization of the algorithm. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

The datasets used in this article are all publicly available and do not involve any ethical issues or unauthorized use of data without the knowledge of others.

Data Availability Statement

The data can be obtained by contacting the author directly.

Conflicts of Interest

Author Huanlai Zhou was employed by the company Chengdu Quantum Matrix Technology Co., Ltd. Author Jianyu Guo was employed by the company Hefei iFLY Digital Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Catherine, C.; Kevin, L.; Klein, D. Constructing Taxonomies from Pretrained Language Models. arXiv 2010, arXiv:2010.12813v2. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MI, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. J. Mach. Learn. Res. 2018, 19, 1–48. [Google Scholar]
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20), Vancouver, BC, Canada, 6–12 December 2020; Volume 33, pp. 1877–1901. [Google Scholar]
Black, S.; Gao, L.; Wang, P.; Leahy, C.; Biderman, S. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow; Zenodo: Geneva, Switzerland, 2021; Volume 2022, p. 5297715. [Google Scholar] [CrossRef]
Wu, S.; Zhao, X.; Tong, Y.; Zhang, R.; Shen, C.; Liu, H.; Li, F.; Zhu, H.; Luo, J.; Xu, L.; et al. Yuan 1.0: Large-Scale Pre-Trained Language Model in Zero-Shot and Few-Shot Learning. arXiv 2021, arXiv:2110.04725. [Google Scholar]
Tang, Q.; Lv, X.; Li, Z. Research on Domain Ontology Concept Hyponymy Relation Extraction. Microelectron. Comput. 2014, 31, 68–71. [Google Scholar]
Li, J.; Lv, X.; Li, Z. Deriving Concept Semantic Hierarchy of Ontology in Patents. J. China Soc. Sci. Tech. Inf. 2014, 33, 986–993. [Google Scholar]
Zhang, C.; Zhang, J.; Zhang, Z.; Ruan, T.; He, P.; Ge, X. Construction of Large-Scale Disease Terminology Graph with Common Terms. J. Comput. Res. Dev. 2020, 57, 2467–2477. [Google Scholar]
Snow, R. Learning Syntactic Patterns for Automatic Hypernym Discovery. In Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–11 December 2003; pp. 1297–1304. [Google Scholar]
Suchanek, F.M.; Kasneci, G.; Weikum, G. YAGO: A Large Ontology from Wikipedia and WordNet. J. Web Semant. 2008, 6, 203–217. [Google Scholar] [CrossRef]
Lu, K.; Li, Z.; Zhang, M. Data Construction and Benchmark Method Comparison for Chinese Hypernym-Hyponym Relation Classification. J. Xiamen Univ. Nat. Sci. Ed. 2020, 59, 1004–1010. [Google Scholar]
Huang, Y.; Wang, Q.; Liu, Y. A Method for Acquiring Domain-Specific Terminological Hyponymy Based on Conditional Random Fields. J. Cent. South Univ. Sci. Technol. 2013, 44 (Suppl. S2), 355–359. [Google Scholar]
Ma, X.; Guo, J.; Xian, Y.; Mao, C.; Yan, X.; Yu, Z. Entity Hyponymy Acquisition and Organization Combining Word Embedding and Bootstrapping in a Specific Domain. Comput. Sci. 2018, 45, 67–72. [Google Scholar]
Sun, J.; Li, Z.; Chen, W.; Zhang, M. Hypernym Relation Classification Based on Word Pattern Embedding. Acta Sci. Nat. Univ. Pekin. 2019, 55, 1–7. [Google Scholar]
Wu, T.; Li, M.; Kong, F. Construction of Textual Entity Hypernymy Corpus Based on Synonymy Reasoning. J. Chin. Inf. Process. 2020, 34, 38–46. [Google Scholar]
Wang, C.; He, X.; Gong, X.; Zhou, A. Word Embedding Projection Models for Hypernymy Relation Prediction. Chin. J. Comput. 2020, 43, 868–883. [Google Scholar]
Kozareva, Z.; Hovy, E. A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 1110–1118. [Google Scholar]
Fu, R.J.; Qin, B.; Liu, T. Exploiting Multiple Sources for Open-Domain Hypernym Discovery. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1224–1234. [Google Scholar]
Liu, Q.; Xiao, Y.; Wang, W. A Recognition Approach of Typical Generic-Specific Relation for Massive Chinese Text. Comput. Eng. 2015, 41, 26–30. [Google Scholar]
Gan, L.; Wan, C.; Liu, D.; Zhong, Q.; Jiang, T. Chinese Named Entity Relation Extraction Based on Syntactic and Semantic Features. J. Comput. Res. Dev. 2016, 53, 284–302. [Google Scholar]
Cho, Y.; Rodriguez, J.D.; Gao, Y.; Erk, K. Leveraging WordNet Paths for Neural Hypernym Prediction. In Proceedings of the 28th International Conference on Computational Linguistics, Online, 8–13 December 2020; Available online: https://aclanthology.org/2020.coling-main.268 (accessed on 1 December 2020).
Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Ju, Q.; Deng, H.; Wang, P. K-BERT: Enabling language representation with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 2901–2908. [Google Scholar]
Dong, Z.; Dong, Q.; Hao, C. How Net and the Computation of Meaning; Citeseer: University Park, PA, USA, 2006. [Google Scholar] [CrossRef]
Ma, Y. Design of Library Literature Recommendation System Based on Semantic Graph. Inf. Technol. 2023, 10, 147–151. [Google Scholar] [CrossRef]
Luo, Y. Research on Hierarchical Multi-Label Text Classification Algorithm Based on Deep Learning. Master’s Thesis, Sichuan University, Chengdu, China, 2021. [Google Scholar] [CrossRef]
Zhang, Q.; Yang, J.; Zhang, X. CS-Softmax: A Softmax Loss Function Based on Cosine Similarity. J. Comput. Res. Dev. 2022, 59, 936–949. [Google Scholar]
Duan, D.; Tang, J.; Wen, Y.; Yuan, K. Chinese Short Text Classification Algorithm Based on BERT Model. Comput. Eng. 2021, 47, 79–86. [Google Scholar] [CrossRef]
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]

Figure 1. Structure of Transformer model (Source: Authors).

Figure 2. Structure of BERT model (Source: Authors).

Figure 3. Architecture of the K-BERT model (Source: Reprinted from Liu et al. [25]).

Figure 4. Semantic tree structure (Source: Reprinted from Liu et al. [25]).

Figure 5. Transformation of an input sentence into a semantic tree via vision matrix (Source: Reprinted from Liu et al. [25]).

Figure 6. Mask-Transformer is a stack of Mask self-attention mechanisms (Source: Reprinted from Liu et al. [25]).

Figure 7. Recall comparison on the public dataset (Drawn by the authors).

Figure 8. Recall Comparison on the Self-Built Dataset (Drawn by the authors).

Figure 9. Recall Comparison with Different Hypernym-Hyponym Knowledge Trees (Drawn by the authors).

Table 1. Hyperparameter configurations for experiments.

Hyperparameter	Value	Hyperparameters	Value
Dropout	0.1	Max Epochs	50
Attention Dropout	0.1	Case Type	Divorce
batchsize	8	Learning Rate	0.00002
weight Decay	0.01	Tolerance	−1
Learning Rate Decay	Linear	Adam β1	0.9
Adamβ2	0.98

Table 2. Comparison Results of Different Loss Functions.

Dataset	Loss Function	Optimal Recall on Eval
Event Extraction	BCE Loss	0.943
Event Extraction	CE Loss	0.884
Divorce Dispute Mapping	BCE Loss	0.837
Divorce Dispute Mapping	CE Loss	0.797

Table 3. Comparison Results of Different Hyperparameters.

Num	Context Length	Base Model	Epochs	Optimal Recall on Eval
1	256	BERT	50	0.797
2	512	BERT	50	0.82
3	256	BERT	20	0.689
4	256	RoBERTa	20	0.634

Table 4. Comparison Results of Different Hypernym-Hyponym Knowledge Trees.

Dataset	Loss Function	Hypernym-Hyponym Knowledge Tree	Optimal Recall on Eval
Divorce Dispute Mappin	BCE Loss	HowNet	0.817
Divorce Dispute Mappin	BCE Loss	None	0.82
Divorce Dispute Mappin	BCE Loss	Divorce	0.837

Table 5. Performance Comparison between Full Knowledge Graph and Spanning Tree Injection.

Model Variant	Knowledge Structure	Accuracy (%)	Macro-F1
K-BERT (Ours)	Spanning Tree	83.7	0.819
K-BERT-Full	Full Graph	82.5	0.808
-	No Knowledge	82.0	0.801

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Behavior-Rule Inference Based on Hyponymy–Hypernymy Knowledge Tree

Abstract

1. Introduction

2. Theoretical Basis

2.1. Transformer

2.2. BERT

2.3. Extraction of Hierarchical Relationships

3. Results Model Construction

3.1. Overall Model Framewoek

3.2. Construction of Contextual Knowledge Tree

3.3. Knowledge Layer

3.4. Embedding Layer

3.4.1. Token Embedding

3.4.2. Soft Positional Embedding

3.4.3. Clause Embedding

3.5. Vision Layer

3.6. Mask-Transformer Layer

4. Experimentation

4.1. Dataset

4.2. Evaluation Metrics

4.3. Experimental Configuration

4.4. Comparative Experiments

4.4.1. Comparison of Different Loss Functions

4.4.2. Comparison of Different Hyperparameters

4.4.3. Comparison of Different Hypernym-Hyponym Knowledge Trees

5. Conclusions

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics