Article

Knowledge Graph Completion Using a Pre-Trained Language Model Based on Categorical Information and Multi-Layer Residual Attention

1 Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China
2 School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730030, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4453; https://doi.org/10.3390/app14114453
Submission received: 10 March 2024 / Revised: 4 May 2024 / Accepted: 9 May 2024 / Published: 23 May 2024

Abstract: Knowledge graph completion (KGC) utilizes known knowledge graph triples to infer and predict missing knowledge, making it one of the research hotspots in the field of knowledge graphs. Existing methods are still limited in generating high-quality entity embeddings and in fully understanding the contextual information of entities and relations. To overcome these challenges, this paper introduces a novel pre-trained language model (PLM)-based method for knowledge graph completion that enhances the quality of entity embeddings by integrating entity categorical information with textual descriptions. In addition, the method combines an innovative multi-layer residual attention network with PLMs, deepening the understanding of the joint contextual information of entities and relations. Experimental results on the FB15k-237 and WN18RR datasets demonstrate that the proposed model significantly outperforms existing baseline models in link prediction tasks.

1. Introduction

The goal of knowledge graph completion (KGC) is to find missing entities or relation links in an existing knowledge graph, thereby enhancing and expanding the knowledge graph by predicting missing information from known triplets. Currently, there are two main approaches in this field. The first approach is structure-based KGC, where entities and relations are represented as embedding vectors, and the matching scores for each triplet are computed by modeling their interactions. Typical models include TransE [1] and RotatE [2]. Although structure-based methods have made some progress, they perform poorly in predicting long-tail entities and struggle to obtain effective embedding representations for entities that appear infrequently during training. Additionally, these methods only use entity and relation embeddings for link prediction, without fully leveraging the semantic information of entities and relations to enrich their embedding representations. The second approach is text-based KGC, with notable models such as KG-BERT [3], StAR [4], and C-LMKE [5]. This approach combines textual descriptions of entities and relations with pre-trained language models to enrich the representation of entities and relations in the knowledge graph, thereby alleviating the performance issues of structure-based methods in handling long-tail entities and relations. However, this approach has two main problems. (1) The issue of entity embedding quality: Existing methods rely solely on combining textual descriptions with pre-trained language models to generate entity embeddings, which may not fully capture the semantic information of entities. (2) Insufficient understanding of context information between entities and relations: Current text-based methods typically transform the textual descriptions of entities and relations into their contextual embeddings using language models. However, these methods employ different strategies to enhance the embedding representations of entities and relations, overlooking the complex interactions between entities and relations, thereby struggling to capture the complex semantic correlations between entities and relations.
To address these issues, this paper proposes a knowledge graph completion method that combines entity category information with an attention mechanism. The main contributions can be summarized as follows:
(1)
Proposing a novel text-based contrastive learning model called CAKGC, which incorporates both entity category semantic information and textual descriptions to more accurately model entity embedding representations.
(2)
Designing a novel attention network, the Multi-Layer Residual Attention Network (MRAN), and combining it with pre-trained language models. The multi-layer self-attention mechanism allows the model to iterate and propagate information between entities, better capturing the complex relationships between them. By learning multiple attention heads simultaneously, the model can represent different types of relationships in parallel, enhancing its capacity for diverse relationships; each attention head can focus on a different relationship pattern, capturing finer-grained relational features.

2. Related Works

2.1. Structure-Based Approach

Existing structure-based knowledge graph completion methods can be classified into two main categories: (1) distance-based methods and (2) tensor-decomposition-based methods.
In distance-based methods, TransE [1] is one of the earliest proposed models. It projects entities and relations into a vector space and treats relations as translation operations from head entities to tail entities. However, TransE’s scoring function is too simplistic to effectively represent the semantic differences between different entities under the same relation. To address this issue, TransH [6] introduces the concept of relation hyperplanes, which allow for different representations of the same entity under different relations. TransR [7] defines different mapping matrices for different relation categories, but this leads to a large number of model parameters. To tackle the parameter explosion problem, TransD [8] decomposes the relation mapping matrix. Addressing the inability of TransE to handle symmetric relations, RotatE [2] represents relations as rotational operations in the complex space, enabling better learning and differentiation of symmetric relations. PairRE [9] projects relations into the embeddings of head and tail entities and further models sub-relations based on RotatE while simplifying the complexity of extending the complex space.
Tensor-decomposition-based methods were initially proposed with RESCAL [10]. RESCAL models each triple with the bilinear product of the head entity vector v_h, the relation matrix M_r, and the tail entity vector v_t, requiring v_h^T M_r v_t = T_{hrt}, and the loss function is optimized to maximize the scoring function v_h^T M_r v_t. The drawback of RESCAL is that it assigns a separate matrix to each relation, resulting in a large number of parameters. To address this, DistMult [11] constrains the relation matrix to be diagonal, reducing the model parameters and significantly improving computational efficiency. To handle asymmetric relations, ComplEx [12] introduces complex-valued operations into the matrix multiplication. HolE [13] computes circular correlations between head and tail entities, an operation that offers advantages in computational efficiency and space utilization. SimplE [14] addresses DistMult’s inability to model asymmetric relations by introducing separate distributed representations for entities and relations, resulting in fewer parameters than RESCAL. Inspired by automated machine learning, Zhang et al. propose AutoBLM [15], an algorithm for automatically designing and discovering better scoring functions for knowledge graph embeddings.
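To make the two families of scoring functions concrete, the following minimal sketch implements the TransE translation score and the DistMult bilinear (diagonal) score. The embedding dimension and the toy usage are illustrative assumptions, not settings from the cited papers.

```python
import torch

def transe_score(h, r, t, p=1):
    # TransE: relations act as translations; a true triple should satisfy h + r ≈ t,
    # so the score is the negative distance ||h + r - t||_p.
    return -torch.norm(h + r - t, p=p, dim=-1)

def distmult_score(h, r, t):
    # DistMult: RESCAL's full relation matrix M_r is constrained to a diagonal,
    # giving the bilinear score h^T diag(r) t with far fewer parameters.
    return torch.sum(h * r * t, dim=-1)

# Toy usage with randomly initialized 50-dimensional embeddings (assumed size).
dim = 50
h, r, t = (torch.randn(dim) for _ in range(3))
print(transe_score(h, r, t).item(), distmult_score(h, r, t).item())
```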
As shown in Table 1, both methods only learn some structural information in the knowledge graph through triples. For long-tail entities and relations, embedding representations learned through structure-based methods often produce poor completion results.

2.2. Text-Description-Based Approach

Research has shown that pre-trained language models, such as the Transformer [16]-based bidirectional encoder BERT [17], can significantly improve the performance of natural language processing tasks with limited data. Calderón-Suárez R [18] introduced a novel data augmentation method that utilizes lyrics to improve the generalization ability of the method and enhance its performance. For long-tail entities and relations in knowledge graphs, traditional methods that utilize structural information often yield randomly initialized entity and relation representations, which leads to a sharp decline in reasoning performance on long-tail entities. To better leverage the rich semantic knowledge in pre-trained language models to represent long-tail data, KG-BERT [3] was proposed as the first model to fine-tune pre-trained language models for knowledge graph completion tasks. This model treats triplets as textual sequences and uses the output probabilities from BERT to judge the truthfulness of triplets. Building upon KG-BERT, Pretrain-KGE [19] introduces a general training framework consisting of three stages: semantic-based fine-tuning, knowledge extraction, and KG training. It learns the structural information of triplets while retaining some of the knowledge from BERT to learn better knowledge graph embeddings. Inspired by masked language modeling (MLM), MEM-KGC [20] is proposed by BONGGEUN CHOI et al. This method first masks the tail entity and treats the head entity and relation as the context for the tail entity. Then, it predicts the masked entity from all entities, effectively capturing the characteristics of unseen entities. KG-BERT only encodes the textual descriptions of entities and relations, overlooking the structural information in the knowledge graph. To address this, StAR [4] by Bo Wang et al. combines text and structure information. This model adopts a Siamese network structure with a pre-trained language model as the encoder to fully utilize the textual information of entities and relations. To overcome the limited number of negative samples and the issue of some entities lacking textual descriptions, LMKE [5] by Xintao Wang et al. introduces contrastive learning into text-based reasoning methods. The embeddings of entities within a batch are used as negative samples for each other, avoiding the extra cost of encoding negative samples. For entities without textual descriptions, LMKE jointly uses entity names, relation names, relation textual descriptions, and other information to represent their embeddings, enriching the representation of entities with missing information. SimKGC [21] argues that the key to model performance lies in an efficient negative sampling method for contrastive learning. Therefore, this method designs three types of negative sampling: within-batch negative sampling, batch-level negative sampling, and a simplified form of self-negative sampling for hard negatives. These methods have brought new ideas and performance improvements to knowledge graph embedding and completion tasks.

3. CAKGC

In this section, we elaborate on the proposed knowledge graph completion method using a pre-trained language model based on categorical information and multi-layer residual attention (CAKGC). The overall architecture is depicted in Figure 1. The model consists of two main components: an encoder and a decoder. The encoder takes the textual descriptions and category information about entities and relations as input and encodes them using a dual encoder. The decoder takes the encoded sequences and feeds them into the improved attention network MRAN, which dynamically adjusts the attention focus based on different contextual information. Finally, the triplet’s score indicating its truthfulness is computed using a two-layer MLP.

3.1. Encoder

The CAKGC model employs a dual-encoder architecture. The two encoders are initialized using the same pre-trained language model, BERT, but they do not share parameters. Given a triplet (h, r, t), we consider the case of predicting the tail entity. Specifically, the input of the first encoder is defined as Equation (1):
x_1 = h \oplus h_{des} \oplus [CLS] \oplus h_{type} \oplus r \oplus r_{des} \oplus [MASK]    (1)
where h and r represent the head entity and relation of the triplet, and h_des, r_des, and h_type denote the textual description of the head entity, the textual description of the relation, and the category information of the head entity, respectively. Compared with the textual description sequences, category information sequences are typically much shorter. When modeling long sequences, the short category sequence may not provide sufficient contextual signal because the model must process a large amount of other information; in such cases, the model may prioritize the textual descriptions and neglect the category information, leading to a performance decline.
To help the model better understand long sequences and to guide it to pay more attention to the category information, we prepend a special token [CLS] before the category information. Additionally, we adopt the Masked Entity Modeling (MEM) approach and replace the tail entity with the special token [MASK]. The input sequence x_1 is then fed into the pre-trained language model BERT, and the output y_1 ∈ R^H of the first encoder, which represents the embedding of the masked tail entity, is taken at the [MASK] position. This is formalized in Equation (2):
y_1 = BERT_1(x_1)_{mask}    (2)
Similarly, the input of the second encoder also incorporates the entity category information, as shown in Equation (3):
x_2 = t \oplus t_{des} \oplus [CLS] \oplus t_{type}    (3)
where t, t_des, and t_type represent the tail entity, the textual description of the tail entity, and the category of the tail entity, respectively. To emphasize the importance of the category information, we again prepend a special token [CLS] before it. The input sequence x_2 is fed into BERT, and the output y_2 ∈ R^H of the second encoder is obtained; it represents the embedding of the tail entity encoded by the pre-trained language model BERT, as shown in Equation (4).
y_2 = BERT_2(x_2)_t    (4)
For a given triplet (h, r, t), we thus use two encoders to obtain the embedding representations of the head and tail entities. The first encoder encodes the textual description of the head entity, its category information, and the textual description of the relation, yielding the embedding y_1 of the head entity–relation pair. The second encoder jointly encodes the textual description and category information of the tail entity, yielding the embedding y_2 of the tail entity. Thus, (y_1, y_2) can be regarded as a positive sample pair. Additionally, we can replace the tail entity t with other entities from the same batch and encode them with the second encoder, obtaining multiple negative sample pairs. We aim to maximize the similarity between positive sample pairs and minimize the similarity between negative sample pairs. Through contrastive learning between positive and negative sample pairs, the model learns more accurate representations of entities and relations. Furthermore, jointly encoding the textual description and category information significantly enriches the entity representations, especially for entities with short textual descriptions. A sketch of this dual-encoder setup is given below.
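The following sketch illustrates the dual-encoder inputs of Equations (1) and (3) with the Hugging Face transformers API. It is a simplified reading of the method: the checkpoint name, the maximum sequence length, and the choice of reading y_2 at the first token of the tail-entity mention are assumptions, and in practice both encoders would be fine-tuned jointly with the decoder.

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder1 = BertModel.from_pretrained("bert-base-uncased")  # encodes (h, r, [MASK])
encoder2 = BertModel.from_pretrained("bert-base-uncased")  # encodes the tail entity
# The two encoders start from the same checkpoint but do not share parameters
# during fine-tuning, as described in Section 3.1.

def encode_query(h, h_des, h_type, r, r_des):
    # Equation (1): x1 = h ⊕ h_des ⊕ [CLS] ⊕ h_type ⊕ r ⊕ r_des ⊕ [MASK]
    # "[CLS]" marks the start of the category segment; "[MASK]" stands in for the tail.
    x1 = f"{h} {h_des} [CLS] {h_type} {r} {r_des} [MASK]"
    enc = tokenizer(x1, return_tensors="pt", truncation=True, max_length=128)
    out = encoder1(**enc).last_hidden_state                        # (1, seq_len, H)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    return out[0, mask_pos].squeeze(0)                              # y1: hidden state at [MASK]

def encode_tail(t, t_des, t_type):
    # Equation (3): x2 = t ⊕ t_des ⊕ [CLS] ⊕ t_type
    x2 = f"{t} {t_des} [CLS] {t_type}"
    enc = tokenizer(x2, return_tensors="pt", truncation=True, max_length=128)
    out = encoder2(**enc).last_hidden_state
    # Assumption: y2 is read at the first token of the tail-entity mention
    # (index 1, right after the [CLS] the tokenizer prepends).
    return out[0, 1]
```

In-batch negatives then come essentially for free: the y_1 of one query is scored against the y_2 vectors of every other tail entity encoded in the same batch.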

3.2. Decoder

In the decoder, the embedding y_1 from the first encoder (which encodes the masked tail entity) and the embedding y_2 from the second encoder (which encodes the tail entity) are concatenated and expanded for similarity calculation. This calculation determines the probability that the current triplet is a positive triplet (i.e., the probability that the candidate entity or relation is the true one). Specifically, given the inputs y_1, y_2 ∈ R^H, they are expanded to form the input sequence z_1 ∈ R^{4×H}, as shown in Equation (5).
z_1 = [y_1; y_2; y_1 \circ y_2; y_1 - y_2]    (5)
The similarity calculation consists of two parts: (1) MRAN and (2) MLP. The MRAN component is utilized to capture complex relationships between entities, resulting in improved entity and relation embeddings. Subsequently, two linear layers are employed to obtain the similarity scores between the current entity pairs.
MRAN. To effectively capture key information from the context of entities and relations, we design a novel attention mechanism, inspired by ResiDual, called the Multi-Layer Residual Attention Network (MRAN), as shown in Figure 2. Each layer consists of a residual attention module. The initial inputs of MRAN are u_0 = z_1 and v_0 = z_1. For the k-th layer of MRAN, the output is given by Equations (6) and (7):
u_k = LayerNorm(u_{k-1} + Attn(u_{k-1}))    (6)
v_k = v_{k-1} + Attn(u_{k-1})    (7)
where u_{k-1} and v_{k-1} are the outputs of the (k−1)-th layer and serve as the inputs to the k-th layer, and Attn denotes the attention mechanism. To better capture complex relationships within the input sequence, we employ a variant of multi-head self-attention with learnable transformations. Specifically, for the attention part of the k-th layer of MRAN, linear transformations are applied to the input u_{k-1}, as in Equations (8)–(10):
Q^k = Linear_1^k(u_{k-1})    (8)
K^k = Linear_2^k(u_{k-1})    (9)
V^k = Linear_3^k(u_{k-1})    (10)
Q^k, K^k, and V^k are split into h heads. For the i-th head, where Q_i^k, K_i^k, V_i^k ∈ R^{4×h_s} (with h_s = H/h), the weighted output out_attn_i^k ∈ R^{4×h_s} is calculated as follows:
attn_i^k = Softmax\left( \frac{Q_i^k (K_i^k)^T}{\sqrt{h_s}} \right)    (11)
out\_attn_i^k = attn_i^k \cdot V_i^k    (12)
where attn_i^k denotes the normalized attention scores of the i-th head and h_s denotes the dimensionality of each head. The final weighted output out^k ∈ R^{4×H} is obtained by concatenating the weighted outputs of all h heads and applying a linear transformation, as in Equations (13) and (14):
out\_attn^k = [out\_attn_1^k; out\_attn_2^k; \dots; out\_attn_h^k]    (13)
out^k = Linear_4^k(out\_attn^k)    (14)
To alleviate gradient vanishing or explosion during training, MRAN incorporates residual connections; to accelerate training and facilitate convergence, it introduces layer normalization. For the final (N-th) layer of MRAN, the outputs are denoted u_N and v_N. The output v_N is layer-normalized and added to u_N to obtain the final output z_2 ∈ R^{4×H}, as in Equation (15):
z_2 = u_N + LayerNorm(v_N)    (15)
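A minimal PyTorch sketch of MRAN as described by Equations (6)–(15) follows; the number of heads and layers, and the treatment of z_1 as a batch of length-4 sequences, are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttentionLayer(nn.Module):
    """One MRAN layer, Equations (6)-(14); hyperparameters are illustrative."""
    def __init__(self, hidden, heads):
        super().__init__()
        assert hidden % heads == 0
        self.heads, self.hs = heads, hidden // heads
        self.q = nn.Linear(hidden, hidden)    # Linear_1^k
        self.k = nn.Linear(hidden, hidden)    # Linear_2^k
        self.v = nn.Linear(hidden, hidden)    # Linear_3^k
        self.out = nn.Linear(hidden, hidden)  # Linear_4^k
        self.norm = nn.LayerNorm(hidden)

    def attention(self, u):
        B, L, H = u.shape                                   # here L = 4, see Equation (5)
        def split(x):                                       # (B, L, H) -> (B, heads, L, hs)
            return x.view(B, L, self.heads, self.hs).transpose(1, 2)
        Q, K, V = split(self.q(u)), split(self.k(u)), split(self.v(u))
        attn = F.softmax(Q @ K.transpose(-2, -1) / self.hs ** 0.5, dim=-1)  # Eq. (11)
        out = (attn @ V).transpose(1, 2).reshape(B, L, H)                   # Eqs. (12)-(13)
        return self.out(out)                                                # Eq. (14)

    def forward(self, u_prev, v_prev):
        a = self.attention(u_prev)
        u = self.norm(u_prev + a)   # Eq. (6): Post-LN-style branch
        v = v_prev + a              # Eq. (7): un-normalized residual branch
        return u, v

class MRAN(nn.Module):
    def __init__(self, hidden, heads=4, layers=3):          # heads/layers assumed
        super().__init__()
        self.layers = nn.ModuleList(
            [ResidualAttentionLayer(hidden, heads) for _ in range(layers)])
        self.final_norm = nn.LayerNorm(hidden)

    def forward(self, z1):
        u = v = z1                                           # u_0 = v_0 = z_1
        for layer in self.layers:
            u, v = layer(u, v)
        return u + self.final_norm(v)                        # Eq. (15): z_2 = u_N + LayerNorm(v_N)
```

For example, with BERT-base hidden size 768, `MRAN(768)(z1)` maps an input z_1 of shape (batch, 4, 768) to z_2 of the same shape.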
MLP. For computational convenience, z_2 ∈ R^{4×H} is reshaped into z_3 ∈ R^{4H}, and a ReLU activation introduces non-linearity into the model. To further enhance performance, the degree information of the head entity, d_h, and of the tail entity, d_t, is incorporated. After two MLP layers, the probability p(h, r, t) that the input triplet is true is obtained, as in Equations (16) and (17):
z_4 = ReLU(Linear_5([z_3; d_h; d_t]))    (16)
p(h, r, t) = Linear_6(z_4)    (17)
Loss. To reduce the time complexity of the model, we adopt the negative-sample construction technique from LMKE: within the same batch, the tail entity of the current triplet is replaced with the other entities in that batch to form a set of negative triplets. For a single triplet (h, r, t) in the current batch, when predicting the tail entity, we use a self-adversarial negative sampling loss, as shown in Equation (18):
L(h, r, t) = -\log p(h, r, t^+) - \sum_{t^- \in T} \frac{\exp(p(h, r, t^-))}{\sum_{t' \in T} \exp(p(h, r, t'))} \cdot \log(1 - p(h, r, t^-))    (18)
Here, t^+ and t^- denote the positive and negative tail-entity samples, respectively, and T denotes the set of negative samples for the tail entity.
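The decoder head and the in-batch objective can be sketched as follows. The sigmoid that turns the output of Linear_6 into a probability, the feeding of the degree features d_h and d_t as scalars, and the use of the score matrix's diagonal as the positives are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoringHead(nn.Module):
    """MLP decoder head, Equations (16)-(17); the sigmoid is assumed so that
    p(h, r, t) can be read as a probability in (0, 1)."""
    def __init__(self, hidden):
        super().__init__()
        self.fc1 = nn.Linear(4 * hidden + 2, hidden)   # Linear_5, input = [z_3; d_h; d_t]
        self.fc2 = nn.Linear(hidden, 1)                # Linear_6

    def forward(self, z2, d_h, d_t):
        z3 = z2.reshape(z2.size(0), -1)                # flatten (B, 4, H) -> (B, 4H)
        z4 = F.relu(self.fc1(torch.cat([z3, d_h, d_t], dim=-1)))
        return torch.sigmoid(self.fc2(z4)).squeeze(-1)  # p(h, r, t), shape (B,)

def self_adversarial_loss(p_matrix):
    """In-batch loss in the spirit of Equation (18). p_matrix[i, j] is the
    probability that query i is completed by the j-th in-batch tail entity;
    the diagonal holds the positive triples, off-diagonal entries are negatives."""
    B = p_matrix.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=p_matrix.device)
    pos = p_matrix[eye]                                    # p(h, r, t+)
    neg = p_matrix.masked_fill(eye, float("-inf"))         # exclude positives from weighting
    weights = F.softmax(neg, dim=-1)                       # exp(p) / sum exp(p) over negatives
    neg_term = (weights * torch.log1p(-p_matrix.clamp(max=1 - 1e-6))).sum(dim=-1)
    return (-torch.log(pos + 1e-6) - neg_term).mean()
```

Here, p_matrix would be built by scoring every (y_1, y_2) pair in the batch with the MRAN decoder and the head above.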

4. Experiment

4.1. Datasets

To evaluate the effectiveness of the proposed method in this paper, we employed two public datasets, FB15k-237 and WN18RR, as well as a Tibetan Thangka dataset specifically constructed for Thangka knowledge graph completion.
FB15k-237 is a dataset based on the Freebase knowledge graph, containing 14,951 entities, 237 relations, and 310,116 triplets. It is a subset of FB15k and is designed to achieve a more balanced dataset by removing some low-frequency relations. The dataset provides three parts: a training set with 272,115 triplets, a validation set with 17,535 triplets, and a test set with 20,466 triplets.
WN18RR is a dataset based on the WordNet knowledge graph, consisting of 40,943 entities, 11 relations, and a total of 90,343 triplets. The triplets are divided into training, validation, and test sets. The training set contains 86,835 triplets, the validation set contains 3034 triplets, and the test set contains 3134 triplets. This dataset is commonly used to evaluate the performance of knowledge graph completion algorithms. WN18RR is a simplified and revised version of the WN18 dataset, designed to increase the difficulty of training models and better reflect real-world applications.
The Thangka domain knowledge graph, constructed by our project team, describes the history, figures, events, era, style, genre, material, and craftsmanship of Thangka, a special form of painting art. Due to the semi-automatic and manual construction process, there are issues such as incomplete entities, insufficient relationships, and limited knowledge. Therefore, we propose the CAKGC knowledge completion method and validate it on the Thangka dataset. The dataset consists of 3491 triplets, including 1399 entities and 13 relations. It is divided into a training set (2793 triplets), a validation set (349 triplets), and a test set (349 triplets) in a ratio of 8:1:1.

4.2. Experimental Setup and Evaluation Metrics

During the training process, we utilized BERT-base [17] and BERT-tiny [22] as pre-trained language models. BERT-base is a pre-trained deep bidirectional Transformer model with 110 million parameters, consisting of 12 Transformer encoder layers and 768 hidden units. On the other hand, BERT-tiny is a smaller-scale pre-trained bidirectional Transformer model. Compared to BERT-base, BERT-tiny has a smaller model size with only two Transformer encoder layers and 10 million parameters, making it suitable for resource-constrained environments. We trained the CAKGC model with the BERT-base version on an NVIDIA GeForce RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) and the BERT-tiny version on an NVIDIA Tesla K80 GPU (NVIDIA, Santa Clara, CA, USA).
In the entity prediction task of knowledge graph completion, MRR, Hits@1, Hits@3, and Hits@10 are commonly used evaluation metrics to measure the performance of models. They are defined as follows:
MRR: For each query, the candidate entities are sorted in descending order of the model’s predicted probabilities, the reciprocal rank of the correct entity is computed, and the reciprocal ranks are averaged over all queries, as in Equation (19):
MRR = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{rank_i}    (19)
where |S| is the number of test queries and rank_i is the rank of the correct entity in the candidate entity list of the i-th query, sorted in descending order of predicted probability.
Hits@k: For each query, the candidate entities are sorted in descending order of the predicted probabilities, and Hits@k measures the fraction of queries for which the correct entity appears among the top k predictions, as in Equation (20):
Hits@k = \frac{1}{|S|} \sum_{i=1}^{|S|} I(rank_i \le k)    (20)
where |S| is the number of test queries and I(·) is an indicator function taking values in {0, 1}: I = 1 when rank_i ≤ k, and I = 0 otherwise.
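A minimal sketch of how MRR and Hits@k can be computed from a score matrix follows; the tie-breaking rule (counting only strictly higher-scored candidates) and the toy example are assumptions.

```python
import torch

def ranking_metrics(scores, target_idx, ks=(1, 3, 10)):
    """Compute MRR and Hits@k (Equations (19)-(20)) from a (num_queries, num_candidates)
    score matrix; target_idx[i] is the index of the correct entity for query i."""
    # Rank of the correct entity = 1 + number of candidates scored strictly higher.
    target_scores = scores.gather(1, target_idx.unsqueeze(1))       # (Q, 1)
    ranks = 1 + (scores > target_scores).sum(dim=1).float()         # (Q,)
    metrics = {"MRR": (1.0 / ranks).mean().item()}
    for k in ks:
        metrics[f"Hits@{k}"] = (ranks <= k).float().mean().item()
    return metrics

# Toy example: 2 queries over 5 candidate entities (random scores, assumed setup).
scores = torch.randn(2, 5)
targets = torch.tensor([3, 0])
print(ranking_metrics(scores, targets))
```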

4.3. Comparison with Baseline Model Performance

CAKGC (BERT-tiny) and CAKGC (BERT-base) were compared with existing structure-based and text-based methods on the link prediction task. The structure-based methods included TransE, RotatE, DistMult, and AutoBLM; the text-based methods included Pretrain-KGE, KG-BERT, StAR, MEM-KGC, C-LMKE, and SimKGC.
Two versions of the CAKGC model, based on BERT-tiny with 10 million parameters and BERT-base with 110 million parameters, were employed for link prediction on the two popular public datasets, FB15k-237 and WN18RR. Table 3 presents the results, showing that CAKGC achieves state-of-the-art performance on the FB15k-237 dataset and competitive performance on the WN18RR dataset. Specifically, the CAKGC (BERT-tiny) model achieves MRR, Hits@1, Hits@3, and Hits@10 scores of 0.416, 0.335, 0.451, and 0.571 on FB15k-237; compared with the previous best-performing model, C-LMKE (BERT-tiny), this improves MRR, Hits@1, and Hits@3 by 1%, 1.6%, and 0.6%, respectively, while matching its Hits@10. On the WN18RR dataset, the CAKGC (BERT-base) model performs second best, trailing only the SOTA model SimKGC. Compared with C-LMKE (BERT-base), CAKGC (BERT-base) achieves improvements of 3.8%, 6.4%, and 2.3% in MRR, Hits@1, and Hits@3, respectively. Furthermore, compared with C-LMKE (BERT-tiny), the CAKGC (BERT-tiny) model improves MRR, Hits@1, Hits@3, and Hits@10 by 3.4%, 4.5%, 2.7%, and 1.2%, respectively.
From the results presented in Table 3, it can be observed that the performance improvement of our model on the FB15k-237 dataset is not as significant as that on the WN18RR dataset. We attribute this to two main reasons. (1) Compared to the WN18RR dataset, the FB15k-237 dataset has a higher average degree for each entity, indicating richer neighborhood information for entities in the FB15k-237 dataset. We can see that even advanced structure-based methods like AutoBLM achieve good performance on FB15k-237, indicating that modeling only textual descriptions is insufficient, and the internal structural information of the knowledge graph is equally important. (2) The FB15k-237 dataset contains a large number of two-hop relations, while the WN18RR dataset only consists of one-hop relations. Therefore, link prediction on FB15k-237 is more challenging, as the information obtained solely from textual descriptions is limited and cannot support better reasoning for two-hop or even multi-hop relations.
In the link prediction experiments conducted on the Thangka dataset we constructed, we compared our model with C-LMKE (BERT-base). According to Table 4, our model achieves MRR, Hits@1, Hits@3, and Hits@10 scores of 0.612, 0.566, 0.635, and 0.703, respectively, on the Thangka knowledge graph completion dataset, improving on C-LMKE (BERT-base) by 1.9%, 3.2%, 1.2%, and 2%, respectively.

4.4. Ablation Experiment

To demonstrate the effectiveness of incorporating entity category information and the Multi-Layer Residual Attention Network (MRAN), we conducted ablation experiments, as shown in Table 5. The experiments consisted of four groups: (1) the full CAKGC model, (2) CAKGC without entity category information, (3) CAKGC without MRAN, and (4) CAKGC without both entity category information and MRAN. The results in Table 5 show that performance decreases when either MRAN or the entity category information is removed, demonstrating that both components contribute to the model’s performance.

5. Discussion

5.1. Entity Visualization

To qualitatively assess the effectiveness of adding category information to our proposed model, we focused on the three categories with the largest numbers of instances in the WN18RR dataset. We randomly selected 100 entities from each category and used the second encoder of the CAKGC model, which incorporates category information, to obtain embeddings for these entities. In Figure 3, entities from different categories are clearly well separated in the embedding space. This result not only demonstrates the high quality of the entity embeddings learned by our model but also shows that adding category information substantially improves the learned representations. The visualization procedure is sketched below.
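A minimal sketch of the visualization pipeline, assuming the 300 selected entity embeddings and their category labels are already available (the random tensors below are placeholders for the real y_2 vectors):

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# 100 entities from each of 3 WN18RR categories; placeholder embeddings of size H = 768.
embeddings = torch.randn(300, 768)
labels = torch.arange(3).repeat_interleave(100)

# Project the high-dimensional embeddings to 2D with t-SNE and plot by category.
points = TSNE(n_components=2, random_state=0).fit_transform(embeddings.numpy())
plt.scatter(points[:, 0], points[:, 1], c=labels.numpy(), s=8)
plt.title("t-SNE of entity embeddings by category (WN18RR)")
plt.show()
```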

5.2. Sparse Entity Prediction

For entity prediction on the sparse WN18RR knowledge graph, we used the BERT-tiny pre-trained language model. The test-set entities were divided into ten groups based on their degree values: the i-th group consists of entities whose degree lies in the range [2^i, 2^{i+1} − 1]. From Figure 4, it can be observed that CAKGC outperforms C-LMKE in terms of Hits@1 for sparse entity prediction, demonstrating a significant improvement. The grouping procedure is sketched below.
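The degree bucketing used for this analysis can be sketched as follows (the dictionary input format is an assumption):

```python
import math
from collections import defaultdict

def bucket_by_degree(entity_degree):
    """Group entities into degree buckets [2^i, 2^(i+1) - 1], as used for the
    sparse-entity analysis in Section 5.2."""
    buckets = defaultdict(list)
    for entity, degree in entity_degree.items():
        i = int(math.log2(max(degree, 1)))   # bucket index i with 2^i <= degree < 2^(i+1)
        buckets[i].append(entity)
    return buckets

print(bucket_by_degree({"e1": 1, "e2": 3, "e3": 17, "e4": 70}))
# -> {0: ['e1'], 1: ['e2'], 4: ['e3'], 6: ['e4']}
```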

5.3. Attention Mechanism

For the attention component in MRAN, we explored three attention mechanisms: Pre-Layer Normalization attention (Pre-LN), Post-Layer Normalization attention (Post-LN), and Residual Bi-directional Layer Normalization attention (ResiDual), as illustrated in Figure 5. The attention layers were stacked three times (N = 3).
From Table 6, it can be observed that ResiDual outperforms the traditional Pre-LN and Post-LN attention mechanisms. In Pre-LN, the normalization is performed before the self-attention, which means that the normalization heavily depends on the input. In Post-LN, the input of each sub-layer is the sum of the outputs from all preceding sub-layers. Therefore, any errors in the preceding sub-layers can accumulate and propagate to subsequent sub-layers, potentially leading to error accumulation. ResiDual, with its bidirectional attention mechanism, effectively addresses both of these issues, resulting in further performance improvement.
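For reference, the three block orderings compared in Table 6 differ only in where layer normalization sits relative to the residual sum. A schematic single-layer sketch is shown below, with nn.MultiheadAttention standing in for the attention variant used in MRAN; it is an illustration of the orderings, not the exact implementation.

```python
import torch.nn as nn

class AttentionBlockVariants(nn.Module):
    """Schematic forward passes for the three orderings compared in Table 6."""
    def __init__(self, hidden, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def post_ln(self, x):
        # Post-LN: normalize after the residual sum; errors from earlier
        # sub-layers flow into later ones through the summed path.
        return self.norm(x + self.attn(x, x, x)[0])

    def pre_ln(self, x):
        # Pre-LN: normalize the input before attention, so the normalization
        # depends directly on the raw input of each sub-layer.
        y = self.norm(x)
        return x + self.attn(y, y, y)[0]

    def residual_dual(self, x, v):
        # ResiDual-style: keep both a Post-LN branch (u) and an un-normalized
        # accumulator branch (v), mirroring Equations (6)-(7).
        a = self.attn(x, x, x)[0]
        return self.norm(x + a), v + a
```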

6. Conclusions

In order to overcome the limitations of text-based methods in understanding the contextual information between entities and relations and to improve the quality of entity embeddings in knowledge graph completion, this paper proposes a novel contrastive learning method called CAKGC. CAKGC significantly improves the quality of entity embeddings by introducing the categorical semantic information of entities. In addition, we design a novel attention mechanism, the Multi-Layer Residual Attention Network (MRAN), to more accurately capture the contextual information between entities and relations. Experimental results on the FB15k-237 and WN18RR datasets show that our model is among the best-performing models. Nevertheless, the generalization ability of the model on a wider range of datasets and its dependence on advanced hardware still require further study. Future work will explore optimizing computational efficiency and extending the model to more diverse knowledge graphs. In summary, the CAKGC method provides an effective solution for knowledge graph completion, deepening and improving the accuracy of entity and relation understanding.

6.1. Future Work

(1) SimKGC achieves the best performance on the WN18RR dataset, mainly due to the design of a high-quality negative sampling strategy. In future work, we will focus on constructing high-quality negative samples to further improve the performance of CAKGC on WN18RR. (2) Recent studies have shown that MLP architectures perform comparably to attention architectures in the field of image processing, while attention mechanisms have significantly higher time complexity than MLP. To reduce the model’s time complexity, future work will explore designing an MLP architecture to replace the attention mechanism in MRAN, with the expectation of achieving comparable or even better performance. (3) The performance improvement of CAKGC on FB15k-237 is not as significant as on WN18RR, possibly due to its text-based nature, which fails to consider the rich contextual information of entities. In future work, we plan to incorporate graph neural networks to capture the neighborhood information of entities and further enhance the model’s performance by effectively integrating the entity’s neighborhood information with text descriptions and category information.

6.2. Limitations of the Research

(1) Model complexity: Our method CAKGC has two versions, BERT-base and BERT-tiny. CAKGC (BERT-base) is trained on an NVIDIA RTX 3090, reflecting its high GPU-memory requirements. This reliance on high-end hardware may limit the deployment of the model in environments with limited computing resources and may increase the cost of research and practice. (2) Dependence on the quality of text descriptions: The performance of the model may rely heavily on the quality of the text descriptions of entities and relations. If these descriptions are uninformative or biased, they may affect the quality of the embeddings and the predictive power of the model. (3) Limitations of the attention mechanism: Although multiple layers of residual attention are used, the attention mechanism may still have difficulty capturing all key contextual information, especially in large-scale graphs with complex relations or a large number of entities. (4) Completeness and accuracy of entity category information: The performance of the model depends in part on the completeness and accuracy of the entity category information. If the category information is incomplete or wrong, it may degrade the performance of the model.

Author Contributions

Conceptualization, Q.R. and Y.Y.; Methodology, Q.R.; Validation, Q.R. and K.W.; Writing—original draft, Q.R.; Writing—review & editing, T.W. and X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2023 Basic Research Operating Expenses of Central Universities (No. 31920230175), National Natural Science Foundation of China (No. 62166035), Natural Science Foundation of Gansu Province (No. 21JR7RA163).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, Proceedings of the 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, NV, USA, 5–10 December 2013; Curran Associates, Inc.: Red Hook, NY, USA, 2013. [Google Scholar]
  2. Sun, Z.; Deng, Z.H.; Nie, J.Y.; Tang, J. Rotate: Knowledge graph embedding by relational rotation in complex space. arXiv 2019, arXiv:1902.10197. [Google Scholar]
  3. Yao, L.; Mao, C.; Luo, Y. KG-BERT: BERT for knowledge graph completion. arXiv 2019, arXiv:1909.03193. [Google Scholar]
  4. Wang, B.; Shen, T.; Long, G.; Zhou, T.; Wang, Y.; Chang, Y. Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion. Proc. Web Conf. 2021, 2021, 1737–1748. [Google Scholar]
  5. Wang, X.; He, Q.; Liang, J.; Xiao, Y. Language Models as Knowledge Embeddings. In Proceedings of the International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022. [Google Scholar]
  6. Wang, Z.; Zhang, J.; Feng, J.; Chen, Z. Knowledge Graph Embedding by Translating on Hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec, QC, Canada, 27–31 July 2014. [Google Scholar]
  7. Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; Zhu, X. Learning Entity and Relation Embeddings for Knowledge Graph Completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015. [Google Scholar]
  8. Ji, G.; Liu, K.; He, S.; Zhao, J. Knowledge Graph Completion with Adaptive Sparse Transfer Matrix. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  9. Chao, L.; He, J.; Wang, T.; Chu, W. PairRE: Knowledge graph embeddings via paired relation vectors. arXiv 2020, arXiv:2011.03798. [Google Scholar]
  10. Nickel, M.; Tresp, V.; Kriegel, H.-P. A Three-Way Model for Collective Learning on Multi-Relational Data. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011. [Google Scholar]
  11. Yang, B.; Yih, W.; He, X.; Gao, J.; Deng, L. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. arXiv 2014, arXiv:1412.6575. [Google Scholar]
  12. Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; Bouchard, G. Complex Embeddings for Simple Link Prediction. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016. [Google Scholar]
  13. Nickel, M.; Rosasco, L.; Poggio, T. Holographic Embeddings of Knowledge Graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  14. Kazemi, S.M.; Poole, D. SimplE Embedding for Link Prediction in Knowledge Graphs. In Advances in Neural Information Processing Systems 31, Proceedings of the Annual Conference on Neural Information Processing Systems 2018, Montréal, QC, Canada, 3–8 December 2018; Curran Associates, Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
  15. Zhang, Y.; Yao, Q.; Kwok, J.T. Bilinear Scoring Function Search for Knowledge Graph Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1458–1473. [Google Scholar] [CrossRef] [PubMed]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  17. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, Minneapolis, MN, USA, 6–7 June 2019. [Google Scholar]
  18. Calderón-Suárez, R.; Ortega-Mendoza, R.M.; Montes-Gómez, M.; Toxqui-Quitl, C.; Márquez-Vera, M.A. Enhancing the Detection of Misogynistic Content in Social Media by Transferring Knowledge from Song Phrases. IEEE Access 2023, 11, 13179–13190. [Google Scholar] [CrossRef]
  19. Zhang, Z.; Liu, X.; Zhang, Y.; Su, Q.; Sun, X.; He, B. Pretrain-KGE: Learning knowledge representation from pretrained language models. In Findings of the Association for Computational Linguistics; Association for Computational Linguistics: Baltimore, MD, USA, 2020. [Google Scholar]
  20. Choi, B.; Jang, D.; Ko, Y. MEM-KGC: Masked entity model for knowledge graph completion with pre-trained language model. IEEE Access 2021, 9, 132025–132032. [Google Scholar] [CrossRef]
  21. Wang, L.; Zhao, W.; Wei, Z.; Liu, J. SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022. [Google Scholar]
  22. Turc, I.; Chang, M.-W.; Lee, K.; Toutanova, K. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. arXiv 2019, arXiv:1908.08962. [Google Scholar]
Figure 1. Overview of CAKGC.
Figure 2. Structure of MRAN.
Figure 3. Entity embedding t-SNE two-dimensional visualization on the WN18RR dataset.
Figure 4. Entity prediction of different degrees on the WN18RR dataset.
Figure 5. Different attention mechanisms.
Table 1. Comparison of distance-based and tensor-decomposition-based methods.
Method Category | Key Innovations
Distance-based methods | Introduce various improvements to manage the semantic representation of relations, including translation operations, relation hyperplanes, relation-specific mapping matrices, and complex rotational operations to better handle symmetrical relationships.
Tensor-decomposition-based methods | Focus on matrix factorization techniques that optimize computational efficiency and parameter count, addressing asymmetry in relations and automating the design of scoring functions.
Table 2. Dataset statistics.
Dataset | Entities | Relations | Train | Valid | Test
FB15k-237 | 14,151 | 237 | 272,115 | 17,535 | 20,466
WN18RR | 40,943 | 11 | 86,835 | 3034 | 3134
Thangka | 1399 | 13 | 2793 | 349 | 349
Table 3. Link prediction experiments on the FB15k-237 and WN18RR datasets.
Method | FB15k-237 (MRR / Hits@1 / Hits@3 / Hits@10) | WN18RR (MRR / Hits@1 / Hits@3 / Hits@10)
TransE | 0.279 / 0.198 / 0.376 / 0.441 | 0.243 / 0.043 / 0.441 / 0.532
RotatE | 0.338 / 0.241 / 0.375 / 0.533 | 0.476 / 0.428 / 0.492 / 0.571
DistMult | 0.241 / 0.155 / 0.263 / 0.419 | 0.430 / 0.390 / 0.440 / 0.490
AutoBLM | 0.364 / 0.270 / - / 0.553 | 0.492 / 0.452 / - / 0.567
Pretrain-KGE | 0.332 / - / - / 0.529 | 0.235 / 0.263 / 0.423 / 0.557
KG-BERT | - / - / - / 0.420 | 0.216 / 0.041 / 0.302 / 0.524
StAR | 0.296 / 0.205 / 0.322 / 0.482 | 0.401 / 0.243 / 0.491 / 0.709
MEM-KGC (BERT-base) | 0.346 / 0.253 / 0.381 / 0.531 | 0.557 / 0.475 / 0.604 / 0.704
C-LMKE (BERT-tiny) | 0.406 / 0.319 / 0.445 / 0.571 | 0.545 / 0.467 / 0.587 / 0.692
C-LMKE (BERT-base) | 0.404 / 0.324 / 0.439 / 0.556 | 0.598 / 0.480 / 0.675 / 0.806
SimKGC | 0.336 / 0.249 / 0.362 / 0.511 | 0.666 / 0.587 / 0.717 / 0.800
CAKGC (BERT-tiny) | 0.416 / 0.335 / 0.451 / 0.571 | 0.579 / 0.512 / 0.614 / 0.704
CAKGC (BERT-base) | - / - / - / - | 0.636 / 0.544 / 0.698 / 0.795
Table 4. Link prediction experiments on the Thangka dataset.
Model | MRR | Hits@1 | Hits@3 | Hits@10
C-LMKE (BERT-base) | 0.593 | 0.534 | 0.623 | 0.683
CAKGC (BERT-base) | 0.612 | 0.566 | 0.635 | 0.703
Table 5. Ablation experiments of the CAKGC (BERT-tiny) model on the WN18RR dataset.
Model | MRR | Hits@1 | Hits@3 | Hits@10
CAKGC | 0.579 | 0.512 | 0.614 | 0.704
CAKGC (w/o type) | 0.571 | 0.502 | 0.606 | 0.698
CAKGC (w/o attn) | 0.550 | 0.482 | 0.580 | 0.681
CAKGC (w/o type and attn) | 0.545 | 0.467 | 0.587 | 0.692
Table 6. Comparison of entity prediction performance on the WN18RR dataset under different attention mechanisms. The model is CAKGC (BERT-tiny).
Model | MRR | Hits@1 | Hits@3 | Hits@10
CAKGC (Pre-LN-Attention) | 0.568 | 0.499 | 0.602 | 0.693
CAKGC (Post-LN-Attention) | 0.573 | 0.504 | 0.609 | 0.701
CAKGC (ResiDual-Attention) | 0.579 | 0.512 | 0.614 | 0.704

