A Dynamic Convolutional Network-Based Model for Knowledge Graph Completion

Abstract: Knowledge graph embedding learns low-dimensional vector representations for knowledge graph entities and relations, and has been a main research topic for knowledge graph completion. Several recent works suggest that convolutional neural network (CNN)-based models can capture interactions between head and relation embeddings, and hence perform well on knowledge graph completion. However, previous convolutional network models have ignored the different contributions that different interaction features make to the experimental results. In this paper, we propose a novel embedding model named DyConvNE for knowledge base completion. DyConvNE uses a dynamic convolution kernel, which can assign weights of varying importance to interaction features. We also propose a new method of negative sampling that mines hard negative samples as additional negative samples for training. Experiments on the datasets WN18RR and FB15k-237 show that our method outperforms several other benchmark algorithms for knowledge graph completion. In addition, we use a new testing method, named specific-relationship testing, when predicting the Hits@1 values on WN18RR and FB15k-237. This method gives about a 2% relative improvement in Hits@1 over models that do not use it.


Introduction
Knowledge graphs are usually expressed in a highly structured form, where nodes denote entities and edges represent the relations between entities. Knowledge can be represented as a triple (h, r, t), where h and t stand for the head and tail entities, respectively, and r represents the relation from h to t. At present, knowledge graphs are widely used in many fields of artificial intelligence, such as automatic question answering [1], dialogue generation [2,3], personalized recommendation [4], and knowledge reasoning [5]. However, most open knowledge graphs, such as Freebase [6], Wikidata [7], and DBpedia [8], are constructed by automatic or semi-automatic methods. These graphs are usually sparse, and the implicit relationships among a large number of entities have not been fully mined. In Freebase, 71% of people do not have an exact date of birth, and 75% do not have nationality information. The incompleteness of knowledge graphs has become a major concern, and knowledge graph embedding (KGE) [9] is an effective way to address it. Knowledge graph embedding generates low-dimensional embedding vectors of entities and relations based on the existing triples in the knowledge graph, and then inputs these vectors into a score function, which can be a network, to predict the missing relationships in the knowledge graph [10]. All models strive to make positive triples score higher and negative triples score lower.
Recently, convolutional networks have been widely used to generate low-dimensional embedding vectors of entities and relations in knowledge graphs, because convolutional networks can increase and capture the interaction features between head and relation embeddings.

Related Work
RESCAL [16], NTN [17], and HOLE [18] are typical decomposition-based models. Both RESCAL and NTN use tensor products. These tensor products capture rich interactions, but require a large number of parameters to model a relation, so the calculations are expensive. To overcome these shortcomings, HOLE uses the circular correlation of entity embeddings to create a more effective and scalable compositional representation.
In contrast, translation-based models such as TransE [19], DISTMULT [20], and ComplEx [21] are much simpler. TransE regards a relation in the knowledge graph as a translation vector between entities: for each triple (h, r, t), TransE uses the vector R of relation r as the translation between the head entity vector H and the tail entity vector T, so that H + R ≈ T for a valid triple. DISTMULT learns embeddings with a bilinear diagonal model, a special case of the bilinear projection used in NTN, and models entity relations with weighted element-wise dot products. ComplEx learns embeddings using complex-valued vectors and Hermitian dot products. These models are faster, require fewer parameters, and are relatively easy to train.
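The translation idea behind TransE can be made concrete in a few lines of NumPy; the embeddings below are toy values invented purely for illustration, not learned parameters:

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: negative L2 distance between h + r and t.
    Higher (closer to zero) means more plausible."""
    return -np.linalg.norm(h + r - t)

# Toy 4-dimensional embeddings (illustrative values only).
h = np.array([0.1, 0.2, 0.3, 0.4])
r = np.array([0.5, 0.0, -0.1, 0.1])
t_good = h + r                          # tail that matches the translation
t_bad = np.array([1.0, 1.0, 1.0, 1.0])  # unrelated tail
```

Here `t_good` scores 0 (a perfect translation), while any mismatched tail scores strictly lower, which is exactly the ordering the model is trained to produce.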
Many neural network models have also been proposed for knowledge graph embedding. In particular, CNN-based methods capture interaction features with parameter-efficient operators, such as ConvE [12], ConvKB [22], and InteractE [11]. ConvE reshapes the initial input into a matrix form and then uses 2D convolution to predict links. It consists of a convolutional layer, a fully connected layer, and an inner product layer for the final prediction. Using multiple filters to extract global relationships generates different feature maps, and the concatenation of these feature maps represents the input triple. ConvKB is an improvement of ConvE: it does not need to reshape the input, and it captures more interaction features through its convolutional layer. Compared with ConvE, which captures local relationships, ConvKB retains the translation property and shows better experimental performance. InteractE showed that capturing more interactions between head and relation embeddings benefits the final prediction, so it reorders the initial vectors, reshapes the feature combinations into several matrices, and feeds those matrices into the convolutional layer to obtain more feature interactions. However, InteractE only increases the interaction between head and relation embeddings without taking into account that different interaction features differ in importance. Therefore, we propose to use dynamic convolution to assign weights of varying importance to interaction features.

Our Approach
In this section, we first describe the background and definitions used in the rest of the paper in Section 3.1 and introduce our model in Section 3.2. Then we introduce the dynamic convolutional network in Section 3.3, and the method of mining hard negative samples in Section 3.4. Finally, we introduce the loss function used in Section 3.5.

Definition
Knowledge Graph, Knowledge Graph Embedding and Negative Sampling are defined as follows.
Definition 1 (Knowledge Graph). G = (ε, R), where ε and R indicate the entity set (nodes) and relation set (edges) of the knowledge graph, respectively. A triple (h, r, t) represents the relation (edge) r ∈ R between head entity (node) h ∈ ε and tail entity (node) t ∈ ε in G.
Definition 2 (Knowledge Graph Embedding). Knowledge graph embedding aims to learn an effective representation of entities and relations together with a scoring function f, which can be a network, such that for a given input triple v = (e_h, e_r, e_t), where e_h, e_r, e_t denote the embedding vectors of h, r, t, f(v) gives v a higher score if v is valid. Therefore, with the learned entity and relation embeddings and the scoring function, we can predict the missing head entity h given the query (?, r, t), or the missing tail entity t given the query (h, r, ?).
Definition 3 (Negative Sampling). Negative sampling generates negative triples by corrupting valid triples. Let K+ = {(h_j, r_j, t_j) | j = 1, 2, ..., N} denote the complete knowledge graph, where (h_j, r_j, t_j) represents a valid triple. Negative sampling produces a set of corrupted triples (negative triples) T = {(h, r, t′) | t′ ∈ ε, (h, r, t′) ∉ K+}. During model training, negative sampling draws a certain number of negative triples from T.
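A minimal sketch of this corruption procedure in Python, using hypothetical string entities and a toy knowledge graph (entity names and the helper are illustrative, not part of any benchmark):

```python
import random

def corrupt_tails(triple, entities, known, k, seed=0):
    """Sample up to k negative triples by replacing the tail entity with
    entities that do not yield a triple already in the knowledge graph."""
    h, r, t = triple
    candidates = [e for e in sorted(entities) if (h, r, e) not in known]
    rng = random.Random(seed)
    return [(h, r, e) for e in rng.sample(candidates, min(k, len(candidates)))]

# Toy knowledge graph with hypothetical entities.
known = {("paris", "capital_of", "france"), ("berlin", "capital_of", "germany")}
entities = {"paris", "france", "berlin", "germany", "lyon"}
negs = corrupt_tails(("paris", "capital_of", "france"), entities, known, k=3)
```

Every returned triple keeps the original head and relation but carries a tail for which the triple is absent from the graph.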

Our Model
The overall framework of our model is shown in Figure 1, and Table 1 presents its parameters. Our model contains three dynamic convolutional layers, a flatten layer, and a fully connected layer (FC). The first dynamic convolutional layer (DConv1) has 64 filters with a 3 × 3 kernel. The next two dynamic convolutional layers (DConv2, DConv3) have 128 and 256 filters, respectively, both with 3 × 3 kernels. All dynamic convolutional layers use the same padding and stride; the calculation rule of the output is defined in Section 3.3. For example, suppose we have the valid triple p = (h, r, t) and the input query q = (h, r, ?). The embeddings corresponding to h, r, t are e_h, e_r, e_t, respectively, and the dimensionality of the embeddings is 200. First, we use the Chequer method proposed in InteractE to randomly permute and combine the head embedding (e_h) and relation embedding (e_r) into a matrix:

v_1 = φ_chk(e_h, e_r),

where φ_chk denotes the Chequer method and v_1 ∈ R^{20×20}. We apply the Chequer method four times to obtain four different matrices of the same size, denoted v_1, v_2, v_3, and v_4. The Chequer method has been shown to be effective in increasing the interaction between head and relation embeddings [11]. Next, the four matrices are treated as the four channels of the input to the first dynamic convolutional layer (DConv1). We first use a 3 × 3 dynamic convolution (proposed in Section 3.3) to capture interaction features and assign weights of varying importance to them, which can be formulated as:

v^1_{3×3} = r([v_1 || v_2 || v_3 || v_4] * Ω_1),

where || denotes the concatenation operation, r denotes the ReLU activation function [23], * denotes the convolution operation, [v_1 || v_2 || v_3 || v_4] ∈ R^{20×20×4}, Ω_1 is the parameter of the first dynamic convolutional layer (DConv1), and v^1_{3×3} ∈ R^{20×20×64} denotes the interaction features produced by DConv1.
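The checkerboard arrangement at the heart of the Chequer reshaping can be sketched as follows. This is a simplified illustration (one random permutation per embedding, alternating cells), not InteractE's exact implementation:

```python
import numpy as np

def chequer_reshape(e_h, e_r, rows, cols, seed=0):
    """Interleave head and relation embedding entries in a checkerboard
    pattern after a random permutation (simplified Chequer sketch).
    Requires rows * cols == len(e_h) + len(e_r)."""
    rng = np.random.default_rng(seed)
    e_h, e_r = rng.permutation(e_h), rng.permutation(e_r)
    m = np.empty((rows, cols))
    ih = ir = 0
    for i in range(rows):
        for j in range(cols):
            if (i + j) % 2 == 0:        # even cells take head entries
                m[i, j] = e_h[ih]; ih += 1
            else:                       # odd cells take relation entries
                m[i, j] = e_r[ir]; ir += 1
    return m

# Two 200-dimensional toy embeddings reshaped into a 20 x 20 matrix.
m = chequer_reshape(np.arange(200.0), np.arange(200.0) + 1000.0, 20, 20)
```

Calling the function with different seeds yields the four differently permuted matrices v_1, ..., v_4 used as input channels.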
Second, we use two 3 × 3 dynamic convolutions to capture high-level interaction features and assign weights of varying importance to them:

v^2_{3×3} = r(v^1_{3×3} * Ω_2),  v_{3×3} = r(v^2_{3×3} * Ω_3),

where v^1_{3×3} is the input interaction features, Ω_2 is the parameter of the second dynamic convolutional layer (DConv2), Ω_3 is the parameter of the third dynamic convolutional layer (DConv3), r denotes the ReLU activation function [23], and v_{3×3} ∈ R^{20×20×256} denotes the output interaction features from the third dynamic convolutional layer (DConv3).
Then, the final output interaction features v_{3×3} are flattened to 102,400 units, and a fully connected layer is applied to obtain the predicted embedding for the given query:

ê_t = r(vec(v_{3×3}) W_1),

where vec(·) denotes flattening, W_1 ∈ R^{102,400×200} is the parameter of the fully connected layer, and r denotes the ReLU activation function [23]. Finally, in order to train the model, we need to sample some negative samples. First, we use our method of mining hard negative samples to collect a small number of hard negative tail entities θ_1 for the valid triple (h, r, t). Second, we use the normal negative sampling method to collect a large number of negative tail entities θ_2. We thus obtain a negative tail entity set θ = θ_1 ∪ θ_2. The label of t is 1, and the label of every tail entity in θ_1 and θ_2 is 0. Then, we multiply the predicted tail embedding ê_t with the valid tail embedding e_t and with the tail embeddings in the negative sample set θ to obtain the logits. We feed the logits into the loss function, aiming to make the logit of the valid tail embedding e_t larger and the logits of the tail embeddings obtained from negative sampling θ smaller. The number of negative samples and the hyperparameter settings of our model are given in Section 4.2.
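The final scoring step, multiplying the predicted tail embedding with each candidate tail embedding to obtain logits, amounts to a matrix-vector product; a small NumPy sketch with toy embeddings (all values random, for illustration only):

```python
import numpy as np

def score_tails(pred_tail, tail_embeddings):
    """Logits: dot product of the predicted tail embedding with every
    candidate tail embedding (one candidate per row)."""
    return tail_embeddings @ pred_tail

rng = np.random.default_rng(0)
pred = rng.normal(size=5)                            # toy predicted embedding
tails = np.vstack([pred, rng.normal(size=(3, 5))])   # row 0 matches exactly
logits = score_tails(pred, tails)
```

Each logit is then pushed up or down by the loss depending on whether its candidate is the valid tail or a negative sample.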

Dynamic Convolution
Most previous convolutional network models, like InteractE [11] and ConvE [12], use traditional convolutional filters to capture interaction features between the head and relation embeddings, and then treat these interaction features uniformly. However, we argue that different interaction features should contribute differently to the experimental results. In this paper, we are inspired by the concept of dynamic convolution, which was proposed in [13] and has been used in image processing. We propose to employ dynamic convolution to assign weights of varying importance to interaction features. Different from [13], which computes weights over convolutional kernels, we compute weights over convolutional filters (output channels).
The goal of dynamic convolution is to learn a group of filter weights that assign varying importance to filters. We illustrate the overall framework of dynamic convolution in Figure 2, and Table 2 presents the parameters of one dynamic convolutional layer, which has C_out filters and a kernel size of s_1 × s_2. Dynamic convolution builds upon a transformation mapping an input X ∈ R^{H×W×C_in} to feature maps (interaction features) Y ∈ R^{H′×W′×C_out}. In the notation that follows, we use V = [conv_1, ..., conv_{C_out}] to denote the learned set of convolutional filters, where conv_c ∈ R^{s_1×s_2×C_in} refers to the parameters of the c-th filter and s_1, s_2 indicate the size of the convolutional kernel. For a traditional convolutional layer with input X ∈ R^{H×W×C_in}, the output U = [u_1, u_2, ..., u_{C_out}] is:

u_c = conv_c * X = Σ_{s=1}^{C_in} conv_c^s * x^s,

where * represents the convolution operation and u_c ∈ R^{H′×W′}. conv_c^s ∈ R^{s_1×s_2} is a 2D kernel representing a single channel of conv_c that acts on the corresponding channel x^s of X. To simplify the notation, bias terms are omitted. For a dynamic convolutional layer, we apply squeeze-and-excitation [24] to compute the filter weights. First, the input X ∈ R^{H×W×C_in} is squeezed by an average pooling layer into X_1 ∈ R^{1×1×C_in}, such that the a-th element of X_1 is

X_1(a) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_a(i, j),

where x_a ∈ R^{H×W} is the a-th channel of X. Then, we use two 1 × 1 convolutions (with a ReLU between them) and a softmax to generate normalized weights α = [α_1, α_2, ..., α_{C_out}] for the convolutional filters V:

α = σ(V_2 * r(V_1 * X_1)),

where α ∈ R^{1×1×C_out}, V_1 is the parameter of Conv1, V_2 is the parameter of Conv2, r denotes the ReLU activation function [23], and σ denotes the softmax function. Finally, the output Y = [y_1, y_2, ..., y_{C_out}] of the dynamic convolutional layer can be formulated as:

y_c = α_c (conv_c * X),

where * denotes convolution, y_c ∈ R^{H′×W′} is the c-th element of Y, and α_c ∈ R^{1×1} is the c-th element of α. Therefore, the essence of dynamic convolution is to assign input-dependent weights to the convolutional filters, which allows the network to pay more attention to the more important interaction features instead of the unimportant ones.
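The squeeze-and-excitation weighting can be sketched in NumPy. Since a 1 × 1 convolution applied to a 1 × 1 × C feature map reduces to a matrix product, the matrices V1 and V2 below stand in for Conv1 and Conv2; all shapes are illustrative, not the paper's trained parameters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_filter_weights(X, V1, V2):
    """Squeeze-and-excitation style weights over C_out filters (sketch).
    X: (H, W, C_in); V1: (C_in, C_mid); V2: (C_mid, C_out)."""
    x1 = X.mean(axis=(0, 1))            # squeeze: global average pooling
    mid = np.maximum(x1 @ V1, 0.0)      # first 1x1 conv + ReLU
    return softmax(mid @ V2)            # normalized filter weights alpha

def weighted_conv_output(U, alpha):
    """Scale each feature map u_c of a plain convolution by alpha_c."""
    return U * alpha                    # U: (H', W', C_out), broadcast

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 20, 4))        # toy input with 4 channels
alpha = dynamic_filter_weights(X, rng.normal(size=(4, 8)),
                               rng.normal(size=(8, 64)))
```

The weights are non-negative, sum to one, and change with the input, which is what makes the convolution "dynamic".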

Mining Hard Negative Samples
Negative sampling generates negative triples by corrupting valid triples. Let K = {(h_i, r_i, t_i) | i = 1, 2, ..., N} denote all valid triples in the knowledge graph, T+ denote all entities in the knowledge graph, and T_i = {(h_i, r_i, t_a) | t_a ∈ T+, (h_i, r_i, t_a) ∉ K} denote the negative triples obtained by corrupting the tail of (h_i, r_i, t_i). We divide the negative triples T_i into two grades. If the model easily distinguishes a negative triple (h_i, r_i, t_a) ∈ T_i from the valid triple (h_i, r_i, t_i), we call it an easy negative sample; training with a large number of easy negative samples does little to improve the model. If the model can hardly distinguish a negative triple (h_i, r_i, t_b) ∈ T_i from the valid triple (h_i, r_i, t_i), we call it a hard negative sample. Hard negative samples increase the difficulty of training and force the model to pay more attention to details. Figure 3 shows the process involved in our method of mining hard negative samples. In order to select hard negative samples effectively, we mine them through the model itself. Given a valid triple (h_i, r_i, t_i), we first randomly select k negative triples from T_i to obtain N_i = {(h_i, r_i, n_c) | c = 1, ..., k, (h_i, r_i, n_c) ∈ T_i}. Next, we use the valid triple (h_i, r_i, t_i) and N_i to train the initial model for one epoch and obtain a trained model M_1. Then, we use M_1 to predict the missing tail entity in (h_i, r_i, ?) and obtain logits over all entities T+ for this tail position. We use the logits to sort the candidate triples from high to low: (h_i, r_i, e_1), (h_i, r_i, e_2), ..., (h_i, r_i, e_{|T+|}). Triples ranked higher than r_s are likely to be valid triples, triples ranked between r_s and r_t are hard negative triples (hard negative samples), and triples ranked lower than r_t are easy negative triples (easy negative samples).
Therefore, we obtain the set of hard negative samples P_i = {(h_i, r_i, e_c) | c = r_s, r_s + 1, ..., r_t, (h_i, r_i, e_c) ∈ T_i}. The hard negative samples obtained from the current epoch's model are used to train the model in the next epoch: we merge P_i into the negative samples of (h_i, r_i, t_i) in the next training epoch. This operation is then repeated with the model obtained from each previous epoch of training.
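The rank-window selection can be sketched as follows, with hypothetical entity logits; r_s and r_t are the 1-indexed rank bounds from the text:

```python
def mine_hard_negatives(entity_logits, valid_tail, r_s, r_t):
    """Rank entities by logit (descending) and keep those whose 1-indexed
    rank falls in [r_s, r_t] as hard negatives. Entities ranked above r_s
    may be valid; entities below r_t are easy negatives."""
    ranked = sorted(entity_logits, key=lambda e: -entity_logits[e])
    return [e for rank, e in enumerate(ranked, start=1)
            if r_s <= rank <= r_t and e != valid_tail]

# Hypothetical logits for five candidate tails of one (h, r, ?) query.
logits = {"a": 9.1, "b": 7.4, "c": 5.0, "d": 2.2, "e": 0.3}
hard = mine_hard_negatives(logits, valid_tail="a", r_s=2, r_t=4)
```

In this toy run, the top-ranked entity is treated as potentially valid, the bottom entity as an easy negative, and the middle window supplies the hard negatives.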

Training Objective
For training the model parameters, we apply the logistic sigmoid to the logits of the scores of (h, r, t), and use the Adam optimizer [25] to train DyConvNE by minimizing the following binary cross-entropy loss:

L = −(1/N) Σ_{i=1}^{N} [l_i log(p_i) + (1 − l_i) log(1 − p_i)],

where p is the prediction and l is the label. For valid triples, the label is defined as 1; for negative triples, it is 0.
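A minimal Python version of this loss, assuming the logits have already been computed by the network:

```python
import math

def bce_loss(logits, labels):
    """Binary cross-entropy over sigmoid-activated logits, averaged over
    all (triple, label) pairs."""
    total = 0.0
    for z, l in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))               # sigmoid
        total -= l * math.log(p) + (1 - l) * math.log(1 - p)
    return total / len(logits)

# One valid triple (label 1) and two negatives (label 0).
good = bce_loss([5.0, -5.0, -5.0], [1, 0, 0])   # confident and correct
bad = bce_loss([-5.0, 5.0, 5.0], [1, 0, 0])     # confident and wrong
```

Confidently correct predictions yield a near-zero loss, while confidently wrong ones are penalized heavily, which drives valid logits up and negative logits down.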

Experiments
In this section, we apply two public datasets, FB15k-237 and WN18RR, to validate the effectiveness of our proposed DyConvNE model for knowledge graph completion. Firstly, we give a detailed description of the datasets in Section 4.1 and the experimental setup of our model in Section 4.2. Then, we compare our model with others to demonstrate its better performance in Section 4.3, and we show the experimental results of our specific-relationship-testing method in Section 4.4. Finally, we present a case study in Section 4.5, an ablation study in Section 4.6, and a hard negative sampling study in Section 4.7.

Datasets
To evaluate our proposed approach, we used two benchmark datasets: WN18RR [12] and FB15k-237 [26]. WN18RR and FB15k-237 are subsets of WN18 and FB15k, respectively, obtained by removing inverse relations. Previous works suggest that the task of knowledge graph completion on WN18 and FB15k suffers from the problem of inverse relations, whereby one can achieve state-of-the-art results using a simple reversal rule-based model [11]. The subsets WN18RR and FB15k-237 were therefore created to resolve this problem. Table 4 shows the summary statistics of the datasets. The WN18RR dataset has 40,943 entities, 11 relations, and 86,835 triples. The FB15k-237 dataset has 14,541 entities, 237 relations, and 272,115 triples.

Experimental Setup
In the current epoch of model training, we use the model trained in the previous epoch to sample a certain number of hard negative samples for each valid triple (h, r, t). We obtain the hard negative sample set θ_1 for the head entity and the hard negative sample set θ_2 for the tail entity. Then, we use the normal negative sampling method to collect a certain number of negative head entities θ_3 and negative tail entities θ_4. We replace the head entity and tail entity in the original triple (h, r, t) with the sampled entities to obtain the two replaced triple sets (n_h, r, t), where n_h ∈ {h} ∪ θ_1 ∪ θ_3, and (h, r, n_t), where n_t ∈ {t} ∪ θ_2 ∪ θ_4. The triple sets are put into the model for training, and label smoothing with a parameter of 0.1 is used for the training labels. The parameters of our model are selected via grid search according to the MRR on the validation set. We select the initial learning rate from {0.001, 0.01, 0.1, 0.0002, 0.0001}, the dimensionality of embedding from {100, 200}, and the batch size from {64, 128, 256, 512}.
As shown in Table 5, for the FB15k-237 dataset, we use the Adam optimizer [25] with an initial learning rate of 0.001 and an embedding dimensionality of 200. The learning rate was decayed by 0.005 every 150 epochs. We trained the model for up to 500 epochs with a batch size of 128 and 1070 negative samples per triple, including 70 hard negative samples. The selection range of hard negative samples was from the 30th to the 100th among the ranked entities. For the WN18RR dataset, the initial learning rate was 0.001 and the embedding dimensionality was 200. The learning rate was likewise decayed by 0.005 every 150 epochs. We trained the model for up to 500 epochs with a batch size of 256 and 5180 negative samples per triple, including 180 hard negative samples. The selection range of hard negative samples was from the 20th to the 200th among the ranked entities.

Main Results
We follow the "Filtered" setting protocol [19] to evaluate our model, i.e., when ranking a test triple we exclude all other valid triples that appear in the training, validation, and test sets. We use Mean Reciprocal Rank (MRR), Mean Rank (MR), Hits@1, and Hits@10 to evaluate our model. Note that lower MR, higher MRR, and higher Hits@1 or Hits@10 indicate better performance.
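Given the filtered rank of the valid entity for each test query, the metrics are straightforward to compute; a small sketch with made-up ranks:

```python
def mrr_and_hits(ranks, k=10):
    """MRR and Hits@k from a list of filtered 1-indexed ranks, one per
    test query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

# Hypothetical filtered ranks for four test queries.
mrr, hits10 = mrr_and_hits([1, 2, 5, 20])
```

With ranks 1, 2, 5, and 20, MRR is (1 + 1/2 + 1/5 + 1/20)/4 = 0.4375 and Hits@10 is 0.75, since three of the four ranks fall within the top 10.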
We compare our results with various advanced methods: TransE [19], DistMult [20], ComplEx [21], ConvE [12], ConvKB [22], CACL [27], SACN [28], and InteractE [11]. The experimental results are summarized in Table 6. As shown in Table 6, DyConvNE gains significant improvements on FB15k-237 and WN18RR. On the FB15k-237 dataset, the MRR of our model is 0.358, Hits@1 is 26.5, Hits@10 is 54.2, and MR is 181; our model obtains the best result on Hits@1 and Hits@10, and the second-best result on MRR and MR. On the WN18RR dataset, the MRR of our model is 0.474, Hits@1 is 43.5, Hits@10 is 55.2, and MR is 4531; our model obtains the best result on Hits@1, MRR, and Hits@10, and a comparable result on MR. We conduct ablation experiments in Section 4.6 to further demonstrate the effectiveness of our proposed dynamic convolution and of mining hard negative samples.

Specific Relationship Testing
We experimented with our specific-relationship-testing method on FB15k-237 and WN18RR; the results are summarized in Table 7. The traditional testing method replaces either h or t with every other entity in ε to create a set of invalid triples for each valid test triple (h, r, t), and then ranks the valid test triple (h, r, t) and the invalid triples by their scores. If the valid test triple ranks first, the prediction is correct, and the total accuracy rate (Hits@1) increases. Our specific-relationship-testing method considers the entity pairs connected by different relations to be domain-specific. For example, if the relation r is place-lived, its tail entity must be a place name and not a person's name, age, etc. Therefore, in our specific-relationship-testing method, we replace either h_1 or t_1 only with entities that appear as the head entity or tail entity of triples with relation r_1 in the training set, to create the set of invalid triples for each valid test triple (h_1, r_1, t_1). Then, we rank the valid test triple (h_1, r_1, t_1) and the invalid triples by their scores. Special attention is required when using the specific-relationship-testing method: only Hits@1 can be compared with the traditional testing method, because Hits@1 represents the accuracy rate, while the other evaluation metrics (MR, MRR, and Hits@10) are not comparable because the number of invalid triples changes. In the experiment, we used the same trained model with different testing methods to predict the Hits@1 of FB15k-237 and WN18RR; DyConvNE uses the traditional testing method, and DyConvNE-SR uses the specific-relationship-testing method. As shown in Table 7, DyConvNE-SR obtained the highest Hits@1 on both WN18RR and FB15k-237, with improvements over DyConvNE of 2.8% in Hits@1 on WN18RR and 1.9% on FB15k-237.
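The candidate-set restriction can be sketched as a simple lookup over the training triples, with hypothetical triples for illustration:

```python
def relation_range(train_triples, relation, position="tail"):
    """Entities observed in the given position (head or tail) for this
    relation in training; used as the candidate set at test time."""
    idx = 2 if position == "tail" else 0
    return {tr[idx] for tr in train_triples if tr[1] == relation}

# Hypothetical training triples.
train = [("alice", "place_lived", "london"),
         ("bob", "place_lived", "paris"),
         ("alice", "profession", "engineer")]
cands = relation_range(train, "place_lived", position="tail")
```

For the query (?, place_lived, ?) only place-like entities observed with that relation survive, which is precisely why entities of the wrong type can no longer outrank the target.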

Case Study
To further analyze how the specific-relationship-testing method contributes to knowledge graph completion, we give two examples in Table 8. For the query (Pixar Animation Studios, artist, ?), the target (Randy Newman) ranks fourth under the traditional testing method and first under our method. The traditional testing method ranks John A. Lasseter, Pete Docter, and Andrew Stanton in first, second, and third place, but these are an animator, a director, and a screenwriter, respectively, not artists. Under the specific-relationship-testing method, all predictions for the query come from a set of entities of the type "artist". For the second example, (?, profession, theatrical producer), the target (Emanuel "Manny" Azenberg) ranks second under the traditional testing method and first under our method. The traditional testing method ranks The Shubert Organization first, but it is not a person and has no profession attribute. Under our method, all predictions for the query come from a set of entities of the type "person". These examples clearly show how our specific-relationship-testing method benefits knowledge graph completion.

Ablation Study
In the ablation experiments, performed on FB15k-237 and WN18RR, the hyperparameters of DyConvNE, such as the learning rate, remained unchanged. First, we removed the method of mining hard negative samples from DyConvNE and replaced the dynamic convolution in the model with traditional convolution, obtaining a baseline model, DyConvNE-conv. As shown in Table 9, the MR of DyConvNE-conv on FB15k-237 is 186, the MRR is 0.353, and the Hits@10 is 53.9; on WN18RR, the MR is 5455, the MRR is 0.44, and the Hits@10 is 51.6. Then, we removed only the method of mining hard negative samples and retained the dynamic convolution, obtaining the model DyConvNE-dyconv. This model outperforms DyConvNE-conv in MR, MRR, and Hits@10 on both FB15k-237 and WN18RR, which demonstrates the effectiveness of the dynamic convolution operation. Finally, our full model, DyConvNE-dyconv-neg, which includes both dynamic convolution and mining hard negative samples, was compared with DyConvNE-dyconv. The MR dropped by 4, the MRR increased by 0.002, and the Hits@10 increased by 0.2% on FB15k-237; DyConvNE-dyconv-neg also achieved better results on WN18RR. The ablation study shows that mining hard negative samples makes the model perform better.

Hard Negative Sampling Study
We present the results of our model on FB15k-237 in terms of Hits@10 in Figure 4 for rank ranges (r_s-r_t) ∈ {0th-0th, 10th-100th, 20th-100th, 30th-100th, 40th-100th}, and on WN18RR in Figure 5 for rank ranges (r_s-r_t) ∈ {0th-0th, 10th-200th, 20th-200th, 30th-200th, 40th-200th}. The results show that different rank ranges (r_s-r_t) have different effects on the two datasets. The range 0th-0th denotes that the model uses only the traditional negative sampling method, which is equivalent to the DyConvNE-dyconv model in Section 4.6. Figures 4 and 5 show that a suitable rank range for mining hard negative samples achieves better performance than the traditional negative sampling method. Our model achieved the best results on FB15k-237 when (r_s-r_t) was set to 30th-100th, and the best results on WN18RR when (r_s-r_t) was set to 20th-200th.

Conclusions
In this paper, we propose a new knowledge graph completion model named DyConvNE. Our model first employs dynamic convolution instead of traditional convolution to assign weights of varying importance to interaction features. Then, our model uses a new method of generating negative samples (mining hard negative samples) to increase the difficulty of model training. We demonstrated the effectiveness of our model through experiments: the Hits@10 of our model reached 55.2% on the WN18RR dataset and 54.2% on the FB15k-237 dataset, better than previous knowledge graph completion models. Finally, we propose a specific-relationship-testing method, which reduces the number of candidate entities at test time and improves Hits@1 by about 2% on WN18RR and FB15k-237. In the future, we intend to extend our method by introducing attribute information of entities and relations into the knowledge graph embedding to improve its accuracy.
Author Contributions: H.P., theoretical study, analysis, findings, manuscript writing; Y.W., design review, manuscript review, and supervision. All authors discussed the results and contributed to the final manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Data Availability Statement: All the benchmarking datasets used in this study can be downloaded using the following URL: https://figshare.com/articles/dataset/KG_datasets/14213894 (accessed on 27 December 2021).

Conflicts of Interest:
The authors declare no conflict of interest.