Convolutional Adaptive Network for Link Prediction in Knowledge Bases

Abstract: Knowledge bases (KBs) have become an integral element in digitalization strategies for intelligent engineering and manufacturing. Existing KBs consist of entities and relations and must cope with incompleteness and newly added knowledge. To predict missing information, we introduce an expressive multi-layer link prediction framework, the convolutional adaptive network (CANet), which performs adaptive feature recalibration to improve the method's representational power. In CANet, each entity and relation is encoded into a low-dimensional continuous embedding space, and an interaction operation is adopted to generate multiple specific embeddings. These embeddings are concatenated into input matrices, and an attention mechanism is integrated into the convolutional operation. Finally, we use a score function to measure the likelihood of candidate information and a cross-entropy loss function that speeds up computation by reducing the number of convolution operations. On five real-world KBs, the experimental results indicate that the proposed method achieves state-of-the-art performance.


Introduction
Recently, knowledge bases (KBs) have become a key component in a variety of applications, including recommendation systems [1,2], engineering technologies [3,4], and intelligent conversation robots [5,6]. To help produce new information for these knowledge-driven applications, researchers have built a variety of KBs [7][8][9] over the last several decades. KBs are intelligible domain engines that store the structural knowledge of human understanding. Furthermore, they easily integrate various data sources and give a structured semantic representation for computer analysis and knowledge inference.
KBs are multi-relational graphs that depict a large amount of concrete information. Different entities are represented by nodes, and the relations between entities are connected by edges, as seen in Figure 1. All components are grouped in the form of triplets, such as (head entity, relation, tail entity). For example, we know that Mercedes Benz is located in Germany, which can be represented as (Mercedes Benz, Located_in, Germany). Although many KBs now comprise billions of relations, entities, and triplets, they also have to deal with incompleteness and newly acquired real-world information.
To address the above concerns, studies have focused on link prediction, which attempts to forecast missing facts in knowledge bases. Existing link prediction methods are commonly described as knowledge graph embedding (KGE) [10] methods. KGE learns how to represent relations and entities as embeddings. Previously, the embedding representation of a relation was mainly treated as a translational distance or a semantic match between entities. These approaches are simple and feasible, and they have been demonstrated to be successful in prediction tasks. However, compared to deep, multi-layer architectures, they learn less expressive features, which theoretically restricts their performance. A typical means of increasing the features and expressiveness of existing methods is to increase the embedding size. However, the total number of embedding parameters is then proportional to the number of relations and entities, so this approach does not scale to large KBs. To increase a method's expressiveness independently of the embedding size, recent studies have proposed several multi-layer methods [11][12][13] and reported commendable results on established datasets. Having adopted fully connected feature layers, however, these methods are prone to overfitting. One way to address the scaling and overfitting problems is to use parameter-efficient and fast operators that can be integrated into deep networks.
Convolutional neural networks (CNNs) have grown into a powerful technique among the many available technologies due to their significant performance in speech recognition, computer vision, and natural language processing. Based on CNNs, artificial intelligence has made significant breakthroughs in many fields [14,15]. The convolution operator is parameter-efficient and achieves rapid computation with highly optimized GPU implementations. Moreover, with the extensive usage of multiple robust techniques, overfitting can be efficiently controlled when training convolutional networks.
To provide a highly articulate link prediction system that effectively converges the advantages of CNNs, this paper proposes a convolutional adaptive network (CANet) method for KGE. First, all entities and relations are embedded into real-valued low-dimensional vectors, and an interaction operation is suggested to convert the initialized embeddings of the head entity and relation into multiple triplet-specific embeddings. Then, the specific embeddings of the head entity and relation are concatenated into 2D input matrices. In the convolution layer, CANet learns features that are different between the input matrices and allows the network to perform feature recalibrations. Finally, the feature map tensors are linearized and converted into the entity embedding space, and an inner product is calculated with the tail entity embedding to evaluate the correct triplet.
The following are the contributions of this paper:
• A new link prediction method called CANet for learning the embedding representation of KBs is proposed. As a multi-layer network method, CANet takes advantage of the parameter-efficient and fast operators of CNNs, thereby generating additional expressive features and addressing the overfitting problem for large KBs.
• An interaction operation is adopted to generate multiple triplet-specific embeddings for relations and entities. Thus, for various triplet predictions, the operation can simulate the selection of different information. Furthermore, multiple interaction embeddings provide rich representations and generalization capabilities.
• To increase the representational capacity of CANet, an attention function is inserted into the convolutional operation. This functionality enables the network to recalibrate features adaptively. By learning to use global content, CANet can selectively accentuate informative features while suppressing less useful ones.
• To test the efficiency of our proposed approach, five real-world datasets were used in our experiments. In comparison with several previous approaches, the results show that CANet achieves state-of-the-art performance on the general assessment metrics.
The remainder of the article is structured as follows: In Section 2, we review and discuss related works. In Section 3, we define the research problem systematically and describe CANet in detail. The experimental setup and results of the proposed method are detailed in Sections 4 and 5. Finally, the conclusions and suggestions for future research are presented in Section 6.

Related Works
The use of KGE methods to predict incomplete or incorrect triplets in KBs has been extensively studied in recent years. KGE aims to convert relation and entity semantics into continuous low-dimensional embedding spaces. These embeddings are then used to predict links. The previous approaches can be categorized as translational distance methods and semantic matching methods. In recent years, some multi-layer network methods have also been proposed, and they have achieved impressive link prediction performance.

Translational Distance Methods
The TransE [16] method is the most well-known translational distance method. For a given triplet (head entity, relation, tail entity), denoted as (h, r, t), TransE regards the relation as a translation from the head entity to the tail entity and assumes that h + r ≈ t, provided that the triplet is correct. The embedding representations are denoted by the bold letters h, r, and t, respectively. Despite its simplicity and speed, TransE achieves outstanding link prediction performance in comparison with previous methods. However, the method has issues when modeling complex relations, including N-1, 1-N, and N-N relations.
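The translational assumption above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the embeddings are random placeholders and the L2 distance is one common choice of dissimilarity.

```python
import numpy as np

# Minimal sketch of the TransE scoring idea: h + r ≈ t for correct triplets.
# All vectors here are illustrative random placeholders.
rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)
r = rng.normal(size=d)
t = h + r + rng.normal(scale=0.01, size=d)   # near-correct tail entity
t_bad = rng.normal(size=d)                   # random (incorrect) tail entity

def transe_score(h, r, t):
    # Lower is better: distance between the translated head and the tail.
    return float(np.linalg.norm(h + r - t))

# The correct tail is much closer to h + r than a random one.
assert transe_score(h, r, t) < transe_score(h, r, t_bad)
```

A real implementation would learn the embeddings by minimizing this distance for correct triplets against corrupted ones.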
To address this issue, other strategies have been designed to split entities and relations into separate subspaces. For example, TransH [17] posits that entities should exhibit various characteristics in contrast to one another, and it projects the head and tail entities onto a hyperplane associated with the relation. Furthermore, TransR [18] considers that the embeddings of relations and entities should lie in separate spaces. It projects entities into the relation space by generating a projection matrix associated with each relation. TransD [19] further learns two projection matrices to project the head and tail entities separately. Moreover, these matrices are initialized diagonally to reduce the number of method parameters. TorusE [20] embeds KB relations and entities in a Lie group. The method points out that regularization is unnecessary during the training process if the embedding space is a compact Lie group. However, results on several datasets showed that translational distance methods yield only weak improvements in modeling complex relations. Furthermore, these methods learn less expressive features, thus potentially limiting their performance.

Semantic Matching Methods
By matching latent semantics, semantic matching methods calculate the likelihood of a triplet. The typical method, DistMult [21], uses a bilinear objective to learn embeddings. It computes a bilinear product h^T M_r t between the embeddings of a head entity, a tail entity, and a diagonal matrix for the relation, where h, t ∈ R^d and M_r ∈ R^(d×d). DistMult is a parameter-efficient method that eliminates redundant operations. However, the two triplets (h, r, t) and (t, r, h) receive the same product result.
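The symmetry limitation of DistMult follows directly from its score: with a diagonal M_r, the bilinear product reduces to an element-wise product that is invariant to swapping head and tail. A short sketch with placeholder embeddings:

```python
import numpy as np

# DistMult scores a triplet as h^T diag(r) t, i.e., sum(h * r * t).
# The embeddings below are illustrative random placeholders.
rng = np.random.default_rng(1)
d = 8
h, r, t = (rng.normal(size=d) for _ in range(3))

def distmult_score(h, r, t):
    return float(np.sum(h * r * t))

# The symmetry issue: swapping head and tail gives the same score,
# so DistMult cannot distinguish (h, r, t) from (t, r, h).
assert np.isclose(distmult_score(h, r, t), distmult_score(t, r, h))
```

ComplEx resolves this by moving to complex-valued embeddings, where conjugating the tail makes the score asymmetric.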
To address this issue, ComplEx [22] applies DistMult to complex space and, by taking the conjugate of the tail entity's embedding, handles asymmetric relations. HolE [23] generates expressive and robust compositional representations by using circular correlations of embeddings. ANALOGY [24] further exploits the analogical properties of relations and entities, requiring that the linear maps of relations be normal and mutually commutative. However, semantic matching methods have more redundancy than translational distance methods. Hence, the former are sensitive to overfitting, which becomes a challenge when KBs have a large number of relations and entities. As a result, a high-dimensional space is needed to completely embed and separate these entities and relations.

Multi-Layer Network Methods
The usage of multi-layer networks for link prediction has attracted increasing attention in recent studies. ProjE [11] uses brief but effective shared-variable networks to predict entities. Instead of measuring the distance between the components of the input triplet, ProjE projects candidate entities onto a target vector representing the input data and orders the vector values in descending order. R-GCN [12], which can be considered an autoencoder, was proposed for working with highly multi-relational data. The encoder generates latent feature representations of relations and entities, and the decoder uses these representations to predict entities. The graph convolutional network structure, however, is restricted to undirected graphs, whereas KBs are inherently directed. By combining recurrent neural networks with residual learning, RSNs [25] can effectively capture long-term relational dependencies within and between KBs, bridging the gaps between entities with a skipping mechanism. The experimental results provide strong evidence that these multi-layer network methods learn more features than previous methods and can effectively enhance performance.
Convolutional neural networks, which are commonly used in computer vision, have the advantages of parameter-efficient and fast operators and can be integrated into deep networks. In this paper, we present a highly efficient CNN-based approach for link prediction in KBs. As a result, we can capture a wide range of potential attributes of entities and relations. Compared to shallow architectures, the proposed approach greatly boosts performance while requiring only a small increase in computing resources.

Proposed Method
The majority of the notations and definitions in this paper are presented first, followed by an explanation of the link prediction task. After that, a detailed explanation of the proposed method is given. Finally, the method for training and optimization is discussed.

Problem Formulation
Formally, the knowledge base can be expressed as G = {E, R, T}, where E = {e_1, e_2, ..., e_|E|} and R = {r_1, r_2, ..., r_|R|} denote the sets of entities and relations. The numbers of relations and entities are denoted by |R| and |E|, respectively. T ⊆ E × R × E denotes the set of triplets, and a triplet can be expressed as (h, r, t). The bold symbols h, r, t ∈ R^d represent the d-dimensional vectors of h, r, t. R ∈ R^(|R|×d) and E ∈ R^(|E|×d) denote the embedding matrices of relations and entities. The goal of link prediction is to forecast one entity based on another entity and a specific relation; for example, using a head entity and relation (h, r) to predict a tail entity t. To achieve this, a score function ψ(h, r, t) ∈ R is defined for each triplet. Most optimizations are intended to make correct triplets score higher than false triplets.

Outline of CANet
As shown in Figure 2, the CANet is a multi-layer convolutional network consisting of embedding combination, encoding, and scoring components. The vector of the embeddings of h and r is extracted by the encoding component. The score function ranks the feature vector and embedding of t in the scoring part. A summary of the details is given in the following.

Look Up:
We must first look up the general embeddings of an input triplet (h, r, t) in the relation matrix R and entity matrix E. These embeddings can be expressed as

h = i_h E, r = i_r R, t = i_t E,

where i_h, i_r, and i_t are the one-hot index vectors of h, r, and t.
Interaction and Concatenation: To generate multiple triplet-specific embeddings for each relation and entity, we present an interaction embedding ĩ_r, a learned d-dimensional vector associated with the relation. An interaction operation is defined to simulate the interaction effect of entities and relations as follows:

h_i = h • ĩ_r, r_i = r • ĩ_r,

where • is the element-wise operator. We call h_i and r_i the interaction embeddings of h and r. They provide generalization capabilities and rich representations. Then, the input matrix M can be obtained as

M = h_i ⊕ r_i,

where ⊕ denotes the concatenation operation that stacks the two embeddings into a 2D matrix.
Convolutional Operation: Next, N separate kernels K = [k_1, k_2, ..., k_N] ∈ R^(N×h×w) execute a convolution over the input matrix M. The feature maps F = [f_1, f_2, ..., f_N] ∈ R^(N×H×W) are generated as

f_n = M ∗ k_n, n = 1, 2, ..., N,

where ∗ denotes the convolution operation. To preserve the translational property, we set the height h of the convolutional kernels to 2 in the experiments.
Attention Mechanism: Each kernel k_n operates on a local receptive field. As a result, each output feature map f_n is unable to use contextual information outside of its local receptive field.
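The embedding combination and convolution steps above can be sketched as follows. This is an illustrative numpy mock-up, not the paper's implementation: the embeddings and kernels are random placeholders, and a hand-rolled valid 2D convolution stands in for an optimized library call.

```python
import numpy as np

# Sketch of CANet's input pipeline: element-wise interaction with a
# relation-specific embedding, concatenation into a 2 x d matrix, then
# a valid 2D convolution with N kernels of height 2.
rng = np.random.default_rng(2)
d = 10
h, r = rng.normal(size=d), rng.normal(size=d)
i_r = rng.normal(size=d)               # interaction embedding of the relation

h_i, r_i = h * i_r, r * i_r            # element-wise interaction
M = np.stack([h_i, r_i])               # concatenation -> 2 x d input matrix

def conv2d_valid(M, k):
    # Plain valid cross-correlation of matrix M with kernel k.
    kh, kw = k.shape
    H, W = M.shape[0] - kh + 1, M.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(M[i:i+kh, j:j+kw] * k)
    return out

N = 4                                   # number of kernels
kernels = rng.normal(size=(N, 2, 5))    # height 2 keeps h and r aligned
F = np.stack([conv2d_valid(M, k) for k in kernels])
assert F.shape == (N, 1, d - 5 + 1)     # N x H x W feature maps
```

Because the kernel height equals the number of stacked embeddings, each kernel always sees one slice of h_i and one of r_i together, which is the translational alignment the paper refers to.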
To alleviate this problem, we compress the global spatial information into a set of channel descriptors c = [c_1, c_2, ..., c_N] ∈ R^N as follows:

c_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_n(i, j).

Then, the weights w = [w_1, w_2, ..., w_N] ∈ R^N of the channels are learned through a reduction and expansion operation:

w = σ(W_b ReLU(W_a c)),

where σ(x) is the sigmoid function and ReLU(x) represents the rectified linear unit [26]. W_a ∈ R^((N/τ)×N) and W_b ∈ R^(N×(N/τ)) are the weights of the dimensionality reduction and expansion, respectively, and τ is the reduction ratio.
Lastly, the remapping of the feature maps F with channel interdependence can be expressed as

F̃ = F · w, i.e., f̃_n = f_n × w_n, n = 1, 2, ..., N.

This attention mechanism is suggested as a solution to the issue of channel interdependence. It adaptively recalibrates the channel-wise feature responses by explicitly modeling channel interdependence, enhancing the network's capacity.
Projection and Matrix Multiplication: Then, by flattening the remapped feature maps and projecting them into the d-dimensional entity space with a matrix W ∈ R^((N·H·W)×d), the hidden layer is established. The hidden layer vector is multiplied with the tail entity embedding t, and the CANet score function ψ(h, r, t) can be written as

ψ(h, r, t) = vec(F̃) W t,

where vec(x) denotes the flattening operation.
Sigmoid: The plausibility of the triplet (h, r, t) can be written as

p = σ(ψ(h, r, t) + b),

where σ(x) = 1 / (1 + exp(−x)) yields a mathematically interpretable prediction of whether the triplet (h, r, t) is valid, and b is a bias term. The scoring procedure in our method uses the 1-N scoring strategy to reduce the number of convolution operations and accelerate computation. Unlike 1-1 scoring, the pair (h, r) is scored against all tail entities simultaneously. Compared with 1-1 scoring, 1-N scoring allows for a substantial reduction in training and evaluation time while still improving accuracy.
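The 1-N scoring strategy above amounts to replacing N separate dot products with one matrix-vector product against the full entity embedding matrix. A sketch with placeholder values (the feature vector stands in for the projected vec(F̃) W of one (h, r) pair):

```python
import numpy as np

# Sketch of 1-N scoring: one (h, r) feature vector is scored against
# all entity embeddings at once via a single matrix product, then
# squashed through the sigmoid. All values are illustrative placeholders.
rng = np.random.default_rng(4)
num_entities, d = 100, 10
E = rng.normal(size=(num_entities, d))           # entity embedding matrix
feat = rng.normal(size=d)                        # stands in for vec(F~) W
b = 0.0                                          # bias term

scores = 1.0 / (1.0 + np.exp(-(E @ feat + b)))   # plausibility of every tail
assert scores.shape == (num_entities,)
assert np.all((scores > 0) & (scores < 1))
```

One forward pass through the convolutional layers thus yields plausibility scores for every candidate tail entity, instead of one pass per candidate as in 1-1 scoring.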

Training Objective
The observed knowledge base G and all parameters Θ in CANet are used to optimize the probability function

max_Θ p(G | Θ),

where all parameters in our method are represented by Θ. If the triplet (h, r, t) in G is correct, we expect it to have a score of 1; otherwise, it should have a score of 0. The probability function is then described by a Bernoulli distribution:

p(y | (h, r, t); Θ) = p^y (1 − p)^(1−y), in which y = 1 for (h, r, t) ∈ T and y = 0 for (h, r, t') ∈ T', (13)

where T' is a collection of false triplets obtained by corrupting the correct triplet set T. Given Equations (11)-(13), the loss function of CANet can be written as the binary cross-entropy

L = −(1/|T ∪ T'|) Σ [y log p + (1 − y) log(1 − p)].

We utilize the dropout [27] technique at multiple stages, such as on the input matrices and the hidden layer, to regularize training. After each layer, the batch normalization [28] technique is used to stabilize and improve the convergence rate. Furthermore, we adopt the label smoothing [29] technique to improve generalization and reduce overfitting. Finally, the loss function is optimized by the Adam optimizer [30], which is a computationally effective technique for gradient-based optimization.
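The cross-entropy objective and label smoothing can be sketched together. This is an illustrative mock-up: the predicted plausibilities are made up, and the smoothing formula shown (softening hard 0/1 targets toward 0.5) is one common variant rather than necessarily the exact one used in the paper.

```python
import numpy as np

# Sketch of the binary cross-entropy loss over correct (y=1) and
# corrupted (y=0) triplets, plus a simple label-smoothing variant.
def bce_loss(p, y, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)   # guard against log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def smooth(y, epsilon=0.1):
    # Soften hard 0/1 targets toward 0.5 by epsilon (illustrative variant).
    return y * (1.0 - epsilon) + 0.5 * epsilon

p = np.array([0.9, 0.8, 0.2, 0.1])   # predicted plausibilities
y = np.array([1.0, 1.0, 0.0, 0.0])   # 1 for correct, 0 for corrupted triplets

# Correct labels yield a much lower loss than inverted labels.
assert bce_loss(p, y) < bce_loss(p, 1.0 - y)
assert abs(smooth(np.array([1.0]))[0] - 0.95) < 1e-9
```

Smoothed targets penalize the model for becoming overconfident, which is the underfitting/overfitting trade-off examined in the parameter sensitivity experiments.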

Evaluation Metrics
The test set can be denoted by T_test = {∆_1, ∆_2, ..., ∆_|T_test|}. For the i-th test triplet ∆_i, we created potentially incorrect triplets ∆_i^h ∉ T (or ∆_i^t ∉ T) by replacing the head (or tail) entity with any other entity. The method was then evaluated on whether it assigned high scores to correct triplets and low scores to incorrect ones. According to the score function ψ(h, r, t) of our method, the left and right ranks of the i-th test triplet ∆_i were identified by corrupting the head or tail entity:

rank_i^h = 1 + Σ_{∆_i^h ∉ T} P[ψ(h', r, t) > ψ(h, r, t)], rank_i^t = 1 + Σ_{∆_i^t ∉ T} P[ψ(h, r, t') > ψ(h, r, t)],

where P[x] is an indicator function that returns 1 if the condition x holds and 0 otherwise. To evaluate the prediction performance, two general metrics were used, MRR and Hits@k, described as follows: Hits@k is the proportion of ranks smaller than or equal to k, and MRR is the mean reciprocal rank over all test triplets. In all cases, a higher MRR or Hits@k indicates superior performance.
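Given a list of ranks of the correct entities, both metrics are one line each. The ranks below are illustrative values, not experimental results:

```python
import numpy as np

# Sketch of MRR and Hits@k computed from ranks of the correct entities.
def mrr(ranks):
    # Mean of the reciprocal ranks.
    return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

def hits_at_k(ranks, k):
    # Proportion of ranks smaller than or equal to k.
    return float(np.mean(np.asarray(ranks) <= k))

ranks = [1, 2, 5, 10, 50]               # illustrative test-triplet ranks
assert np.isclose(mrr(ranks), np.mean([1, 1/2, 1/5, 1/10, 1/50]))
assert hits_at_k(ranks, 10) == 0.8      # 4 of the 5 ranks are <= 10
```

In practice the ranks are usually computed under the "filtered" setting, excluding corrupted triplets that happen to be valid facts in T, so that a model is not penalized for ranking another true triplet highly.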

Datasets
The extensive evaluation was carried out on the FB15k, FB15k-237, WN18, WN18RR, and YAGO3-10 KB datasets. All of them contain a large number of entities and relations and are composed of training, validation, and test sets. Table 1 shows the details of all of the datasets.
• YAGO3-10: YAGO3-10 [13] is a subset of YAGO [9]. It contains 37 different relations and 123,182 entities, and each entity has at least 10 relations connected to it. It describes a person's characteristics, such as their occupation, ethnicity, and citizenship.

Comparison Methods
We used many baseline approaches from the area of link prediction to verify the effectiveness of the proposed method. These include translational distance, semantic matching, and multi-layer network approaches. The following are descriptions of these methods:
• TransE: TransE [16] is one of the most common link prediction models for knowledge bases. To model multi-relational data, TransE transforms entities and relations into embedding spaces and regards relations as translations from head entities to tail entities.

Experimental Implementation
The hyper-parameters for CANet in the experiments were chosen by a grid search throughout training. The following are the hyper-parameter ranges: input matrix and hidden layer dropout rate [0.1, 0. All datasets performed well with the following hyper-parameters: relation and entity embedding dimensions of 200, a batch size of 128, and label smoothing of 0.1. For FB15k-237 and WN18RR, we set the learning rate to 0.0001, the input matrix dropout rate to 0.4, the hidden layer dropout rate to 0.4, and the kernel size to 2 × 7; for FB15k, WN18, and YAGO3-10, we set the learning rate to 0.001, the input matrix dropout rate to 0.2, the hidden layer dropout rate to 0.3, and the kernel size to 2 × 5. We used the PyTorch [31] library to implement our method and to reproduce DistMult, ComplEx, and ConvE on a PC with an NVIDIA RTX 2070 Super.

Experimental Results
Tables 2-4 present the results of the link prediction tasks. The best score is in bold, and the second-best score is underlined. We observed the following:
• The proposed method surpassed the other baseline methods in most instances. For the majority of the metrics, CANet achieved state-of-the-art results. Thus, the proposed method could produce impressive link prediction results by using the CNN's parameter-efficient and fast operators. In comparison with the optimal methods on the WN18RR, FB15k-237, and YAGO3-10 datasets, CANet improved the MRR by 4.3%, 5.8%, and 14.5%, respectively. This demonstrates the efficacy and applicability of our method.
• Semantic matching methods are simple to learn and can represent a knowledge base through the training process. However, because of their redundancy, they were highly susceptible to overfitting, resulting in lower performance than that of the proposed approach.
• CANet surpassed ConvE on all metrics because it creates convolutional kernel weights that are relevant to each relation. Thus, different relations can yield different feature maps, making CANet better at identifying the differences between entities.
To summarize, the proposed link prediction method is highly expressive. Owing to its structure, the proposed method achieved considerable improvements for all evaluation metrics on the KBs. It also improves efficiency by generating relation-specific convolutional kernel weights to study different properties of the entity embeddings.

Influence of Kernel Size
To assess the importance of kernel size, we evaluated the effects of different convolutional kernel sizes on the performance of our method. The results are depicted in Figure 3. The abscissa represents the kernel sizes, which included the 2D sizes 2 × 3, 2 × 5, 2 × 7, and 2 × 9; the ordinate refers to the Hits@10 results. With a kernel size of 2 × 7, the best results were obtained for the FB15k-237 and WN18RR datasets, with decreasing results for smaller and larger kernel sizes. With a kernel size of 2 × 5, the best results were obtained for the FB15k and WN18 datasets, again with decreasing results for smaller and larger kernel sizes. Thus, a reasonable kernel size normally provides adequate results when convolving the input matrices: smaller kernel sizes can limit the number of interactions, while larger kernel sizes can result in overfitting. The appropriate kernel sizes used in this paper were 2 × 5 and 2 × 7.
Figure 3. The Hits@10 performance with various kernel sizes on the four datasets.

Parameter Sensitivity
(1) Label Smoothing: When measuring loss values, overfitting can be avoided by utilizing label smoothing. Accordingly, several comparison experiments on the WN18 and FB15k datasets were conducted to reveal the influence of this parameter on the results of CANet. Figure 4 depicts the outcomes. With a label smoothing value of 0.1, CANet achieved the best Hits@10 results on both datasets. Thus, label smoothing can increase the efficiency of our method to some degree, while increasing the value further results in underfitting. It is therefore recommended that the impact of label smoothing be analyzed on a per-dataset basis.
(2) Dropout Rate: To improve the generalization potential of our approach, we used the dropout strategy. On four datasets, we performed multiple experiments with dropout rates ranging from 0.0 to 0.5 to validate the influence of various dropout rates. The findings are depicted in Figure 5. We focused on how different input matrix and hidden layer dropout rates affected CANet's Hits@10 results. The findings show that using the dropout strategy provided better results than those obtained without it; hence, the dropout approach can significantly increase the method's efficiency. For the FB15k and WN18 datasets, an input matrix dropout rate of 0.2 and a hidden layer dropout rate of 0.3 performed best. For the WN18RR and FB15k-237 datasets, an input matrix dropout rate of 0.4 and a hidden layer dropout rate of 0.4 were best. This difference could be due to the fact that the first two datasets contain reversible relations, which make prediction easier.

Conclusions
In this paper, we proposed a novel multi-layer convolutional adaptive network method called CANet for link prediction in KBs. CANet makes use of CNN operators that are parameter-efficient and fast; it can thus produce additional expressive features and mitigate overfitting in large KBs. A 1-N scoring technique and the Adam optimizer are used to speed up training. To reduce overfitting and increase generalization, we use the dropout and label smoothing methods. According to the experimental results on several real-world datasets, the proposed method effectively produces expressive features and achieves new state-of-the-art performance. We also investigated the sensitivity of various parameters and performed additional analyses.
In the future, we plan to study link prediction tasks for temporal knowledge bases. As we know, growing knowledge often contains temporal dynamics, and this requires new methods for modeling such dynamic facts. Furthermore, the practical applications of knowledge bases are also meaningful; many industrial applications, such as question answering, recommendation systems, and information retrieval, can be enhanced by them.

Conflicts of Interest:
The authors declare no conflict of interest.