Rule-Guided Compositional Representation Learning on Knowledge Graphs with Hierarchical Types

Abstract: Representation learning for knowledge graphs projects the entities and relationships in triples into a low-dimensional continuous vector space. Early representation learning mostly focused on the information contained in the triples themselves and ignored other useful information. Since entities have different type-specific representations in different scenarios, the rich information in hierarchical entity types is helpful for obtaining a more complete knowledge representation. In this paper, a new knowledge representation framework (TRKRL) combining rule path information and entity hierarchical type information is proposed to exploit the interpretability of logical rules and the advantages of entity hierarchical types. Specifically, for entity hierarchical type information, we consider that entities have multiple representations under different types, treat the type information as the projection matrix of entities, and use a type encoder to model entity hierarchical types. For rule path information, we mine Horn rules from the knowledge graph to guide the synthesis of relations in paths. Experimental results show that TRKRL outperforms baselines on the knowledge graph completion task, which indicates that our model is capable of using entity hierarchical type information, relation path information, and logic rule information for representation learning.


Research Motivation
Knowledge graphs (KGs), such as Freebase [1], DBpedia [2], and NELL [3], are used to describe the relationship between things in the real world. KGs provide effective structured information and have been widely used in many fields, such as information retrieval [4,5] and question answering [6,7]. A typical knowledge graph usually stores facts in the form of triples (head, relationship, tail), denoted (h, r, t).
Even though many large KGs often contain billions of triples, they are still incomplete. Specifically, in DBpedia, 60% of individual entities do not indicate their place of birth [8].
Owing to the incompleteness of KGs, it is difficult to apply them in certain scenarios; for example, in a question answering system, incomplete knowledge causes errors in the answers obtained. Therefore, the task of supplementing the missing parts of KGs has become a top priority.
At present, most KG completion methods are based on knowledge representation learning [9], which projects the entities and relationships in the triples into a low-dimensional continuous space. TransE [10] is one of the most classic KG completion models and embeds entities and relationships into the same latent space. To better handle complex relationships, such as 1-to-N, N-to-1, and N-to-N, TransH [11] and TransR [12] use relation-specific hyper-planes and relation-specific spaces, respectively, to separate triples according to their correspondence. However, these models only focus on the triples themselves, ignoring the rich information located in the hierarchical entity types that can also be useful for obtaining a more complete entity representation. Entities have different type-specific representations in different scenarios. For example, a man can be the manager of a company or the father of a child, so entities with multiple types should be represented differently in different scenarios. In addition, relation paths in KGs can provide additional relationships for entity pairs. For instance, PTransE [13] successfully uses relation path information to obtain embeddings of entities and relationships. However, in existing work, the embeddings of relationships are randomly initialized, while the representation of a path is obtained by summing or multiplying the relations in the path [14]. Since the representation of the path is obtained purely through numerical calculations in the latent space, errors are propagated, thereby affecting the entire knowledge representation learning. To address this problem, we introduce logic rules, expecting that their accuracy can improve the accuracy of relational path inference. At the same time, the interpretability of logical rules can also enhance the interpretability of relational path inference.
Specifically, we propose a knowledge representation learning framework that combines entity hierarchical type information and rule path information (TRKRL), and we introduce both sources of information at the embedding level. We regard the entity hierarchical type information as the entity's projection matrix and use a type encoder to model it, which addresses the problem that entities have different type-specific representations in different scenarios. For relation path information and logic rule information, we use Horn rules mined from KGs to guide the synthesis of relations in the path and improve the accuracy of relational path reasoning, while the interpretability of logic rules can also enhance the interpretability of the model's representation learning. We evaluate the TRKRL model on benchmark datasets extracted from Freebase, and experimental results show that TRKRL exhibits a consistent improvement over all baselines. The main contributions of the present work can be summarized as follows. (1) We introduce logic rule information and use the accuracy of logical rules to improve the accuracy of relational path reasoning. At the same time, the interpretability of logical rules can also improve the interpretability of representation learning. (2) Entity hierarchical type information is introduced to obtain a more comprehensive representation of entities in order to cope with different scenarios in which the same entity has different types. (3) We propose a novel knowledge representation learning model that combines relation path information, logic rule information, and entity hierarchical type information, and experiments show that our model outperforms all baseline approaches.

Translation-Based Models
In recent years, great progress has been made in the representation learning of KGs [15][16][17][18], and many models are based on translation operations. TransE [10] is the most classic and representative translation-based model. TransE first projects both entities and relationships into the same continuous low-dimensional vector space, with $h, r, t \in \mathbb{R}^s$. The key operation of TransE is then to translate the semantics from head entities to tail entities by relationships. TransE assumes that the tail $t$ should be in the neighborhood of $h + r$; that is, $h + r \approx t$ when the triple $(h, r, t)$ holds. Hence, the energy function is $E(h, r, t) = \lVert h + r - t \rVert$. TransE is simple and effective for 1-to-1 relationships but has difficulty modeling 1-to-N, N-to-1, and N-to-N relationships.
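As a minimal sketch of the TransE energy described above, the following pure-Python function evaluates $\lVert h + r - t \rVert$ on toy vectors. The function name `transe_energy`, the choice between the L1 and L2 norms, and the embedding values are illustrative assumptions, not learned parameters or the authors' code.

```python
def transe_energy(h, r, t, norm=1):
    """Energy ||h + r - t|| under the L1 (norm=1) or L2 (norm=2) norm."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    return sum(d * d for d in diff) ** 0.5


# Toy 3-dimensional embeddings (illustrative values only).
h = [0.25, 0.5, 0.0]
r = [0.25, -0.5, 1.0]
t_good = [0.5, 0.0, 1.0]  # equals h + r, so the energy is zero
t_bad = [0.0, 1.0, 0.0]   # far from h + r, so the energy is large

print(transe_energy(h, r, t_good))
print(transe_energy(h, r, t_bad))
```

A triple that holds should receive a lower energy than a corrupted one, which is what the ranking-based evaluation later relies on.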
Some researchers have made efforts to solve the problem of representing complex relationships. TransH [11] interprets relations as translating operations on relation-specific hyper-planes, and projects h and t onto the relation-specific hyper-plane. In this way, an entity obtains different embedded representations when it is involved in different relationships. TransR [12] first models entities and relationships in independent entity and relationship spaces, and then maps entities from the entity space to the relationship-specific space. STransE [19] extends TransR by placing the head and tail entities in different spaces. TransRHS [20] considers the inherent generalization relationships among relations.
However, these models only focus on the relationship between triples and ignore the rich information carried in the triples, which will be applied in our model TRKRL.

Multi-Source Information Learning Models
Multi-source information refers to textual information, type information, and logical information that can complete the triple structure. In terms of text information, Socher et al. [14] proposed representing the entity as the average value of its word embeddings in the entity name, so as to share the textual information of similar entities. Based on the entity name and Wikipedia anchor, Wang et al. [11] and Zhong et al. [21] encode entities and words into the joint vector space. DKRL [22] explores two encoders to represent the semantics of entity descriptions, and considers the zero-shot scenario, in which some entities are novel compared to existing KGs with only descriptions.
Hierarchical entity type information and logic rule information are also significant for KGs. Krompaß et al. [8] propose that entity types serve as a hard constraint in KG latent variable models. To explicitly encode type information, Xie et al. [23] proposed TKRL. TKRL considers the hierarchical structure of entity types and addresses the problems of noisy and incomplete types in hard constraints. The interpretability of logic rules enhances the interpretability of representation learning. For instance, Minervini et al. [24] simply impose equivalence and inversion constraints on relation embeddings; RUGE [25] converts triples into complex formulas formed by atoms with logical connectives; Niu et al. [26] explicitly use Horn rules to derive path embeddings and create semantic associations between relationships. However, none of these approaches can simultaneously apply structured information, hierarchical type information, and logic rule information in the representation learning of the KG. The TRKRL model proposed in this paper fuses multi-source information and improves the interpretability and generalization of representation learning on the KG.

Extraction of Hierarchical Type Information
The fact that the same entity has different meanings in different scenarios is important for representation learning on the KG. However, most previous research pays little attention to the rich information located in the hierarchical types of entities. Figure 1 shows a triple instance; Isaac Newton has a variety of types (e.g., book/author, physical/physicist, and British/celebrity). It is, therefore, reasonable to believe that each entity should be represented differently in different scenarios, as a reflection of itself from different perspectives. Take the example of a hierarchical type $g$ with $m$ layers, where $g^{(i)}$ is the $i$th sub-type of $g$. Each sub-type $g^{(i)}$ has only one parent sub-type $g^{(i+1)}$; the most precise sub-type is the first layer, and the most general sub-type is the last. Going through the hierarchy from the bottom up, we obtain the representation of the hierarchical type as $g = \{g^{(1)}, g^{(2)}, \ldots, g^{(m)}\}$. We assign a type-specific projection matrix $W_g$ to each type $g$, and the head $h$ and tail $t$ of a relation are then projected under their particular types $g_{rh}$ and $g_{rt}$, respectively. The energy function is defined as follows:
$$E(h, r, t) = \lVert W_{rh} h + r - W_{rt} t \rVert,$$
in which $W_{rh}$ and $W_{rt}$ are different projection matrices for $h$ and $t$, respectively.
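To illustrate how type-specific projections change the energy of a triple, the sketch below assumes a translation-style energy $\lVert W_{rh} h + r - W_{rt} t \rVert$ under the L1 norm. The helper names `project` and `typed_energy`, the choice of norm, and all matrices and vectors are hypothetical toy values.

```python
def project(matrix, vector):
    """Matrix-vector product for a matrix given as nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def typed_energy(w_rh, h, r, w_rt, t):
    """L1 energy of a triple after projecting head and tail (norm is assumed)."""
    ph = project(w_rh, h)
    pt = project(w_rt, t)
    return sum(abs(a + b - c) for a, b, c in zip(ph, r, pt))


identity = [[1.0, 0.0], [0.0, 1.0]]
w_rh = [[1.0, 0.0], [0.0, 0.0]]  # hypothetical head-side type projection
w_rt = [[0.0, 0.0], [0.0, 1.0]]  # hypothetical tail-side type projection
h, r, t = [2.0, 7.0], [-2.0, 3.0], [9.0, 3.0]

print(typed_energy(identity, h, r, identity, t))  # plain translation, no types
print(typed_energy(w_rh, h, r, w_rt, t))          # with type projections
```

With identity projections the function reduces to the plain TransE energy; with the toy type projections the same triple fits exactly, which is the effect the type-specific view is meant to capture.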

Type Encoder
We use a general form of type encoder to encode hierarchical type information into the representation learning. In a KG, most entities have more than one type, so the projection matrix $W_e$ for entity $e$ is a weighted sum of all type matrices:
$$W_e = \sum_{i=1}^{n} a_i W_{g_i},$$
where $n$ is the number of types that entity $e$ has, $g_i$ is the $i$th type that $e$ belongs to, $a_i$ is the weight for $g_i$, and $W_{g_i}$ represents the projection matrix of $g_i$. The weights can be set according to statistical characteristics, such as type frequency. With this operation, the projection matrix of entity $e$ is the same in all scenarios.
In practice, however, the importance of entity attributes varies in different scenarios. Therefore, we have improved the type encoder, and the projection matrix $W_{rh}$ in a specific triple will be: where $1 \ge a_i > 0$. Similarly, the projection matrix $W_{rt}$ of the entity at the tail position can be obtained.
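The general type encoder, in which an entity's projection matrix is a weighted sum of its type matrices, can be sketched as follows. The function names, the two toy type matrices, and the weights 0.25 and 0.75 are assumptions for illustration; how the weights are chosen for a specific triple is not covered here.

```python
def weighted_type_matrix(type_matrices, weights):
    """Return sum_i a_i * W_{g_i} for square matrices given as nested lists."""
    if len(type_matrices) != len(weights):
        raise ValueError("one weight per type matrix is required")
    size = len(type_matrices[0])
    result = [[0.0] * size for _ in range(size)]
    for matrix, a in zip(type_matrices, weights):
        for i in range(size):
            for j in range(size):
                result[i][j] += a * matrix[i][j]
    return result


def apply_matrix(matrix, vector):
    """Project an entity embedding with the combined type matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Two hypothetical type matrices for one entity, e.g. an author type
# and a physicist type, with frequency-style weights.
w_author = [[1.0, 0.0], [0.0, 0.0]]
w_physicist = [[0.0, 0.0], [0.0, 1.0]]
w_e = weighted_type_matrix([w_author, w_physicist], [0.25, 0.75])

print(w_e)
print(apply_matrix(w_e, [4.0, 4.0]))
```

Because the weights are fixed per entity in this form, the resulting matrix does not depend on the relation, which is the limitation the triple-specific variant is meant to address.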

Hierarchical Encoder
As mentioned earlier, we consider the type information of entities to be hierarchical, so a recursive hierarchical encoder is used. During the projection process, entities (e.g., Isaac Newton) are first mapped to the more general sub-type space (e.g., physical) and then sequentially mapped to the more precise sub-type space (e.g., physical/physicist). The mathematical formula is:
$$W_g = \prod_{i=1}^{m} W_{g^{(i)}} = W_{g^{(1)}} W_{g^{(2)}} \cdots W_{g^{(m)}},$$
where $W_g$ is the projection matrix for type $g$, $W_{g^{(i)}}$ is the projection matrix of the $i$th sub-type, and $m$ is the number of layers for type $g$.
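The ordering of the recursive projection can be checked with a small sketch. It assumes the hierarchical matrix is the product of the sub-type matrices ordered from most precise to most general, so that the most general sub-type acts on the entity first. The helper names and the two toy matrices are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hierarchical_matrix(layers):
    """layers[0] is the most precise sub-type matrix, layers[-1] the most general."""
    result = layers[0]
    for layer in layers[1:]:
        result = matmul(result, layer)
    return result


def apply_matrix(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


w_precise = [[0, 1], [1, 0]]  # toy matrix for the precise sub-type
w_general = [[1, 0], [0, 0]]  # toy matrix for the general sub-type
w_g = hierarchical_matrix([w_precise, w_general])

e = [3, 5]
step_by_step = apply_matrix(w_precise, apply_matrix(w_general, e))
print(w_g, apply_matrix(w_g, e), step_by_step)
```

The two toy matrices do not commute, so the example makes the order visible: applying the combined matrix gives the same result as projecting into the general space first and the precise space second.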

Extraction of Logic Rules Information
To enable our model to provide more semantic information, we further fuse path and logic rule information. First, we mine rules with their confidence levels $\mu \in [0, 1]$ from the KG. The higher the confidence level of a rule, the higher the probability that it holds. Second, we restrict the maximum length of rules to 2. Thus, rules are classified into two categories according to their length, as follows. (1) $R_1$: the set of rules of length 1, each of which associates two relations, one in the rule body and one in the rule head. (2) $R_2$: the set of rules of length 2, which can be used to compose paths. Figure 2 provides specific examples. We use PTransE to implement the path extraction process, where the reliability of each path $p$ between a pair of entities $(h, t)$ is denoted $R(p|h, t)$. Table 1 lists the modes for $R_2$. For a rule in $R_2$ to be used in composing a path, the atoms of its rule body must form a sequential path. Therefore, we encode the eight rules to facilitate the formation of a valid path set $P(h, t)$ for the entity pair $(h, t)$. Taking the original rule $r_3(a, b) \Leftarrow r_1(b, e) \wedge r_2(a, e)$, for instance, we first convert the atom $r_1(b, e)$ into $r_1^{-1}(e, b)$, and then exchange the two atoms in the rule body to obtain the chain rule $r_3(a, b) \Leftarrow r_2(a, e) \wedge r_1^{-1}(e, b)$, which can be further abbreviated as $r_3 \Leftarrow r_2 \circ r_1^{-1}$.

Table 1. Modes for $R_2$: original rules and encoded rules.
To make full use of the encoded rules, we traverse the paths and iteratively perform the composition operation at the semantic level until the rules can no longer combine any relations. In the optimal case, all relations in the path can be synthesized iteratively by rules in $R_2$ and are eventually joined into a single relation between the pair of entities. In addition, when the path matches multiple rules at the same time, we choose the rule with the highest confidence level to compose the path. This leads to the path embedding $H(p)$ of the path $p$.
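The iterative composition can be sketched at the symbolic level as a greedy reduction of a relation path. This is one possible reading of the procedure, assuming that rules of length 2 are stored as a mapping from an adjacent relation pair to candidate head relations with confidences, and that the highest-confidence match anywhere in the path is applied first. The function name `compose_path` and the relation names are hypothetical.

```python
def compose_path(path, rules):
    """Greedily merge adjacent relation pairs using length-2 chain rules.

    rules maps (r_a, r_b) to a list of (head_relation, confidence).
    Returns the reduced path and the confidences of the rules applied.
    """
    path = list(path)
    used = []
    while True:
        best = None  # (confidence, position, head_relation)
        for i in range(len(path) - 1):
            for head, conf in rules.get((path[i], path[i + 1]), []):
                if best is None or conf > best[0]:
                    best = (conf, i, head)
        if best is None:
            return path, used  # no rule matches any adjacent pair
        conf, i, head = best
        path[i:i + 2] = [head]  # replace the matched pair by the rule head
        used.append(conf)


# Hypothetical encoded rules with confidence levels.
rules = {
    ("born_in_city", "city_in_country"): [("nationality", 0.9), ("lives_in", 0.4)],
    ("nationality", "country_language"): [("speaks", 0.8)],
}
path = ["born_in_city", "city_in_country", "country_language"]
reduced, confidences = compose_path(path, rules)
print(reduced, confidences)
```

Each merge shortens the path by one relation, so the loop always terminates. When no rule applies to the remaining relations, a numerical composition of their embeddings would still be needed, which this symbolic sketch does not cover.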
When a rule in $R_1$ holds, relation $r_1$ may have more semantic similarity to its directly implicated relation $r_2$. We, therefore, encode rules of the form $(a, r_2, b) \Leftarrow (b, r_1, a)$ as $(a, r_2, b) \Leftarrow (a, r_1^{-1}, b)$ for representation learning. During training, the embeddings of pairs of relations that appear together in a rule in $R_1$ are pushed closer than the embeddings of two relations that do not match any rule.

Integration of Information
For each triple $(h, r, t)$, we define three energy functions that model correlations for direct triples with hierarchical types, path pairs using rules in $R_2$, and relationship pairs using rules in $R_1$: where $E_1(h, r, t)$ measures the effectiveness of type information. $E_2(p, r)$ denotes the energy function evaluating the similarity between path $p$ and relation $r$, and $U(p) = \{\mu_1, \ldots, \mu_n\}$ denotes the set of confidence levels corresponding to all rules in $R_2$ employed in the composition of path $p$. $E_3(r, r_e)$ is an energy function that represents the similarity between a relation $r$ and another relation $r_e$. If $r_e$ is deduced from $r$ using a rule in $R_1$, the pair should be assigned a smaller score.

Loss Function and Optimization
We formalize the loss function as a margin-based score function with negative sampling: where $T$ is the set of all positive triples observed in the KG, $T'$ is the negative sampling set of $T$, $R_r$ is the set of all relations deduced from $r$ on the basis of rules in $R_1$, and $r_e$ is any one of the relations in $R_r$. $P(h, t)$ is the set of all paths connecting the entity pair $(h, t)$, of which $p$ is a path. $L_1$, $L_2$, and $L_3$ are the margin-based loss functions for the triple $(h, r, t)$ with entity hierarchical types, path pairs $(p, r)$, and relationship pairs $(r, r_e)$, respectively: $L_3(r, r_e) = \max(0, \gamma_3 + \beta E_3(r, r_e) - E_3(r, r'))$, where $\gamma_1$, $\gamma_2$, and $\gamma_3 > 0$ are hyper-parameters, and $\beta$ denotes the confidence level of the association between $r$ and $r_e$. Since there are no explicit negative triples in KGs, the entities or relationships in the training triples are randomly replaced to construct negative triples, with entities drawn from the entity set $E$. Moreover, a new triple obtained by replacement is not considered a negative sample if it is already in $T$. In addition, the negative triples sampling rule is expressed as follows: For optimization, mini-batch stochastic gradient descent (SGD) is used to minimize the loss function. The projection matrix set $W$ can be initialized randomly or with identity matrices. In addition, the embeddings of entities and relations can either be initialized randomly or be pre-trained by existing translation-based models, such as TransE.
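The margin-based terms and the filtered negative sampling described above can be sketched as follows. The function names, the `weight` parameter (playing the role of a confidence such as $\beta$), the uniform choice among head, relation, and tail, and the toy entity and relation names are assumptions for illustration; the sketch omits how the three losses are combined.

```python
import random


def margin_loss(pos_energy, neg_energy, margin=1.0, weight=1.0):
    """Hinge term max(0, margin + weight * E(pos) - E(neg))."""
    return max(0.0, margin + weight * pos_energy - neg_energy)


def corrupt(triple, entities, relations, known, rng):
    """Replace the head, relation, or tail at random.

    Candidates that are already known positives are rejected and resampled,
    so the returned triple is never in `known`.
    """
    h, r, t = triple
    while True:
        slot = rng.choice(("head", "relation", "tail"))
        if slot == "head":
            candidate = (rng.choice(entities), r, t)
        elif slot == "tail":
            candidate = (h, r, rng.choice(entities))
        else:
            candidate = (h, rng.choice(relations), t)
        if candidate not in known:
            return candidate


entities = ["a", "b", "c"]
relations = ["r1", "r2"]
known = {("a", "r1", "b"), ("b", "r2", "c")}
rng = random.Random(0)  # fixed seed for reproducibility
negative = corrupt(("a", "r1", "b"), entities, relations, known, rng)
print(negative, margin_loss(0.5, 2.0), margin_loss(1.0, 0.5))
```

The loss is zero once the negative triple's energy exceeds the positive one by at least the margin, so only violating pairs contribute gradients during SGD. The rejection loop assumes at least one candidate outside `known` exists.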

Datasets
We evaluate our model on two typical KGs, i.e., FB15K and FB15K-237, which are extracted from the large-scale Freebase [1]. FB15K contains 14,951 entities, 1345 relations, and 592,213 triples in total, and we split the triples into training, validation, and testing sets.
We collected a total of 571 entity types in FB15K; each entity has approximately eight types on average and at least one hierarchical type. In addition, to verify the validity of the logic rule information, the FB15K-237 dataset is also used in the experiment. Note that, unlike FB15K, FB15K-237 contains no inverse relations; hence, it is difficult to learn embeddings from these mutually independent relations. The statistics of all datasets are listed in Table 2. We collect all type instances from the type/instance fields in FB15K, as well as the relationship-specific type information distributed in the rdf-schema#domain and rdf-schema#range fields. Regarding the logic rule information, we choose AMIE+ as the rule mining tool for its convenience and speed. We select the confidence threshold from the range [0, 1] in steps of 0.1 to search for the best performance of the rules on the dataset.

Settings
TransE and TransR are the comparison objects of the proposed model. Considering the differences in application scenarios, we make changes to their original settings. We first use the $L_1$-norm as the dissimilarity measure of TransE. Then, in the negative sampling process, we replace both relationships and entities, and use the "bern" strategy, which replaces the head or tail with different probabilities. Similarly, we perform relationship replacement during the negative sampling process of TransR and train with the best parameters reported in the original paper.
We use mini-batch SGD to train the TRKRL model. In this paper, the best configuration of parameters is mini-batch size $S = 4800$, margin $\gamma = 1.0$, descending weight $\eta = 0.1$, and a learning rate $\lambda$ given by a linearly declining function. The embedding dimension for all models is 100. In the course of our experiments, we used several models for comparison. Among them, TransE and TransR are trained with the best parameters reported in their respective papers. For other baselines, including RESCAL, SE, SME, LFM, and TKRL, we directly use the reported results.

Evaluation Protocol
The KG completion task refers to completing any missing element of a triple. We take entity prediction as an example; the procedure for relationship prediction is similar. We focus on three principal assessment metrics: (1) the mean rank of correct entities (MR), (2) the mean reciprocal rank of correct entities (MRR), and (3) the proportion of correct answers ranked in the top n (Hits@n). The evaluation settings are named "Raw" and "Filter".
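The three metrics and the difference between the two evaluation settings can be sketched as follows. The sketch assumes that candidates are ranked by ascending energy and that the "Filter" setting skips other candidates already known to be correct, as is conventional for these benchmarks. The function names and the toy scores are hypothetical.

```python
def rank_of(correct, scores, filtered_out=()):
    """1-based rank of `correct` when candidates are sorted by ascending energy.

    Candidates listed in `filtered_out` (other known-true answers) are
    ignored, which corresponds to the "Filter" setting.
    """
    target = scores[correct]
    better = sum(
        1
        for candidate, score in scores.items()
        if candidate != correct and candidate not in filtered_out and score < target
    )
    return better + 1


def summarize(ranks, n=10):
    """Mean rank, mean reciprocal rank, and Hits@n over a list of ranks."""
    count = len(ranks)
    return {
        "MR": sum(ranks) / count,
        "MRR": sum(1.0 / rank for rank in ranks) / count,
        f"Hits@{n}": sum(1 for rank in ranks if rank <= n) / count,
    }


# Toy energies for four candidate tail entities of one test triple.
scores = {"e1": 0.2, "e2": 0.1, "e3": 0.3, "e4": 0.9}
raw_rank = rank_of("e3", scores)
filtered_rank = rank_of("e3", scores, filtered_out={"e1"})
print(raw_rank, filtered_rank)
print(summarize([1, 2, 4], n=2))
```

Lower MR and higher MRR and Hits@n indicate better performance; filtering can only keep a rank the same or improve it, which is why "Filter" results look better than "Raw" results.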
The KG completion task requires entity and relationship information, so we divide this task into entity and relationship prediction sub-tasks. We use FB15K for evaluation and the same evaluation conditions for all models to ensure the reliability of the results. Table 3 shows the entity prediction results, from which we can observe the following.

Entity Prediction
(1) Our method (TRKRL) surpasses the other baselines on all metrics. This illustrates that the fusion of logical rule information and hierarchical type information can improve the representation learning of the KG. (2) The results of TRKRL and TKRL on MR and Hits@10 are better than those of all other baselines, which shows that embedding the hierarchical type information of entities and relationships can improve the representation learning of the KG. (3) In particular, TRKRL is superior to TKRL on every metric, which shows the advantage of introducing logical rules to provide higher path synthesis accuracy and learn better path embeddings. Table 4 displays the results on the FB15K dataset for all compared methods. We adopt two classic models, TransE and TransR, as comparison objects. Consistent with our conjecture, the results obtained after filtering have lower mean ranks and higher Hits@10 than the results without filtering. More specifically, we observe the following.

Relationship Prediction
(1) Our method, TRKRL, outperforms all baselines on all metrics. In particular, it achieves a Hits@10 score of 94.1%. This indicates that the logic rule information added in TRKRL is conducive not only to entity prediction but also to relationship prediction. (2) The mean rank results of TKRL and TRKRL before filtering are significantly improved compared to the other baselines, which illustrates the positive impact of hierarchical type information as a constraint.

Ablation Study
To further examine the generality and reliability of our proposed method, we also conducted experiments on the FB15K-237 dataset. Compared with the classic datasets (e.g., FB15K and WN18), the FB15K-237 dataset has been constructed only recently. At present, relatively few results have been reported on this dataset, so our results can serve as a baseline. Table 5 shows the experimental results, in which TRKRL obtains the best performance, with an improvement of approximately 25.47% in Mean Reciprocal Rank over the best baseline, TransR. According to the results for Mean Reciprocal Rank and Hits@10, TransR is more suitable for the FB15K-237 dataset than most of the baselines. This may be attributed to the fact that TransR clusters entities with the same relationship. Although FB15K-237 eliminates inverse relations, we can use Horn rules to help establish semantic associations.
To verify that the components of TRKRL are meaningful, we performed entity prediction ablation experiments on FB15K, removing the paths, hierarchical types, and logic rules from TRKRL in turn. As shown in Table 6, TRKRL-P, TRKRL-HT, and TRKRL-LR represent TRKRL without paths, hierarchical types, and logic rules, respectively. Removing any one component degrades the performance of the model, which indicates that each source of information in the proposed multi-information fusion contributes to knowledge representation learning.

Evaluation Protocol
The purpose of this task is to determine whether a triple (h, r, t) is correct. This task has been considered as one of the indicators for evaluating the performance of the learning model. To accomplish this task, we constructed negative examples for the FB15K dataset according to the method of Socher et al. [14]. Specifically, a different threshold ζ is determined for each relationship. When the dissimilarity score E(h, r, t) of a triple is higher than the threshold ζ, the triple is considered negative; otherwise, it is positive. Table 7 shows the results of triple classification, in which TRKRL has the best performance. This shows the advantage of TRKRL over the baselines in triple classification and further supports the benefit of fusing logical rule information and hierarchical type information.
Table 7. Evaluation results on triple classification. Best score is in bold; second-best score is underlined.
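The threshold-based decision rule can be sketched for a single relationship as follows. How each threshold ζ is determined is not spelled out in the text; the sketch assumes it is chosen to maximize accuracy on labeled validation triples, and the function names and toy energies are hypothetical.

```python
def classify(energy, zeta):
    """A triple is positive when its energy does not exceed the threshold."""
    return energy <= zeta


def best_threshold(energies, labels):
    """Pick the threshold with the highest accuracy on validation triples.

    Only observed energies are tried as candidate thresholds; ties keep the
    smallest candidate. This selection rule is an assumption.
    """
    best_zeta, best_accuracy = None, -1.0
    for zeta in sorted(set(energies)):
        correct = sum(classify(e, zeta) == y for e, y in zip(energies, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_zeta, best_accuracy = zeta, accuracy
    return best_zeta, best_accuracy


# Toy validation energies for one relationship and their true labels.
energies = [0.2, 0.4, 0.9, 1.3]
labels = [True, True, False, False]
zeta, accuracy = best_threshold(energies, labels)
print(zeta, accuracy)
print(classify(0.3, zeta), classify(1.0, zeta))
```

In a full evaluation, one threshold per relationship would be kept in a dictionary keyed by relation and applied to the test triples of that relation.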

Conclusions
In this paper, we propose the knowledge graph representation learning framework TRKRL, which combines rule path information and entity hierarchical type information. By integrating entity hierarchical type information, Horn rules, and relationship path information in a triple embedding framework, TRKRL improves the accuracy of representation learning and obtains better knowledge representation. For entity hierarchical type information, we use a type encoder to model the hierarchical type information and then treat it as a projection matrix of entities to cope with different scenarios in which entities have different type representations. For Horn rules and relational path information, we use logical rules to guide the synthesis of relations in paths to improve the accuracy of relational path reasoning. In addition, logical rules can also enhance the interpretability of representation learning. Experimental results show that TRKRL outperforms all other baselines, which illustrates the importance of entity hierarchical type information and logical rules information in guiding the synthesis of relationships in paths.
In future work, we plan to explore the following directions: (1) designing new entity hierarchical type encoders to better model entity hierarchical type information; (2) exploring potential rules to guide the synthesis of relationships in the path, in order to better combine rules and paths; and (3) introducing other auxiliary information, such as textual information and visual information, in order to learn a more complete representation.