Modeling Noncommutative Composition of Relations for Knowledge Graph Embedding

Abstract: Knowledge Graph Embedding (KGE) is a powerful way to represent Knowledge Graphs (KGs), helping machines learn the patterns hidden in them. Relation patterns are useful hidden patterns that often help machines predict unseen facts. Many existing KGE approaches can model common relation patterns such as symmetry/antisymmetry, inversion, and commutative composition. However, most of them are weak in modeling noncommutative composition patterns, which means they cannot distinguish many composite relations such as “father’s mother” and “mother’s father”. In this work, we propose a new KGE method called QuatRotatScalE (QRSE) to overcome this weakness, utilizing rotation and scaling transformations of quaternions to design the relation embedding. Specifically, we embed the relations and entities into a quaternion vector space under the difference-norm KGE framework. Since the multiplication of quaternions does not satisfy the commutative law, QRSE can model noncommutative composition patterns naturally. The experimental results on a synthetic dataset support that QRSE has this ability, and the results on real-world datasets show that QRSE reaches the state of the art on the link prediction problem.


Introduction
A Knowledge Graph (KG) is composed of structured, objective facts. The facts are usually expressed as triples (h, r, t), where h, r, and t denote the head entity, the relation, and the tail entity, respectively; for example, (China, located_in, Asia). Knowledge graphs have successfully supported many applications in various fields, such as recommender systems [1], question answering [2], information retrieval [3], and natural language processing [4], and they have attracted increasing attention from both industry and academia. However, real-world knowledge graphs, such as DBpedia [5], Freebase [6], Yago [7], and WordNet [8], are usually incomplete, which restricts their applications. Thus, knowledge graph completion has become a widely studied subject. It is usually formulated as a link prediction problem, i.e., predicting the missing links that should be in the knowledge graph. Generally speaking, it asks us to design an agent that takes a query as input and outputs some entities. The query may contain a head entity and a relation, or a tail entity and a relation. Every outputted entity should form a plausible triple together with the query.
So far, the fundamental way to deal with the link prediction problem in industry and academia is Knowledge Graph Embedding (KGE) [9][10][11][12][13]. In this approach, the agent learns a low-dimensional vector representation, also called an embedding, for each entity and relation. We have to design a scorer that can grade any triple in embedding form for its plausibility. When predicting the unknown entity, the agent only needs to grade all possible triples (composed of the query with each candidate entity) and then take the candidate entities of the highest-scoring triples as the predicted result.
There are two reasons why KGE methods can tackle the link prediction problem effectively. On the one hand, there are many exploitable relation patterns in real-world KGs, such as symmetry/antisymmetry, inversion, and composition. These relation patterns generally present themselves as natural redundancies in KGs; for example, the triples (China, located_in, Asia) and (Asia, includes, China) may exist in some KG simultaneously while describing the same fact. Here, located_in and includes are inverses of each other. On the other hand, existing KGE models can already model most relation patterns, i.e., evaluate the plausibility of triples by exploiting these patterns. For example, TransE [10] can model inversion patterns. When the training KG contains many natural redundancies relevant to located_in and includes, even if TransE has only seen the triple (China, located_in, Asia) but not (Asia, includes, China), it can still assign a high plausibility score to (Asia, includes, China).
However, as far as we know, almost none of the existing KGE models can perfectly model all of the aforementioned relation patterns. For example, RotatE [13] declares that it can model all of the relation patterns, but it still has a fatal defect in modeling composition patterns: it can only model commutative composition patterns, not noncommutative ones. The same defect exists in some other KGE models that claim to model composition patterns, such as TransE. Briefly speaking, a composition pattern is a relation pattern of the shape r1 ⊕ r2 = r3, where ⊕ denotes the ordered composition of r1 and r2. If the composition pattern r1 ⊕ r2 = r3 exists in some KG, that KG has frequent natural redundancies of the form [(e1, r1, e2), (e2, r2, e3), (e1, r3, e3)], where each ei (i = 1, 2, 3) can be any entity. If additionally r2 ⊕ r1 = r3 holds, the pattern is commutative; otherwise it is noncommutative. Both RotatE and TransE mistakenly model every noncommutative composition pattern as a commutative one. This mistake brings severely absurd inferences. For example, they will infer (i.e., assign a high score to) the triple (Mary, is_the_father_of, Barbara) based on the existing triples (Mary, is_the_mother_of, James) and (James, is_the_husband_of, Barbara). The primary cause of this mistake is that the design inspiration of these models was not examined carefully enough. They both expect to express a fact triple (h, r, t) through an equation over the embeddings of h, r, and t (noted in boldface as h, r, and t): h ∘ r = t, where ∘ is some binary operation. Thus, they design the score function in the form −‖h ∘ r − t‖, so the closer the equation is to holding, the higher the plausibility score. TransE embeds entities and relations into a real vector space and takes the addition in that space as ∘, while RotatE replaces the real numbers with complex numbers and takes the element-wise multiplication as ∘. Since both of these operations satisfy the commutative law, the corresponding two models can only model commutative composition patterns.
Inspired by QuatE [14], which will be discussed in Section 2.1.2, we propose a new KGE model called QuatRotatScalE (or QRSE, for short) in this paper. The main difference from TransE and RotatE is that it embeds entities and relations into the quaternion [15] vector space and takes the element-wise multiplication in that space as ∘. Because quaternion multiplication generally does not satisfy the commutative law (though there are special cases where it holds), QRSE can model both commutative and noncommutative composition patterns. Furthermore, we can prove that QRSE can also model the remaining relation patterns, making it one of the few KGE models to date that can model all of them. We evaluated QRSE against many baselines on two well-established and widely used real-world datasets, FB15k-237 [16] and WN18RR [17]. The results indicate our method reaches the state of the art on the link prediction problem.
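To make the distinction concrete, here is a minimal sketch (not the paper's code) of why quaternions can distinguish composition order: the Hamilton product is noncommutative, while complex multiplication, as used by RotatE, always commutes. The function name `qmul` is our own illustrative choice.

```python
def qmul(p, q):
    """Hamilton product of quaternions stored as tuples (a, b, c, d) = a + bi + cj + dk."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

qi = (0, 1, 0, 0)   # the unit i
qj = (0, 0, 1, 0)   # the unit j
assert qmul(qi, qj) == (0, 0, 0, 1)    # ij = k
assert qmul(qj, qi) == (0, 0, 0, -1)   # ji = -k: order matters

# Complex multiplication can never make this distinction:
z1, z2 = complex(0, 1), complex(1, 1)
assert z1 * z2 == z2 * z1              # complex products always commute
```

Two relation embeddings built from such elements therefore yield different composite "embeddings" depending on the order in which they are multiplied.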

Related Work
At present, there are two classes of methods to solve the knowledge graph completion (link prediction) problem: KGE methods and path-finding methods. Both are introduced below.

KGE Method
Embedding methods are widely used in many fields of machine learning, since the embeddings of sentences, graphs, and many other data types can be easily transferred to various downstream tasks with only a little task-specific fine-tuning. For example, studies [18,19] first learn the embeddings of sentences and then use these embeddings to perform sentiment classification. Study [20] first learns an embedding for each graph and then uses these embeddings to predict the missing labels of graphs. In addition, some other studies learn an embedding vector for each object (i.e., node in a graph) of a given Heterogeneous Information Network (HIN) in a (semi-)supervised [21] or self-supervised [22] manner. Taking advantage of the learned object embeddings, they can fulfill many tasks, e.g., object classification, clustering, and visualization.
In the knowledge graph completion (link prediction) problem, Knowledge Graph Embedding methods are also the most studied. Let E represent the set of all entities and R the set of all relations in a KG. KGE methods assign a vector representation to every entity e ∈ E and relation r ∈ R, noted in boldface as e and r, respectively; e or r is also called the embedding of e or r. In addition, KGE methods need to design a score function f_r(h, t) to mark the plausibility of the triple (h, r, t). The objective of optimization is to assign high scores to true triples and low scores to false triples. Based on the type of score function, we can further divide KGE methods into two sorts: KGE based on difference norm and KGE based on semantic matching.

KGE Based on Difference Norm
The common motivation of this sort of method is to use a triple approximate equation f_1(h, r) ≈ f_2(t, r) to describe any triple (h, r, t), where the strict equation should hold for fact triples. As for unknown triples, the proximity of the two sides is taken to reflect the plausibility of the triple. Thus, the score functions of these methods are always of the form f_r(h, t) = −‖f_1(h, r) − f_2(t, r)‖.
Among them, there is a kind of method that is widely studied, called translational methods. We call them "translational" because the origin of this kind of method, TransE, uses the translation transformation to design the triple approximate equation. Precisely, it chooses the real vector space R^k as the embedding space and regards the relation embedding r as a translation from the head entity embedding h to the tail entity embedding t, so it designs the triple approximate equation as h + r ≈ t. Following TransE, many improvements have emerged. TransH [23] claims it is better to assign each relation a hyperplane in the embedding space (with normal vector r_p) and regards r as a translation from the projection of h to the projection of t on that hyperplane; hence the triple approximate equation of TransH is (I − r_p r_p^T)h + r ≈ (I − r_p r_p^T)t, where I is the identity matrix. TransR [24] generalizes TransH: it assigns a linear map to every relation r, noted as the transfer matrix W_r, which maps h and t into the relation space. TransR then utilizes the images of h and t in the relation space together with r to design the triple approximate equation in TransE's style: W_r h + r ≈ W_r t. Further, STransE [25] assigns each relation r two different transfer matrices W_{r,1} and W_{r,2}; similarly, its triple approximate equation is W_{r,1} h + r ≈ W_{r,2} t. These derivatives of TransE are collectively known as TransX. Their score functions can be written in the form f_r(h, t) = −‖g_{r,1}(h) + r − g_{r,2}(t)‖, where g_{r,i}(·) denotes a matrix multiplication depending on relation r.
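The TransX family above can be sketched in a few lines. This is an illustrative reimplementation of the score functions only (our own function names), not the authors' released code; an exact fact triple attains the maximal score of 0.

```python
import numpy as np

def transe_score(h, r, t):
    # TransE: g_{r,1} = g_{r,2} = identity, so a fact triple should satisfy h + r ≈ t.
    return -np.linalg.norm(h + r - t, ord=1)

def transr_score(h, r, t, W_r):
    # TransR: both entities are first mapped into the relation space by W_r.
    return -np.linalg.norm(W_r @ h + r - W_r @ t, ord=1)

h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t = h + r                                  # construct an exact fact
assert transe_score(h, r, t) == 0.0        # exact facts get the maximal score
assert transe_score(h, r, t + 1.0) < 0.0   # corrupted triples score lower
```

TransH and STransE follow the same template with different choices of g_{r,1} and g_{r,2}.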
Because of the large number of derivatives of TransE, some literature uses "translational methods" to refer to KGE based on difference norm in general, but this is not accurate enough. Some other methods do not turn to the translation transformation to design their triple approximate equations, such as TorusE [26] and RotatE. TorusE chooses a compact Lie group as its embedding space and can be regarded as a special case of RotatE when the embedding moduli are fixed [13]. RotatE embeds entities and relations into the complex vector space C^k, replacing the translation in R^k with a rotation in C^k. Specifically, for each element r_i (1 ≤ i ≤ k) of r, RotatE fixes it as a unitary complex number (i.e., |r_i| = 1). Hence the complex multiplication between the i-th element of h (i.e., h_i) and r_i rotates h_i in its complex plane by angle Arg(r_i) (the argument of the complex number r_i). Letting • denote the Hadamard (element-wise) product between two complex vectors, the triple approximate equation of RotatE is h • r ≈ t.
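RotatE's rotation view can be checked directly: multiplying by a unit complex number preserves modulus and only changes the argument. A minimal sketch (illustrative, not the official implementation):

```python
import numpy as np

def rotate_score(h, r, t):
    # RotatE fixes |r_i| = 1, so h * r rotates each h_i by Arg(r_i).
    assert np.allclose(np.abs(r), 1.0)
    return -np.abs(h * r - t).sum()        # -||h • r - t|| with an L1-style norm

h = np.array([1 + 0j, 0 + 1j])
r = np.exp(1j * np.array([np.pi / 2, np.pi]))  # unit-modulus rotations by 90° and 180°
t = h * r                                       # the exactly rotated tail
assert abs(rotate_score(h, r, t)) < 1e-12       # exact rotation: best possible score
```

Because the element-wise complex product commutes, h • r1 • r2 = h • r2 • r1 always, which is precisely why RotatE cannot separate the two orders of a composition.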
Some KGE methods have score functions belonging to a special case of the difference norm, of the form −‖h ∘ r − t‖, where ∘ is some binary operation. When the ideal optimization is achieved, the triple approximate equations of these methods hold: h ∘ r = t. This property is useful for explaining some abilities to model relation patterns. For example, TransE and RotatE are two such methods, and because their binary operations are both associative and commutative, they can only model commutative composition patterns. For more details, please see Section 5.

KGE Based on Semantic Matching
The intuition of this sort of method is to measure the plausibility of a triple by inspecting the matching degree of the latent semantics of the two entities and the relation.
There is a family of methods called bilinear models that design score functions as bilinear maps of head and tail entities. RESCAL [9] may be the first bilinear model. It selects the real vector space R^k as the embedding space of entities and assigns a k × k real matrix W_r to each relation r, then directly applies W_r to define a bilinear map as the score function. To reduce the complexity of W_r, DistMult [11] restricts W_r to be a diagonal matrix, so it can express W_r as a vector r in R^k and rewrite the score function as the multi-linear dot product of h, r, and t. To overcome DistMult's weakness in modeling the antisymmetry relation pattern, ComplEx [12] extends the embedding space to the complex vector space C^k and modifies the score function. QuatE [14] further develops ComplEx by extending the embedding space to the quaternion vector space to obtain better expressive ability. DualE [27] uses dual quaternion vectors to design the embeddings of entities and relations and chooses the dual quaternion inner product as the score function. DihEdral [28] designs entity embeddings with real vectors and relation embeddings with dihedral-group vectors, where each dihedral group element is expressed as a second-order discrete real matrix. Although its score function is a bilinear form, which belongs to the semantic-matching type, it is theoretically proven that this score function is equivalent, for optimizing relation embeddings, to a difference-norm function of the form −‖h ∘ r − t‖. So DihEdral has the ability to model composition patterns like TransE and RotatE. Furthermore, since the multiplication of dihedral group elements generally does not satisfy the commutative law, DihEdral can model noncommutative composition patterns. However, because the relation embeddings take discrete values, DihEdral has to use special treatments of the relation embeddings during training, and its actual performance is easily affected by these treatments. As for QuatE and DualE, their relation embeddings have the potential to model noncommutative composition patterns, since the (dual) quaternion multiplication generally does not satisfy the commutative law. Nevertheless, because their score functions belong to the semantic-matching type and currently lack the theoretical equivalence to a difference-norm function of the form −‖h ∘ r − t‖ that DihEdral enjoys, their abilities to model composition patterns have no strict theoretical guarantee. More precisely, their triple approximate equations, if any, do not necessarily hold when the ideal optimization is achieved, which is a crucial but easily overlooked step for a rigorous proof.
Apart from bilinear models, some models based on neural networks have emerged recently; for example, ConvE [17] and ConvKB [29] use convolutional neural networks to construct their score functions.
Some of the mentioned KGE methods are listed in Table 1 with their score functions. Their abilities to model the relation patterns are shown in Table 2. We can see that our QRSE can model all relation patterns, which is a rare ability.
Table 1. Score functions and embedding spaces of several KGE models. ⟨a, b, c⟩ := ∑_{i=1}^{k} a_i b_i c_i denotes the multi-linear dot product of vectors a, b, and c; an overline denotes the conjugate of a complex or quaternion vector; Re(·) denotes the real part of a complex number or quaternion; ⊗ denotes the Hadamard (element-wise) product between two quaternion vectors. Note that we report an equivalent formulation for QuatE to show its inheritance relationship with ComplEx.

Additionally, "supervised relation composition" [30] is a method that can model composition patterns under supervision, but it is not a KGE method. Its goal is to design and train a function model that takes the embeddings of two relations as input and outputs the embedding of their composite relation. The relation embeddings used are provided by an existing KGE model and are fixed once obtained. The supervisory information used for training is mined from the original KGs by another method. This method and the KGE models mentioned before belong to different research directions: the KGE direction studies how to directly model relation patterns (including composition patterns) by training entity and relation embeddings from the original KGs.

Path-Finding Method
This class of methods does not need a score function to predict unknown entities; examples include MINERVA [31], MultiHopKG [32], and DeepPath [33]. Instead, they start from the query entity node and follow the direction implied by the query relation to search the KG for the unknown entity. Compared with KGE methods, their results are explainable to some extent, since they can provide the inference paths as evidence, but a lack of precision is their weakness at present.

Preliminaries
Before introducing our proposed method, let us briefly explain the related concepts and geometric meaning of quaternions.

A Brief Introduction to Quaternions
As a number system extended from the complex numbers C, the quaternions H [15] introduce three fundamental quaternion units i, j, and k, which do not exist in the real numbers. Each quaternion q can be expressed as q = a + bi + cj + dk, where a, b, c, and d are all real numbers. The addition of quaternions is defined component-wise:

(a1 + b1 i + c1 j + d1 k) + (a2 + b2 i + c2 j + d2 k) = (a1 + a2) + (b1 + b2)i + (c1 + c2)j + (d1 + d2)k.

The multiplication between the fundamental quaternion units is defined by

i² = j² = k² = ijk = −1, ij = k, jk = i, ki = j, ji = −k, kj = −i, ik = −j.

Obviously, this multiplication is associative but not commutative. For completeness, we also stipulate that the multiplication between any one of {i, j, k} and a real number is commutative and associative. Requiring the distributive law, we consequently obtain the multiplication between two arbitrary quaternions:

q1 q2 = (a1 a2 − b1 b2 − c1 c2 − d1 d2) + (a1 b2 + b1 a2 + c1 d2 − d1 c2)i + (a1 c2 − b1 d2 + c1 a2 + d1 b2)j + (a1 d2 + b1 c2 − c1 b2 + d1 a2)k.

We can conclude that the multiplication of quaternions (also known as the Hamilton product) obeys the associative and distributive laws but does not obey the commutative law in general; nevertheless, there are special cases where the commutative law holds. Some useful concepts of quaternions are listed as follows (let q = a + bi + cj + dk):

Modulus: The modulus of q is written as |q| and is defined as |q| = √(a² + b² + c² + d²). Since the set of quaternions H is a linear space isomorphic to R⁴ with basis (1, i, j, k), the modulus intuitively means the length of q. In addition, if |q| = 1, q is called a unit quaternion.
Real and imaginary part: Similar to complex numbers, the real number a is the real part of q, and the real vector v := (b, c, d) is the imaginary part of q. Sometimes we express q in the form [a, v] for convenience. Then the multiplication of quaternions can be written as

[a1, v1][a2, v2] = [a1 a2 − v1 · v2, a1 v2 + a2 v1 + v1 × v2],

where · is the dot product and × is the cross product.
Reciprocal: If q ≠ 0, the reciprocal of q is the quaternion q⁻¹ such that q q⁻¹ = q⁻¹ q = 1, and it is equivalent to define q⁻¹ := (a − bi − cj − dk)/|q|², i.e., the conjugate of q divided by the squared modulus.
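The definitions above can be sanity-checked numerically. This is a minimal sketch with our own helper names (`qmul`, `qmod`, `qinv`), verifying that the reciprocal really inverts the Hamilton product:

```python
import math

def qmul(p, q):
    # Hamilton product of quaternions stored as tuples (a, b, c, d).
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def qmod(q):
    return math.sqrt(sum(x * x for x in q))      # |q| = sqrt(a^2 + b^2 + c^2 + d^2)

def qinv(q):
    m2 = sum(x * x for x in q)                   # |q|^2; q must be nonzero
    a, b, c, d = q
    return (a / m2, -b / m2, -c / m2, -d / m2)   # conjugate divided by |q|^2

q = (1.0, 2.0, 3.0, 4.0)
prod = qmul(q, qinv(q))                          # should be the unit quaternion 1
assert all(abs(x - y) < 1e-12 for x, y in zip(prod, (1.0, 0.0, 0.0, 0.0)))
assert abs(qmod((1.0, 1.0, 1.0, 1.0)) - 2.0) < 1e-12
```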

The Geometric Meaning of the Multiplication of Quaternions
To see the geometric meaning of the multiplication of quaternions, we view H as a linear space isomorphic to R⁴ with the orthonormal basis (1, i, j, k). Any q in H can be expressed in the form ρ[cos θ, sin θ n], where ρ ≥ 0 and ‖n‖ = 1. This is because if q ≠ 0 we can set ρ = |q|, θ = arccos(a/|q|), and n = v/‖v‖ (choosing n arbitrarily when v = 0), whereas if q = 0 we can set ρ = 0 and choose θ and n arbitrarily. Note that [cos θ, sin θ n] is a unit quaternion and implies the direction of q, while ρ implies the length of q.
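The decomposition q = ρ[cos θ, sin θ n] can be sketched as a small conversion routine (illustrative only; the function name `to_polar` and the arbitrary default axis are our own choices):

```python
import math

def to_polar(q, eps=1e-12):
    """Return (rho, theta, n) such that q = rho * [cos(theta), sin(theta) * n]."""
    a, b, c, d = q
    rho = math.sqrt(a*a + b*b + c*c + d*d)
    if rho < eps:                                    # q = 0: theta and n are arbitrary
        return 0.0, 0.0, (1.0, 0.0, 0.0)
    theta = math.acos(max(-1.0, min(1.0, a / rho)))  # clamp guards rounding error
    v = math.sqrt(b*b + c*c + d*d)                   # length of the imaginary part
    n = (b / v, c / v, d / v) if v > eps else (1.0, 0.0, 0.0)  # n arbitrary if v = 0
    return rho, theta, n

rho, theta, n = to_polar((0.0, 3.0, 0.0, 0.0))       # q = 3i
assert abs(rho - 3.0) < 1e-12
assert abs(theta - math.pi / 2) < 1e-12              # purely imaginary: theta = pi/2
assert n == (1.0, 0.0, 0.0)
```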
Take another quaternion p = [s, u] from H; then the product pq = ρ(p[cos θ, sin θ n]) is a new quaternion reached from p via two steps: (1) changing the direction of p according to [cos θ, sin θ n]; (2) stretching the length by ρ times. So we only have to see what change is implied by p[cos θ, sin θ n]. Without loss of generality, suppose u ≠ 0 and u is not parallel with n. Then we can find another orthonormal basis of H whose first two vectors span the plane span([1, 0], [0, n]) and whose last two span the orthogonal plane span([0, n⊥], [0, n×]), and decompose p as p = p1 + p2 with p1 in the former plane and p2 in the latter. Multiplying by [cos θ, sin θ n] on the right rotates p1 counterclockwise by angle θ in its plane, and likewise rotates p2 by angle θ in its plane (see Figure 1). As a special case, when u = 0 or u is parallel with n, p = p1; at that time, p[cos θ, sin θ n] only means the rotation in the plane span([1, 0], [0, n]). Moreover, the geometric meaning of qp is almost the same as that of pq, except that the rotation for p2 is clockwise.

Proposed Method
Now we start to introduce our proposed KGE model. The embedding spaces of entities E and relations R are both the quaternion vector space H^k. For any e ∈ E and r ∈ R, their embeddings are noted as e and r in lowercase bold letters, respectively. The i-th elements of e and r are written as e_i (e_i ∈ H) and r_i (r_i ∈ H) for every integer i from 1 to k. Our model is based on the difference norm, so we first define its triple approximate equation as h ⊗ r ≈ t for any triple (h, r, t). Here, ⊗ denotes the Hadamard (element-wise) product between two quaternion vectors, so this triple approximate equation is equivalent to asking h_i r_i ≈ t_i for all i (1 ≤ i ≤ k). In consequence, we get our score function:

f_r(h, t) = −‖h ⊗ r − t‖ = −∑_{i=1}^{k} |h_i r_i − t_i|,

where ‖q‖ abbreviates ∑_{i=1}^{k} |q_i| for any quaternion vector q, i.e., the sum of the moduli of its elements. According to the geometric meaning of quaternion multiplication, we can explain the purpose of this triple approximate equation intuitively: we treat each element r_i of the relation embedding (written in the form ρ_i[cos θ_i, sin θ_i n_i]) as a two-step transformation from h_i to t_i: (1) rotate h_i in two planes (span([1, 0], [0, n_i]) and span([0, n_{i,⊥}], [0, n_{i,×}])) counterclockwise by angle θ_i; (2) stretch h_i by the scaling factor ρ_i. Thus we refer to our model as QuatRotatScalE (or QRSE, for short), since we use Quaternions with Rotation and Scaling transformations to design the Embedding model.
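The score function above can be sketched as follows. This is an illustrative NumPy version (our own function names; quaternion vectors stored as (k, 4) arrays), not the trained model; an "ideal" fact triple with t = h ⊗ r attains the maximal score of 0.

```python
import numpy as np

def hamilton(p, q):
    # Element-wise Hamilton product of quaternion vectors stored as (k, 4) arrays.
    a1, b1, c1, d1 = p.T
    a2, b2, c2, d2 = q.T
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2], axis=-1)

def qrse_score(h, r, t):
    # f_r(h, t) = -sum_i |h_i r_i - t_i|: sum of element-wise quaternion moduli.
    diff = hamilton(h, r) - t
    return -np.sqrt((diff ** 2).sum(axis=-1)).sum()

k = 3
rng = np.random.default_rng(0)
h = rng.normal(size=(k, 4))
r = rng.normal(size=(k, 4))
t = hamilton(h, r)                    # an "ideal" fact triple
assert abs(qrse_score(h, r, t)) < 1e-12
assert qrse_score(h, r, h) < 0.0      # a mismatched tail scores strictly lower
```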

Optimization
The general objective of KGE models is to return high scores for true triples and low scores for false triples. We adopt negative sampling as our training style to avoid the efficiency loss brought by the huge number of entities, like most other KGE methods. The training KG usually contains only true triples (positive samples, noted as Ω) and no false triples (negative samples). Thus we apply a common way (i.e., corrupting the positive samples) to obtain the negative samples. Supposing (h, r, t) is a positive sample, we can get two sets of negative samples by replacing the head or tail entity with other entities: N_h(r, t) := {(h′, r, t) | h′ ∈ E, (h′, r, t) ∉ Ω} and N_t(h, r) := {(h, r, t′) | t′ ∈ E, (h, r, t′) ∉ Ω}. Following RotatE [13], we use the loss function on each triple (h, r, t) in the training KG:

L = −log σ(γ + f_r(h, t)) − ∑_{(h′, r, t′) ∈ N} p(h′, r, t′) log σ(−f_r(h′, t′) − γ),

where σ is the sigmoid function, γ is a fixed margin, and N is N_h(r, t) or N_t(h, r). In practice, N is regenerated in the same way (N_h(r, t) or N_t(h, r)) for every positive sample in one training batch; once it turns to the next training batch, N switches the regenerating way. p(h′, r, t′) is the self-adversarial negative sampling distribution proposed by RotatE [13], defined as

p(h′_j, r, t′_j) = exp(α f_r(h′_j, t′_j)) / ∑_i exp(α f_r(h′_i, t′_i)),

where α is the temperature of sampling. Self-adversarial negative sampling moderates the low efficiency of uniform negative sampling. We take Adam as our optimizer. Moreover, p(h′, r, t′) plays the role of an importance sampling ratio in L, so we need not backpropagate gradients through it.
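The loss above can be sketched numerically. This is an illustrative sketch (our own function names, example γ and α values), with the self-adversarial weights treated as constants, as in the original formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_adv_loss(pos_score, neg_scores, gamma=6.0, alpha=0.5):
    # pos_score: f_r(h, t) of the positive sample; neg_scores: scores of negatives.
    w = np.exp(alpha * neg_scores)
    p = w / w.sum()                      # p(h', r, t'): softmax over negative scores
    return (-np.log(sigmoid(gamma + pos_score))
            - (p * np.log(sigmoid(-neg_scores - gamma))).sum())

# Well-separated case: positive near 0, negatives far below -gamma -> small loss.
good = self_adv_loss(pos_score=-0.1, neg_scores=np.array([-20.0, -18.0, -25.0]))
# Poorly separated case: a negative scoring close to 0 keeps the loss high.
bad = self_adv_loss(pos_score=-0.1, neg_scores=np.array([-20.0, -18.0, -0.5]))
assert 0.0 < good < bad
```

Higher α concentrates p on the hardest (highest-scoring) negatives, which is the self-adversarial effect.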

Abilities to Model Relation Patterns
In this subsection, we prove that QRSE can model symmetry/antisymmetry, inversion, and composition patterns, while TransE and RotatE are unable to model noncommutative composition patterns. In the following, if the triple (h, r, t) is in the knowledge graph, we write it in the embedding space as h ⊗ r = t for QRSE, because its score function is a special case of the difference norm −‖h ⊗ r − t‖, and when the ideal optimization is achieved we directly get h ⊗ r = t (we can replace ⊗ with + or • for TransE or RotatE for the same reason).

QRSE can model symmetry/antisymmetry patterns:
Suppose e2 ⊗ r = e1 and e1 ⊗ r = e2. We can get e1 ⊗ r ⊗ r = e1. It means for any i (1 ≤ i ≤ k), e1,i r_i r_i = e1,i. If e1,i = 0, r_i can be any quaternion. But if e1,i ≠ 0, r_i must satisfy r_i² = 1. Writing r_i = [a_i, v_i], we have r_i² = [a_i² − ‖v_i‖², 2 a_i v_i] = [1, 0], which forces v_i = 0 and a_i = ±1, i.e., r_i = ±1. Hence r models a symmetry pattern when every r_i ∈ {1, −1}, and an antisymmetry pattern whenever r_i² ≠ 1 for some i.

QRSE can model inversion patterns:
Suppose e1 ⊗ r1 = e2 and e2 ⊗ r2 = e1. Then for any i with e1,i ≠ 0, r1,i r2,i = 1, i.e., r2,i = r1,i⁻¹. Thus r1 and r2 model an inversion pattern when r2 equals the element-wise reciprocal of r1.

QRSE can model both commutative and noncommutative composition patterns:
Suppose e1 ⊗ r2 = e2, e2 ⊗ r3 = e3, and e1 ⊗ r1 = e3. It means for any i (1 ≤ i ≤ k), if e1,i = 0, then r2,i, r3,i, and r1,i can be any quaternions; but if e1,i ≠ 0, they must satisfy r2,i r3,i = r1,i. Moreover, if we further suppose e4 ⊗ r3 = e5, e5 ⊗ r2 = e6, and e4 ⊗ r1 = e6, then if e4,i ≠ 0 for all i (1 ≤ i ≤ k), r3,i, r2,i, and r1,i must satisfy r3,i r2,i = r1,i. This means r2,i r3,i = r3,i r2,i. If we note r2,i = [a2,i, v2,i] and r3,i = [a3,i, v3,i], then we get

r2,i r3,i − r3,i r2,i = [0, 2 v2,i × v3,i],

so the two products coincide if and only if v2,i × v3,i = 0, i.e., v2,i is parallel with v3,i. We can conclude that if r2 ⊗ r3 = r1, then r2, r3, and r1 model a composition pattern. Moreover, if v2,i is parallel with v3,i for all i (1 ≤ i ≤ k), it is a commutative composition pattern; otherwise, it is a noncommutative composition pattern.
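The parallel-imaginary-part condition for commutativity can be checked numerically. A minimal sketch (illustrative, with our own `qmul` helper):

```python
def qmul(p, q):
    # Hamilton product of quaternions stored as tuples (a, b, c, d).
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

r2 = (1.0, 2.0, 0.0, 0.0)      # imaginary part (2, 0, 0)
r3 = (0.5, 4.0, 0.0, 0.0)      # imaginary part (4, 0, 0), parallel with r2's
assert qmul(r2, r3) == qmul(r3, r2)   # parallel imaginary parts: products commute

r4 = (0.5, 0.0, 1.0, 0.0)      # imaginary part (0, 1, 0), not parallel with r2's
assert qmul(r2, r4) != qmul(r4, r2)   # non-parallel: order changes the result
```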

Experiments
In this section, we first compare QRSE with RotatE on a small knowledge graph made up of two families. This experiment verifies the superiority of QRSE in modeling noncommutative composition relation patterns. Then we evaluate QRSE against many baselines on two well-established and widely used real-world datasets.

Experiment on a KG about Two Families
There are 10 entities and 4 relations in the training KG. Each entity is a member of one family, and each relation is a type of kinship; for example, the triple (Am1, son, Am2) means Am1 has a son called Am2. All of the triples in the training KG are shown in Figure 2, where each directed edge represents a triple whose direction goes from the head entity to the tail entity. Furthermore, the test set contains two triples: (Bm1, daughter_of_son, Bw3) and (Bm1, son_of_daughter, Bm3). We let models predict the head or tail entity for each test triple, so there are 4 queries during the test process. We use Hit@1 to measure the performance of models, i.e., the proportion of correctly answered queries (those where the true answer's score is ranked first) among all test queries. The test performances of RotatE and QRSE are shown in Figure 3. We can see that QRSE reaches the best Hit@1 value of 1.00 quickly and keeps it throughout the rest of training. RotatE also reaches the best Hit@1 value quickly; however, after that, its Hit@1 value fluctuates between 0.5 and 1.00 randomly. To explain this phenomenon, we inspected the detailed scores and embeddings at step 16,000, which is large enough to ensure the convergence of the two models. The top 3 scores for all test queries are shown in Table 3. For the two queries to predict the head entity Bm1, the scores of Bm1 are much higher than those of the second candidate entities for both RotatE and QRSE. However, for the two queries to predict the tail entities Bm3 and Bw3, only QRSE keeps a large gap between the first and second scores, whereas RotatE gives very close scores to the top 2 candidate entities on both queries. This result reveals that, for RotatE, the score ranks of the top 2 candidates are unstable and easily affected by random noise on the two tail queries, which is why the Hit@1 of RotatE fluctuates during training. Moreover, for RotatE, the top 2 candidate entities are Bm3 and Bw3 for both tail queries, so we guess the embeddings of these two entities are also very close. We can show this fact by directly inspecting the relation embeddings of the two models in Figure 5. Note that daughter ⊕ son and son ⊕ daughter are not relations in the KG but combinations of relations in the KG; their "embeddings" are calculated from the embeddings of those relations (e.g., the "embedding" of daughter ⊕ son is daughter ⊗ son in QRSE). Obviously, the embeddings of son_of_daughter and daughter_of_son are almost the same in RotatE, since both approach daughter • son during training. However, they are different in QRSE, since the embedding of son_of_daughter approaches daughter ⊗ son while the other approaches son ⊗ daughter during training.

Experiments on Real-World Datasets

We also evaluated our method on two well-established and widely used real-world knowledge graphs, FB15k-237 [16] and WN18RR [17], against several strong baselines.
FB15k-237 is derived from FB15k [10], which is a subset of Freebase and mainly records facts about movies, actors, and sports. FB15k suffers from test leakage through inverse relations: there are too many inversion patterns in the KG, which are too easy to model, so even a simple rule-based model can perform well [17]. To make the results more reliable, FB15k-237 removes these inverse relations. The statistics of FB15k-237 are 14,541 entities, 237 relations, 272,115 training triples, 17,535 validation triples, and 20,466 test triples.
WN18RR is derived from WN18 [10], which is a subset of WordNet and records lexical relations between words. WN18 also suffers from test leakage through inverse relations, so WN18RR removes its inverse relations too. The statistics of WN18RR are 40,943 entities, 11 relations, 86,835 training triples, 3034 validation triples, and 3134 test triples.
From each test triple (h, r, t), we generate two queries: (?, r, t) and (h, r, ?). Given each query, we make a candidate triple by placing a candidate entity in the position of the entity to predict. The score of each candidate entity is the score of its corresponding candidate triple. When ranking all the scores, we omit the scores of candidate triples that already exist in the training, validation, or test set, except the true answer to the query. This process is called "filtered" in some literature and is widely adopted in existing methods to avoid a possibly flawed evaluation.
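The filtered protocol can be sketched as follows (an illustrative helper with our own name `filtered_rank`, not the evaluation code of any particular system):

```python
import numpy as np

def filtered_rank(scores, answer, known_true):
    # scores: one score per candidate entity; known_true: entities that already
    # form true triples with this query in train/valid/test, besides the answer.
    s = scores.astype(float).copy()
    for e in known_true:
        if e != answer:
            s[e] = -np.inf               # filter out the other known true answers
    return int((s > s[answer]).sum()) + 1  # 1-based rank of the true answer

scores = np.array([0.9, 0.8, 0.7, 0.95])
# Entity 3 also forms a true triple with this query, so it is filtered out;
# the true answer (entity 1) is then outranked only by entity 0.
assert filtered_rank(scores, answer=1, known_true=[3]) == 2
```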

Results
We adopt the standard evaluation measures for both datasets: the mean reciprocal rank of the true answers (MRR) and the proportion of queries whose true answers are ranked in the top k (Hit@k).
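Given the filtered rank of the true answer for each query, these measures reduce to two one-liners (a minimal sketch with our own function names):

```python
def mrr(ranks):
    # Mean reciprocal rank over all test queries.
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, k):
    # Fraction of queries whose true answer is ranked in the top k.
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 2, 4]
assert abs(mrr(ranks) - (1.0 + 0.5 + 0.25) / 3) < 1e-12
assert hits_at(ranks, 1) == 1 / 3
assert hits_at(ranks, 3) == 2 / 3
```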
The link prediction results on the real-world datasets are shown in Table 4. The result of TransE is taken from [29]. The results of DistMult, ComplEx, and ConvE are taken from [17]. The results of RotatE and DualE are taken from [13,27], respectively. The results of DihEdral(STE) and DihEdral(Gumbel) are taken from [28], where STE and Gumbel are two special treatments of the discrete relation embeddings. The results of QuatE and QuatE(TC) are taken from [14], where TC indicates that the corresponding model uses type constraints [36]. From this table, we can see that QRSE outperforms RotatE by a large margin on all datasets and evaluation measures, which supports our analysis of the ability to model composition patterns. Compared with DihEdral(STE) and DihEdral(Gumbel), QRSE outperforms both on the two real-world datasets, whereas DihEdral(STE) is better than DihEdral(Gumbel) on FB15k-237 and the opposite holds on WN18RR. This means the performance of DihEdral is easily affected by the special treatments, and DihEdral cannot perform well on both real-world datasets simultaneously. Compared with DualE and QuatE, QRSE outperforms both of them too, which means that among all methods using (dual) quaternions so far, QRSE has explored the greatest potential of the (dual) quaternion space for knowledge graph embedding. Because type constraints [36] can integrate prior knowledge into various KGE models and significantly improve their link prediction performance, QRSE and most baselines report results without them for fairness, except QuatE(TC). Surprisingly, QRSE is even slightly superior to QuatE with type constraints overall; its success in this unfair comparison further demonstrates the excellence of QRSE. Overall, our QRSE reaches the state of the art on the link prediction problem on real-world datasets.

Conclusions and Future Work
We proposed a novel knowledge graph embedding model, QRSE, based on quaternions. QRSE is a KGE model that can model noncommutative composition patterns. Besides, it can also model many other relation patterns, such as symmetry/antisymmetry, inversion, and commutative composition patterns. We verified these properties by theoretical proofs and experiments. From the definition of the triple approximate equation of QRSE, we can easily see that QRSE is a generalization of RotatE; conversely, in some special cases QRSE degenerates to RotatE, for example, when the coefficients of j and k are fixed to 0 for all quaternions in all embeddings and the modulus of every quaternion in the relation embeddings is fixed to 1. Before QRSE, QuatE had already generalized ComplEx by replacing the complex numbers with quaternions. However, QuatE only takes advantage of the fact that quaternions are more expressive than complex numbers, while our method not only leverages this expressiveness but also exploits the noncommutativity of quaternion multiplication to model noncommutative composition patterns. The results of experiments on real-world datasets show that QRSE reaches the state of the art on the link prediction problem. For future work, we plan to combine QRSE with deep models for natural language processing. With its help, we expect deep models to achieve higher accuracy on question answering tasks and to make their answers more interpretable.
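The degenerate case mentioned above can be checked numerically. In this sketch (the helper `qmul` is our own implementation of the Hamilton product, not code from the paper), quaternions whose j- and k-coefficients are fixed at 0 and whose modulus is 1 multiply exactly like unit complex numbers, which is the RotatE rotation:

```python
import math

# Our own sketch of the Hamilton product on quaternions (s, x, y, z),
# i.e. s + x*i + y*j + z*k.
def qmul(p, q):
    s1, x1, y1, z1 = p
    s2, x2, y2, z2 = q
    return (s1*s2 - x1*x2 - y1*y2 - z1*z2,
            s1*x2 + x1*s2 + y1*z2 - z1*y2,
            s1*y2 + y1*s2 + z1*x2 - x1*z2,
            s1*z2 + z1*s2 + x1*y2 - y1*x2)

a, b = 0.4, 1.3
qa = (math.cos(a), math.sin(a), 0.0, 0.0)   # cos a + sin a * i, modulus 1
qb = (math.cos(b), math.sin(b), 0.0, 0.0)
s, x, y, z = qmul(qa, qb)
# The product is cos(a+b) + sin(a+b) * i: a rotation by angle a+b, just as
# unit complex numbers compose in RotatE; the j and k parts stay 0.
print(abs(s - math.cos(a + b)) < 1e-12 and abs(x - math.sin(a + b)) < 1e-12)
print(y == 0.0 and z == 0.0)
```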
Suppose p = [s, u] and q = ρ[cos θ, sin θ n] from H; then the product pq = ρ(p[cos θ, sin θ n]) is a new quaternion reached from p in two steps: (1) changing the direction of p according to [cos θ, sin θ n], and (2) stretching the length by a factor of ρ. So we only have to see what change is implied by p[cos θ, sin θ n].
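As a quick numerical illustration of the quaternion arithmetic above (the helper `qmul` is our own sketch of the Hamilton product, not code from the paper), the product of the basis units i and j depends on the order of the factors:

```python
# Our own sketch of the Hamilton product on quaternions (s, x, y, z),
# i.e. s + x*i + y*j + z*k.
def qmul(p, q):
    s1, x1, y1, z1 = p
    s2, x2, y2, z2 = q
    return (s1*s2 - x1*x2 - y1*y2 - z1*z2,
            s1*x2 + x1*s2 + y1*z2 - z1*y2,
            s1*y2 + y1*s2 + z1*x2 - x1*z2,
            s1*z2 + z1*s2 + x1*y2 - y1*x2)

p = (0, 1, 0, 0)   # the unit quaternion i
q = (0, 0, 1, 0)   # the unit quaternion j
print(qmul(p, q))  # i*j = k  -> (0, 0, 0, 1)
print(qmul(q, p))  # j*i = -k -> (0, 0, 0, -1)
```

The two products differ in sign, which is the noncommutativity that QRSE exploits.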

Figure 1. How do p_1 and p_2 rotate when they are multiplied by [cos θ, sin θ n] on the right.

Figure 2. The structure of the training KG, where each directed edge represents a triple. Since we need 2 and 4 real numbers to determine a complex number and a quaternion, respectively, we take C^10 (i.e., embedding dimension k = 10) and H^5 (i.e., k = 5) as the entity embedding spaces for RotatE and QRSE. Thus, in practice, we can express the entity embeddings of RotatE and QRSE as 20-D real vectors. Except for the embedding

Figure 3. The Hit@1 performance of RotatE and QRSE on the test set over the course of training.

Figure 4 shows the embeddings of Bm3 and Bw3 in RotatE and QRSE. As we guessed, the two embeddings are very close in RotatE but different in QRSE. This result verifies that RotatE is unable to model noncommutative composition patterns, but QRSE can. Let us use bold type to indicate the embeddings as before. For RotatE, along with the training process, Bw3 will approach Bm2 • daughter, and Bm2 will approach Bm1 • son. Hence Bw3

Figure 4. The entity embeddings of RotatE and QRSE at training step 16,000. The 10-D complex or 5-D quaternion vectors are expressed as the corresponding 20-D real vectors.

Figure 5. The relation embeddings of RotatE and QRSE at training step 16,000. A relation embedding of RotatE has 10 complex numbers with modulus 1, which are determined by their 10 arguments; thus we express it by its 10 arguments in degrees. For QRSE, we continue to use the corresponding 20-D real vector for each relation embedding.
The size of the negative sample sets N_h(r, t) and N_t(h, r) is fixed and much smaller than |E|.
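A minimal sketch of how such a negative set can be drawn, assuming the usual corruption scheme of replacing the head of a true triple (the helper `negative_heads` and all names in the example are our own, not code from the paper):

```python
import random

def negative_heads(entities, true_triples, r, t, n):
    """Sample n corrupted heads h' such that (h', r, t) is not a true triple."""
    negatives = set()
    while len(negatives) < n:
        h = random.choice(entities)
        if (h, r, t) not in true_triples:
            negatives.add(h)
    return negatives

random.seed(0)                                   # for reproducibility
entities = list(range(50))                       # toy entity set E
true = {(0, "likes", 1), (2, "likes", 1)}        # toy true triples
negs = negative_heads(entities, true, "likes", 1, 5)
print(len(negs))                                             # 5
print(all((h, "likes", 1) not in true for h in negs))        # True
```

Tail corruption for N_t(h, r) is symmetric: sample t' with (h, r, t') not a true triple.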

• TransE and RotatE cannot model noncommutative composition patterns; they can only model commutative composition patterns:
For TransE, we suppose e_1 + r_2 = e_2, e_2 + r_3 = e_3, e_1 + r_1 = e_3, e_4 + r_3 = e_5, e_5 + r_2 = e_6, but e_4 + r_1 ≠ e_6, which means the composition of relations r_2 and r_3 is noncommutative. From the first three equations we get r_2 + r_3 = r_1, and from the fourth and fifth equations we get e_4 + r_3 + r_2 = e_6. Because r_2 + r_3 = r_3 + r_2, we get e_4 + r_1 = e_6, which contradicts the condition. Therefore TransE cannot model noncommutative composition patterns. If we replace the condition e_4 + r_1 ≠ e_6 with e_4 + r_1 = e_6, then the composition of relations r_2 and r_3 becomes commutative. In this case the previous contradiction disappears, which means TransE can model commutative composition patterns. As for RotatE, we suppose e_1 • r_2 = e_2, e_2 • r_3 = e_3, e_1 • r_1 = e_3, e_4 • r_3 = e_5, e_5 • r_2 = e_6, but e_4 • r_1 ≠ e_6, which means the composition of relations r_2 and r_3 is noncommutative. Since r_2 • r_3 = r_3 • r_2 (the multiplication of complex numbers satisfies the commutative law), we can get e_4 • r_1 = e_6 in the same way as for TransE, which contradicts the condition. So RotatE cannot model noncommutative composition patterns. If we replace the condition e_4 • r_1 ≠ e_6 with e_4 • r_1 = e_6, then the composition of relations r_2 and r_3 becomes commutative. In this case the previous contradiction disappears, which means RotatE can model commutative composition patterns.
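The key step in the RotatE argument, r_2 • r_3 = r_3 • r_2, can be checked numerically: elementwise products of complex rotations always compose commutatively (the variable names below are our own, chosen to mirror the proof):

```python
import cmath

# Two hypothetical RotatE relation embeddings: vectors of unit complex numbers.
r2 = [cmath.exp(1j * 0.3), cmath.exp(1j * 1.1)]
r3 = [cmath.exp(1j * 2.0), cmath.exp(1j * 0.7)]

# Compose in both orders; complex multiplication commutes elementwise.
comp_23 = [a * b for a, b in zip(r2, r3)]
comp_32 = [b * a for a, b in zip(r2, r3)]
print(all(abs(u - v) < 1e-12 for u, v in zip(comp_23, comp_32)))  # True
```

So any composite relation RotatE represents is forced to be order-independent, unlike the Hamilton product used by QRSE.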

Table 3. The detailed test results of RotatE and QRSE at training step 16,000.

Table 4. Link prediction results on the FB15k-237 and WN18RR datasets. Numbers in boldface are the best, and underlined numbers are the second best.