Triplet fact-based embedding models treat the knowledge graph as a set of triplets containing all observed facts. In this section, we introduce three groups of embedding models: translation-based models, tensor factorization-based models, and neural network-based models.
2.2.1. Translation-Based Models
Since Mikolov et al. [49,50] proposed a word embedding algorithm and its toolkit word2vec, distributed representation learning has attracted more and more attention. Using that model, the authors found interesting translation invariance phenomena in the word vector space. For example:
$$\overrightarrow{King}-\overrightarrow{Queen}\approx \overrightarrow{Man}-\overrightarrow{Woman}$$
where $\overrightarrow{w}$ is the vector of word w produced by the word2vec model. This result means that the word representation model can capture the same implicit semantic relationship between the words “$King$” (“$Man$”) and “$Queen$” (“$Woman$”). Mikolov et al. showed experimentally that the property of translation invariance exists widely in the semantic and syntactic relations of vocabulary.
Inspired by word2vec, Bordes et al. [19] introduced the idea of translation invariance into the knowledge graph embedding field and proposed the TransE embedding model. TransE represents all entities and relations in a uniform continuous, low-dimensional feature space ${\mathbb{R}}^{d}$, and the relations can be regarded as connection vectors between entities. Let E and R denote the sets of entities and relations, respectively. For each triplet ($h,r,t$), the head and tail entities h, t and the relation r are embedded as the vectors $\mathbf{h}$, $\mathbf{t}$, and $\mathbf{r}$. As illustrated in Figure 2a, for each triplet ($h,r,t$), TransE follows a geometric principle:
$$\mathbf{h}+\mathbf{r}\approx \mathbf{t}$$
The authenticity of the given triplet ($h,r,t$) is computed via a score function. This score function is defined as the distance between $\mathbf{h}+\mathbf{r}$ and $\mathbf{t}$ under the ${\ell}_{1}$-norm or ${\ell}_{2}$-norm. Mathematically, it is expressed as follows:
$${f}_{r}(h,t)={\Vert \mathbf{h}+\mathbf{r}-\mathbf{t}\Vert}_{{\ell}_{1}/{\ell}_{2}}$$
In order to learn an effective embedding model that can discriminate the authenticity of triplets, TransE minimizes a margin-based hinge ranking loss function during training:
$$\mathcal{L}=\sum _{(h,r,t)\in S}\sum _{({h}^{\prime},r,{t}^{\prime})\in {S}^{\prime}}\max \left(0,\;{f}_{r}(h,t)+\gamma -{f}_{r}({h}^{\prime},{t}^{\prime})\right)$$
where S and ${S}^{\prime}$ denote the sets of correct triplets and corrupted triplets, respectively, and $\gamma $ indicates the margin hyperparameter. In training, TransE stochastically replaces the head or tail entity in each triplet with other candidate entities to generate the corrupted triplet set ${S}^{\prime}$. The construction formula is shown in Equation (5):
$${S}^{\prime}_{(h,r,t)}=\{({h}^{\prime},r,t)\mid {h}^{\prime}\in E\}\cup \{(h,r,{t}^{\prime})\mid {t}^{\prime}\in E\} \tag{5}$$
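The TransE score and the construction of corrupted triplets can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy embeddings, and the choice to exclude the original entity when corrupting are all assumptions made for the example.

```python
import random

def transe_score(h, r, t, norm=1):
    """Distance between h + r and t under the l1 (norm=1) or l2 (norm=2) norm."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(x) for x in diff)
    return sum(x * x for x in diff) ** 0.5

def corrupt(triplet, entities, rng):
    """Replace the head or the tail (chosen at random) with a different entity.

    Excluding the original entity is an assumption of this sketch.
    """
    h, r, t = triplet
    if rng.random() < 0.5:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))

# Toy 2-d embeddings (hypothetical values chosen by hand).
emb = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 3.0]}
rel = {"r": [1.0, 0.0]}

pos = ("a", "r", "b")                     # here h + r equals t exactly
neg = corrupt(pos, list(emb), random.Random(0))
f_pos = transe_score(emb[pos[0]], rel[pos[1]], emb[pos[2]])
f_neg = transe_score(emb[neg[0]], rel[neg[1]], emb[neg[2]])
print(pos, f_pos, neg, f_neg)
```

A correct triplet should receive a smaller distance than its corrupted counterpart, which is what the margin-based loss enforces during training.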
Although TransE has achieved a great advancement in large-scale knowledge graph embedding, it still has difficulty in dealing with complex relations, such as $1\text{-}N$, $N\text{-}1$, and $N\text{-}N$ [51,52]. For instance, consider a $1\text{-}N$ relation where the head entity has multiple corresponding tail entities, i.e., $(h,r,{t}_{i})$ holds $\forall \phantom{\rule{4pt}{0ex}}i\in \{1,2,...,n\}$. According to the principle TransE follows, $\mathbf{h}+\mathbf{r}\approx {\mathbf{t}}_{i}$, all embedding vectors of the tail entities should be approximately equal, i.e., ${\mathbf{t}}_{1}\approx {\mathbf{t}}_{2}\approx ...\approx {\mathbf{t}}_{n}$. As a concrete example, take the two triplet facts $(Elon\_Musk,Founder\_of,SpaceX)$ and $(Elon\_Musk,Founder\_of,Tesla)$, in which $Founder\_of$ is a $1\text{-}N$ relation as described above. Following TransE, the embedding vectors of $SpaceX$ and $Tesla$ should be very similar in the feature space. However, this result is clearly unreasonable because $SpaceX$ and $Tesla$ are two companies in entirely different fields that have nothing in common except that $Elon\_Musk$ is their founder. Other complex relations such as $N\text{-}1$ and $N\text{-}N$ raise the same problem.
To handle this issue in complex relations, TransH [51] extends the original TransE model by enabling each entity to have different embedding representations when it is involved in different relations. In other words, TransH allows each relation to hold its own relation-specific hyperplane. Therefore, an entity has different embedding vectors in different relation hyperplanes. As shown in Figure 2b, for a relation r, TransH employs the relation-specific translation vector ${\mathbf{d}}_{r}$ and the normal vector of the hyperplane ${\mathbf{w}}_{r}$ (with ${\Vert {\mathbf{w}}_{r}\Vert}_{2}=1$) to represent it. For each triplet fact $(h,r,t)$, the embedding vectors $\mathbf{h}$ and $\mathbf{t}$ are first projected onto the relation-specific hyperplane along the direction of the normal vector ${\mathbf{w}}_{r}$. ${\mathbf{h}}_{\perp}$ and ${\mathbf{t}}_{\perp}$ indicate the projections:
$${\mathbf{h}}_{\perp}=\mathbf{h}-{\mathbf{w}}_{r}^{\top}\mathbf{h}{\mathbf{w}}_{r},\phantom{\rule{1em}{0ex}}{\mathbf{t}}_{\perp}=\mathbf{t}-{\mathbf{w}}_{r}^{\top}\mathbf{t}{\mathbf{w}}_{r}$$
Afterwards, ${\mathbf{h}}_{\perp}$ and ${\mathbf{t}}_{\perp}$ are connected by the relation-specific translation vector ${\mathbf{d}}_{r}$. Similar to TransE, a small score is expected when $(h,r,t)$ holds. The score function is formulated as follows:
$${f}_{r}(h,t)={\Vert {\mathbf{h}}_{\perp}+{\mathbf{d}}_{r}-{\mathbf{t}}_{\perp}\Vert}_{2}^{2}$$
Here, ${\Vert \cdot \Vert}_{2}^{2}$ is the squared Euclidean distance. By utilizing this relation-specific hyperplane, TransH can project an entity to different feature vectors depending on the relation, which alleviates the issue of complex relations.
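The hyperplane projection and the resulting score can be illustrated with a small plain-Python sketch. The helper names and the hand-picked 3-d vectors are assumptions for illustration only; the example shows two different tail vectors that both satisfy the same $1\text{-}N$ relation because they differ only along the normal vector.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(e, w):
    """Project e onto the hyperplane with unit normal w: e - (w . e) w."""
    s = dot(w, e)
    return [ei - s * wi for ei, wi in zip(e, w)]

def transh_score(h, t, d_r, w_r):
    """Squared Euclidean distance between h_perp + d_r and t_perp."""
    hp, tp = project(h, w_r), project(t, w_r)
    return sum((a + d - b) ** 2 for a, d, b in zip(hp, d_r, tp))

# Hypothetical example: the hyperplane is the x-y plane (normal along z).
w_r = [0.0, 0.0, 1.0]
d_r = [1.0, 0.0, 0.0]
h = [0.0, 0.0, 0.0]
t1 = [1.0, 0.0, 2.0]     # two distinct tails that differ only along w_r
t2 = [1.0, 0.0, -5.0]
print(transh_score(h, t1, d_r, w_r), transh_score(h, t2, d_r, w_r))
```

Both tails obtain the same score even though their embeddings are far apart, which is exactly the freedom that TransE lacks.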
Following this idea, TransR [52] extends the TransH algorithm. Although TransH enables each entity to obtain a different representation for each of its relations, the entities and relations in this model are still represented in the same feature space ${\mathbb{R}}^{d}$. In fact, an entity may carry various semantic meanings, and different relations may concentrate on different aspects of entities; therefore, placing entities and relations in the same semantic space might make the model insufficient for graph embedding.
TransR expands the concept of relation-specific hyperplanes proposed by TransH to relation-specific spaces. In TransR, for each triplet $(h,r,t)$, entities are embedded as $\mathbf{h}$ and $\mathbf{t}$ in an entity vector space ${\mathbb{R}}^{d}$, and relations are represented as a translation vector $\mathbf{r}$ in a relation-specific space ${\mathbb{R}}^{k}$. As illustrated in Figure 2c, TransR projects $\mathbf{h}$ and $\mathbf{t}$ from the entity space into the relation space. This operation separates, in the relation space, those entities (denoted as colored triangles) that are similar to the head or tail entities (denoted as colored circles) in the entity space.
More specifically, for each relation r, TransR defines a projection matrix ${\mathbf{M}}_{r}\in {\mathbb{R}}^{k\times d}$ to transform the entity vectors into the relation-specific space:
$${\mathbf{h}}_{\perp}={\mathbf{M}}_{r}\mathbf{h},\phantom{\rule{1em}{0ex}}{\mathbf{t}}_{\perp}={\mathbf{M}}_{r}\mathbf{t}$$
The projected entity vectors are denoted by ${\mathbf{h}}_{\perp}$ and ${\mathbf{t}}_{\perp}$, and the scoring function is similar to that of TransH:
$${f}_{r}(h,t)={\Vert {\mathbf{h}}_{\perp}+\mathbf{r}-{\mathbf{t}}_{\perp}\Vert}_{2}^{2} \tag{9}$$
Compared with TransE and TransH, TransR has made significant progress in performance. However, it also has some deficiencies: (i) For a relation r, the head and tail entities share the same projection matrix ${\mathbf{M}}_{r}$, whereas it is intuitive that the types or attributes of head and tail entities may be essentially different. For instance, in the triplet $(Elon\_Musk,Founder\_of,SpaceX)$, $Elon\_Musk$ is a person and $SpaceX$ is a company; they are two different types of entities. (ii) The projection from the entity space to the relation-specific space is an interactive process between entities and relations; it cannot capture integrated information when the projection matrix is generated from relations only. (iii) Owing to the projection matrix, TransR requires a large amount of computing resources; its memory complexity is $\mathcal{O}({N}_{e}d+{N}_{r}dk)$, compared to $\mathcal{O}({N}_{e}d+{N}_{r}k)$ for TransE and TransH.
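The TransR projection and the parameter count behind deficiency (iii) can be sketched as follows. This is an illustrative sketch under assumptions: the function names, the toy matrix, and the exact parameter tally (entity vectors plus one translation vector and one $k\times d$ matrix per relation) are chosen for the example.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transr_score(h, t, r, M_r):
    """|| M_r h + r - M_r t ||_2^2 with a relation-specific k x d matrix M_r."""
    hp, tp = matvec(M_r, h), matvec(M_r, t)
    return sum((a + c - b) ** 2 for a, c, b in zip(hp, r, tp))

def n_params_transr(n_e, n_r, d, k):
    """Entity vectors plus (translation vector + projection matrix) per relation."""
    return n_e * d + n_r * (k + d * k)

# Hypothetical example with d = 3 and k = 2: M_r keeps the first two coordinates.
M_r = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
h = [0.0, 0.0, 1.0]
t = [1.0, 0.0, 7.0]
r = [1.0, 0.0]
print(transr_score(h, t, r, M_r), n_params_transr(10, 2, 3, 2))
```

The $dk$ term per relation is what makes the memory cost grow quickly as the embedding dimensions increase.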
To eliminate the above drawbacks, an improved method, TransD [22], was proposed. It optimizes TransR by using two vectors for each entity-relation pair to construct a dynamic mapping matrix that substitutes for the projection matrix in TransR. It is illustrated in Figure 2d. Specifically, given a triplet $(h,r,t)$, each entity and relation is represented by two embedding vectors. The first vector represents the meaning of the entity/relation, denoted as $\mathbf{h}$, $\mathbf{t}$ $\in {\mathbb{R}}^{d}$ and $\mathbf{r}$ $\in {\mathbb{R}}^{k}$, and the second vector (defined as ${\mathbf{h}}_{p}$, ${\mathbf{t}}_{p}$ $\in {\mathbb{R}}^{d}$ and ${\mathbf{r}}_{p}$ $\in {\mathbb{R}}^{k}$) is used to form two dynamic projection matrices ${\mathbf{M}}_{rh},{\mathbf{M}}_{rt}\in {\mathbb{R}}^{k\times d}$. These two matrices are calculated as:
$${\mathbf{M}}_{rh}={\mathbf{r}}_{p}{\mathbf{h}}_{p}^{\top}+{\mathbf{I}}^{k\times d},\phantom{\rule{1em}{0ex}}{\mathbf{M}}_{rt}={\mathbf{r}}_{p}{\mathbf{t}}_{p}^{\top}+{\mathbf{I}}^{k\times d}$$
where ${\mathbf{I}}^{k\times d}$ is an identity matrix. Therefore, the projection matrices involve both the entity and the relation, and the projected vectors of $\mathbf{h}$ and $\mathbf{t}$ are defined as:
$${\mathbf{h}}_{\perp}={\mathbf{M}}_{rh}\mathbf{h},\phantom{\rule{1em}{0ex}}{\mathbf{t}}_{\perp}={\mathbf{M}}_{rt}\mathbf{t}$$
Finally, the score function is the same as that of TransR in Equation (9). By constructing the dynamic mapping matrices from two projection vectors, TransD effectively reduces the complexity to $\mathcal{O}({N}_{e}d+{N}_{r}k)$.
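One way to see why TransD avoids storing a matrix per relation: multiplying a vector by a rank-one matrix plus an identity reduces to a dot product and a scaled vector. The sketch below compares the explicit matrix form with the equivalent vector form; the function names and toy values are assumptions, and the rectangular identity is taken to have ones on its main diagonal (with $k\le d$ here).

```python
def transd_project_matrix(e, e_p, r_p):
    """Explicit form: build M = r_p e_p^T + I (k x d), then compute M e."""
    k, d = len(r_p), len(e)
    M = [[r_p[i] * e_p[j] + (1.0 if i == j else 0.0) for j in range(d)]
         for i in range(k)]
    return [sum(M[i][j] * e[j] for j in range(d)) for i in range(k)]

def transd_project_fast(e, e_p, r_p):
    """Equivalent vector form: (e_p . e) r_p + I e, with no k x d matrix."""
    s = sum(a * b for a, b in zip(e_p, e))
    k, d = len(r_p), len(e)
    return [s * r_p[i] + (e[i] if i < d else 0.0) for i in range(k)]

# Hypothetical example with d = 3 and k = 2.
e, e_p, r_p = [1.0, 2.0, 3.0], [1.0, 0.0, 1.0], [2.0, 1.0]
print(transd_project_matrix(e, e_p, r_p), transd_project_fast(e, e_p, r_p))
```

Only the vectors are stored as parameters, so the per-relation cost is linear in $k$ rather than proportional to $dk$.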
All the methods described above including Trans (E, H, R, and D) ignore two properties of existing KGs: heterogeneity (some relations have many connections with entities and others do not), which causes underfitting on complex relations or overfitting on simple relations, and the imbalance (there is a great difference between head and tail entities for a relation in quantities), which indicates that the model should treat head and tail entities differently. TranSparse [
24] overcomes the heterogeneity and imbalance by applying two model versions: TranSparse(share) and TranSparse(separate).
TranSparse (share) leverages adaptive sparse matrices ${\mathbf{M}}_{r}\left({\theta}_{r}\right)$ to replace the dense projection matrices for each relation r. The sparse degree ${\theta}_{r}$ is linked to the number of entities connected with relation r; it is defined as follows:
$${\theta}_{r}=1-(1-{\theta}_{min})\frac{{N}_{r}}{{N}_{{r}^{\ast}}}$$
where ${\theta}_{min}$ ($0\le {\theta}_{min}\le 1$) is a hyperparameter, ${N}_{r}$ denotes the number of entity pairs connected with the relation, and ${N}_{{r}^{\ast}}$ represents the maximum of them. Therefore, the projected vectors are formed by:
$${\mathbf{h}}_{\perp}={\mathbf{M}}_{r}\left({\theta}_{r}\right)\mathbf{h},\phantom{\rule{1em}{0ex}}{\mathbf{t}}_{\perp}={\mathbf{M}}_{r}\left({\theta}_{r}\right)\mathbf{t}$$
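TranSparse (separate) employs two separate sparse mapping matrices, ${\mathbf{M}}_{rh}\left({\theta}_{rh}\right)$ and ${\mathbf{M}}_{rt}\left({\theta}_{rt}\right)$, for each relation, where ${\mathbf{M}}_{rh}\left({\theta}_{rh}\right)$ projects the head entities and the other projects the tail entities. The sparse degree and the projected vectors are extended as follows:
$${\theta}_{rl}=1-(1-{\theta}_{min})\frac{{N}_{rl}}{{N}_{{r}^{\ast}{l}^{\ast}}}\phantom{\rule{1em}{0ex}}(l\in \{h,t\}),\phantom{\rule{1em}{0ex}}{\mathbf{h}}_{\perp}={\mathbf{M}}_{rh}\left({\theta}_{rh}\right)\mathbf{h},\phantom{\rule{1em}{0ex}}{\mathbf{t}}_{\perp}={\mathbf{M}}_{rt}\left({\theta}_{rt}\right)\mathbf{t}$$
where ${N}_{rl}$ denotes the number of entities linked at position $l$ (head or tail) of relation r, and ${N}_{{r}^{\ast}{l}^{\ast}}$ represents the maximum of them.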
A simpler version of this method, called STransE, was proposed by Nguyen et al. [53]. In that approach, the sparse projection matrices ${\mathbf{M}}_{rh}\left({\theta}_{rh}\right)$ and ${\mathbf{M}}_{rt}\left({\theta}_{rt}\right)$ are replaced by dense mapping matrices ${\mathbf{M}}_{rh}$ and ${\mathbf{M}}_{rt}$, such that the projected vectors become:
$${\mathbf{h}}_{\perp}={\mathbf{M}}_{rh}\mathbf{h},\phantom{\rule{1em}{0ex}}{\mathbf{t}}_{\perp}={\mathbf{M}}_{rt}\mathbf{t}$$
The methods introduced so far merely modify the definition of the projection vectors or matrices; they do not consider other ways to improve TransE. TransA [54] boosts the performance of the embedding model from another viewpoint by modifying the distance measure of the score function. It introduces an adaptive Mahalanobis distance to replace the traditional Euclidean distance because the Mahalanobis distance shows better adaptability and flexibility [55]. Given a triplet $(h,r,t)$, ${\mathbf{M}}_{r}$ is defined as a symmetric nonnegative weight matrix associated with the relation r; the score function of TransA is formulated as:
$${f}_{r}(h,t)={(|\mathbf{h}+\mathbf{r}-\mathbf{t}|)}^{\top}{\mathbf{M}}_{r}(|\mathbf{h}+\mathbf{r}-\mathbf{t}|)$$
As shown in
Figure 2e, the two arrows represent the same relation
$HasPart$, and
$(Room,HasPart,Wall)$ and
$(Sleeping,HasPart,Dreaming)$ are true facts. If we use the isotropic Euclidean distance, which is utilized in traditional models to distinguish the authenticity of a triplet, it could yield erroneous triplets such as
$(Room,HasPart,Goniff)$. Fortunately, TransA has the capability of discovering the true triplet via introducing the adaptive Mahalanobis distance because the true one has shorter distances in the
x or
y directions.
The above methods embed entities and relations into a real-valued vector space. KG2E [56] proposed a novel approach that introduces uncertainty to construct a probabilistic knowledge graph embedding model. KG2E uses multi-dimensional Gaussian distributions to embed entities and relations; each entity and relation is represented as a Gaussian distribution, in which the mean is the position of the entity or relation in a semantic feature space, and the covariance signifies its uncertainty, i.e.,
$$\mathbf{h}\sim \mathcal{N}({\mu}_{h},{\Sigma}_{h}),\phantom{\rule{1em}{0ex}}\mathbf{r}\sim \mathcal{N}({\mu}_{r},{\Sigma}_{r}),\phantom{\rule{1em}{0ex}}\mathbf{t}\sim \mathcal{N}({\mu}_{t},{\Sigma}_{t})$$
Here, ${\mu}_{h},{\mu}_{r},{\mu}_{t}\in {\mathbb{R}}^{d}$ are the mean vectors of $\mathbf{h}$, $\mathbf{r}$, and $\mathbf{t}$, respectively, and ${\Sigma}_{h},{\Sigma}_{r},{\Sigma}_{t}\in {\mathbb{R}}^{d\times d}$ indicate the covariance matrices corresponding to entities and relations. KG2E also utilizes the distance between $\mathbf{h}-\mathbf{t}$ and $\mathbf{r}$ as the metric to distinguish the authenticity of triplets, where the traditional $\mathbf{h}-\mathbf{t}$ is transformed to the distribution $\mathcal{N}({\mu}_{h}-{\mu}_{t},{\Sigma}_{h}+{\Sigma}_{t})$. This model employs two approaches to estimate the similarity between two probability distributions: Kullback–Leibler divergence [57] and expected likelihood [58].
Figure 2f displays an illustration of KG2E. Each entity is represented as a circle without underline and the relations are the circles with underline. Circles with the same color indicate an observed triplet, where the head entity of all triplets is $Hillary\phantom{\rule{4pt}{0ex}}Clinton$. The area of a circle denotes the uncertainty of the entity or relation. As Figure 2f shows, there are three triplets, and the uncertainty of the relation “$spouse$” is lower than the others.
TransG [59] addresses the situation of multiple relation semantics: when a relation is associated with different entity pairs, it may carry multiple meanings. It also uses Gaussian distributions to embed entities, but it differs significantly from KG2E because it leverages a Gaussian mixture model to represent relations:
$$\mathbf{r}=\sum _{m}{\pi}_{r,m}{\mu}_{r,m},\phantom{\rule{1em}{0ex}}{\mu}_{r,m}\sim \mathcal{N}\left({\mu}_{t}-{\mu}_{h},({\sigma}_{h}^{2}+{\sigma}_{t}^{2})\mathbf{I}\right)$$
where ${\pi}_{r,m}$ is the weight of distinct semantics, ${\mu}_{r,m}$ is the translation vector of the $m$th semantic component of relation r, ${\sigma}_{h}^{2}$ and ${\sigma}_{t}^{2}$ are the variances of the head and tail entity distributions, and $\mathbf{I}$ indicates an identity matrix. As shown in Figure 2g, dots are correct entities related to the relation “$Has\phantom{\rule{4pt}{0ex}}part$” and the triangles represent corrupted entities. In the traditional models (left), the corrupted entities cannot be distinguished from the entire set of entities because all semantics are confused in the relation “$Has\phantom{\rule{4pt}{0ex}}part$.” In contrast, TransG (right) can find the incorrect entities by utilizing multiple semantic components.
TransE can only handle simple relations and is ill-suited to complex relations. The extensions of TransE, including TransH, TransR, TransD, TransA, TransG, and so forth, were proposed to address the issue of complex relations. Extensive experiments on public benchmark datasets, which are generated from WordNet [
60] and Freebase, show that these modified models achieve significant improvements with respect to the baseline, and verify the feasibility and validity of these methods. A comparison of these models in terms of their scoring functions and space size is shown in
Table 2.
2.2.2. Tensor Factorization-Based Models
Tensor factorization is another category of effective methods for KG representation learning. The core idea of these methods is described as follows. First, the triplet facts in the KG are transformed into a 3D binary tensor
$\mathcal{X}$. As illustrated in
Figure 3, given a tensor
$\mathcal{X}\in {\mathbb{R}}^{n\times n\times m}$, where
n and
m denote the number of entities and relations, respectively, each slice
${\mathcal{X}}_{k}$ (
$k=1,2,...,m$) corresponds to a relation type
${R}_{k}$, and
${\mathcal{X}}_{ijk}=1$ refers to the fact that the triplet (
ith entity,
kth relation,
jth entity) exists in the graph; otherwise,
${\mathcal{X}}_{ijk}=0$ indicates a nonexistent or unknown triplet. After that, the embedding matrices associated with the embedding vectors of the entities and relations are calculated to represent
$\mathcal{X}$ as some factors of a tensor. Finally, the low-dimensional representation for each entity and relation is generated along with these embedding matrices.
RESCAL [61] applies a tensor to express the inherent structure of a KG and uses a rank-$d$ factorization to obtain its latent semantics. The principle that this method follows is formulated as:
$${\mathcal{X}}_{k}\approx A{R}_{k}{A}^{\top},\phantom{\rule{1em}{0ex}}k=1,2,...,m \tag{21}$$
where $A\in {\mathbb{R}}^{n\times d}$ is a matrix that captures the latent semantic representation of entities and ${R}_{k}\in {\mathbb{R}}^{d\times d}$ is a matrix that models the pairwise interactions in the kth relation. According to this principle, the scoring function ${f}_{r}(h,t)$ for a triplet $(h,r,t)$ is defined as:
$${f}_{r}(h,t)={\mathbf{h}}^{\top}{\mathbf{M}}_{r}\mathbf{t}$$
Here,
$\mathbf{h}$,
$\mathbf{t}\in {\mathbb{R}}^{d}$ are the embedding vectors of entities in the graph and the matrix
${\mathbf{M}}_{r}\in {\mathbb{R}}^{d\times d}$ represents the latent semantic meanings in the relation. It is worth noting that
${\mathbf{h}}_{i}$ and
${\mathbf{t}}_{j}$, which represent the embedding vectors of the
ith and
jth entity, are actually assigned by the values of the
ith and
jth row of matrix A in Equation (
21). A more complex version of
${f}_{r}(h,t)$ was proposed by García-Durán et al. [
62], which extends RESCAL by introducing well-controlled two-way interactions into the scoring function.
However, this method requires $\mathcal{O}({N}_{e}d+{N}_{r}{k}^{2})\phantom{\rule{4pt}{0ex}}(d=k)$ parameters. To reduce the complexity of RESCAL, DistMult [63] restricts ${\mathbf{M}}_{r}$ to be diagonal matrices, i.e., ${\mathbf{M}}_{r}=\mathrm{diag}\left(\mathbf{r}\right),\mathbf{r}\in {\mathbb{R}}^{d}$. The scoring function is transformed to:
$${f}_{r}(h,t)={\mathbf{h}}^{\top}\mathrm{diag}\left(\mathbf{r}\right)\mathbf{t}$$
DistMult not only reduces the complexity to $\mathcal{O}({N}_{e}d+{N}_{r}k)\phantom{\rule{4pt}{0ex}}(d=k)$; the experimental results also indicate that this simple formulation achieves a remarkable improvement over the others.
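The relationship between the RESCAL and DistMult scores can be checked numerically. The following plain-Python sketch uses hypothetical function names and toy vectors; it shows that a diagonal relation matrix yields the same value for $(h,r,t)$ and $(t,r,h)$, whereas a full matrix need not.

```python
def rescal_score(h, M_r, t):
    """Bilinear form h^T M_r t with a full d x d relation matrix."""
    return sum(h[i] * M_r[i][j] * t[j]
               for i in range(len(h)) for j in range(len(t)))

def distmult_score(h, r, t):
    """h^T diag(r) t, i.e., a three-way elementwise product summed over dimensions."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

h, t = [1.0, 2.0], [3.0, 4.0]
r = [0.5, -1.0]
M_full = [[0.0, 1.0],
          [0.0, 0.0]]           # a non-symmetric relation matrix

print(distmult_score(h, r, t), distmult_score(t, r, h))    # equal by construction
print(rescal_score(h, M_full, t), rescal_score(t, M_full, h))
```

The symmetry of the diagonal form is the limitation that motivates the asymmetric models discussed next.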
HolE [64] introduced a circular correlation operation [65], denoted as $\star :{\mathbb{R}}^{d}\times {\mathbb{R}}^{d}\to {\mathbb{R}}^{d}$, between head and tail entities to capture the compositional representations of pairwise entities. The formulation of this operation is shown as follows:
$${[\mathbf{a}\star \mathbf{b}]}_{k}=\sum _{i=0}^{d-1}{a}_{i}{b}_{(k+i)\phantom{\rule{4pt}{0ex}}\mathrm{mod}\phantom{\rule{4pt}{0ex}}d}$$
The circular correlation operation has the significant advantage that it reduces the complexity of the composite representation compared to the tensor product. Moreover, its computation can be accelerated via:
$$\mathbf{a}\star \mathbf{b}={\mathcal{F}}^{-1}\left(\overline{\mathcal{F}(\mathbf{a})}\odot \mathcal{F}(\mathbf{b})\right)$$
Here, $\mathcal{F}(\cdot )$ indicates the fast Fourier transform (FFT) [66], ${\mathcal{F}}^{-1}(\cdot )$ denotes its inverse, $\odot $ is the elementwise product, and $\overline{\mathbf{a}}$ denotes the complex conjugate of $\mathbf{a}$. Thus, the scoring function corresponding to HolE is defined as:
$${f}_{r}(h,t)={\mathbf{r}}^{\top}(\mathbf{h}\star \mathbf{t})$$
It is noteworthy that circular correlation is not commutative, i.e., $\mathbf{h}\star \mathbf{t}\ne \mathbf{t}\star \mathbf{h}$; this property is indispensable for modeling asymmetric relations in the semantic representation space.
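The non-commutativity of circular correlation is easy to verify directly from its definition. The sketch below is a naive $\mathcal{O}({d}^{2})$ implementation in plain Python (the FFT-based shortcut is omitted); the function names and the one-hot toy vectors are assumptions for illustration.

```python
def circular_correlation(a, b):
    """[a * b]_k = sum_i a_i * b_{(k + i) mod d}."""
    d = len(a)
    return [sum(a[i] * b[(k + i) % d] for i in range(d)) for k in range(d)]

def hole_score(h, r, t):
    """r^T (h star t)."""
    return sum(ri * ci for ri, ci in zip(r, circular_correlation(h, t)))

h = [1, 0, 0]
t = [0, 1, 0]
r = [0, 1, 0]

print(circular_correlation(h, t), circular_correlation(t, h))
print(hole_score(h, r, t), hole_score(t, r, h))
```

Swapping head and tail changes the correlation vector, so the same relation vector can score the two directions differently.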
The original DistMult model is symmetric in head and tail entities for every relation; ComplEx [67] leverages complex-valued embeddings to extend DistMult to asymmetric relations. The embeddings of entities and relations lie in the complex space ${\mathbb{C}}^{d}$, instead of the real space ${\mathbb{R}}^{d}$ used by DistMult. The scoring function is modified to:
$${f}_{r}(h,t)=\mathrm{Re}\left({\mathbf{h}}^{\top}\mathrm{diag}\left(\mathbf{r}\right)\overline{\mathbf{t}}\right)$$
where $\mathrm{Re}(\cdot )$ denotes the real part of a complex value, and $\overline{\mathbf{t}}$ represents the complex conjugate of $\mathbf{t}$. By using this scoring function, triplets that have asymmetric relations can obtain different scores depending on the order of the entities.
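Python's built-in complex numbers make the asymmetry of the ComplEx score easy to inspect. In this one-dimensional sketch (function name and values are assumptions), a purely imaginary relation embedding flips the sign of the score when head and tail are swapped, while a purely real one reduces to the symmetric DistMult behavior.

```python
def complex_score(h, r, t):
    """Re(sum_i h_i * r_i * conj(t_i)) for complex-valued embeddings."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real

h = [1 + 2j]
t = [3 - 1j]
r_imag = [1j]        # purely imaginary relation: antisymmetric score
r_real = [2 + 0j]    # purely real relation: symmetric score

print(complex_score(h, r_imag, t), complex_score(t, r_imag, h))
print(complex_score(h, r_real, t), complex_score(t, r_real, h))
```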
In order to address the independence of the entity embedding vectors in Canonical Polyadic (CP) decomposition, SimplE [68] proposes a simple enhancement of CP which introduces the reverse of relations and computes the average CP score of $(h,r,t)$ and $(t,{r}^{-1},h)$:
$${f}_{r}(h,t)=\frac{1}{2}\left({(\mathbf{h}\circ \mathbf{r})}^{\top}\mathbf{t}+{(\mathbf{t}\circ {\mathbf{r}}^{\prime})}^{\top}\mathbf{h}\right)$$
where ∘ is the elementwise multiplication and ${\mathbf{r}}^{\prime}$ indicates the embedding vector of the reverse relation. Inspired by Euler’s identity ${e}^{i\theta}=cos\theta +isin\theta $, RotatE [69] introduces the rotational Hadamard product; it regards the relation as a rotation from the head entity to the tail entity in complex space. The score function is defined as follows:
$${f}_{r}(h,t)=\Vert \mathbf{h}\circ \mathbf{r}-\mathbf{t}\Vert $$
where each element of $\mathbf{r}$ has unit modulus.
QuatE [70] extends the complex space to a four-dimensional hypercomplex space, $\mathbf{h},\mathbf{t},\mathbf{r}\in {\mathbb{H}}^{d}$, and utilizes the Hamilton product to capture latent interdependencies between entities and relations in this space:
$${f}_{r}(h,t)=\mathbf{h}\otimes \frac{\mathbf{r}}{|\mathbf{r}|}\cdot \mathbf{t}$$
where ⊗ denotes the Hamilton product.
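For readers unfamiliar with quaternions, the Hamilton product can be written out explicitly. The sketch below represents a quaternion $a+b\mathbf{i}+c\mathbf{j}+d\mathbf{k}$ as a 4-tuple and scores a triplet by rotating each head coordinate with the normalized relation quaternion and taking the inner product with the tail. The function names and the coordinate-wise scoring loop are assumptions of this illustration, not the reference implementation.

```python
def hamilton(p, q):
    """Hamilton product of quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2)

def quate_score(h, r, t):
    """Sum over coordinates of <h_i (x) r_i / |r_i|, t_i>."""
    total = 0.0
    for hq, rq, tq in zip(h, r, t):
        n = sum(x * x for x in rq) ** 0.5
        rotated = hamilton(hq, tuple(x / n for x in rq))
        total += sum(x * y for x, y in zip(rotated, tq))
    return total

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
print(hamilton(i, j), hamilton(j, i))     # the product is not commutative
print(quate_score([(1, 0, 0, 0)], [(0, 2, 0, 0)], [(0, 1, 0, 0)]))
```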
Table 3 summarizes the scoring function and memory complexity for all tensor factorization-based models.
2.2.3. Neural Network-Based Models
Deep learning is a very popular and extensively used tool in many different fields [71,72,73,74]. Neural networks are models with strong representation and generalization capabilities that can express complicated nonlinear projections. In recent years, embedding a knowledge graph into a continuous feature space with a neural network has become a hot topic.
SME [
21] defines an energy function for semantic matching, which can be used to measure the confidence of each observed fact
$(h,r,t)$ by utilizing neural networks. As shown in
Figure 4a, in SME, each entity and relation is firstly embedded to its embedding feature space. Then, to capture intrinsic connections between entities and relations, two projection matrices are applied. Finally, the semantic matching energy associated with each triplet is computed by a fully connected layer. More specifically, given a triplet
$(h,r,t)$ in
Figure 4a, SME represents entities and relations to the semantic feature space by the input layer. The head entity vector
$\mathbf{h}$ is then combined with the relation vector
$\mathbf{r}$ to acquire latent connections (i.e.,
${g}_{left}(\mathbf{h},\mathbf{r})$). Likewise,
${g}_{right}(\mathbf{t},\mathbf{r})$ is generated from the tail entity and the relation vectors
$\mathbf{t}$ and
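$\mathbf{r}$. Bordes et al. provided two types of $g(\cdot )$ functions, so there are two versions of SME. SME (linear) is formulated in Equation (31):
$${g}_{left}(\mathbf{h},\mathbf{r})={\mathbf{M}}_{1}\mathbf{h}+{\mathbf{M}}_{2}\mathbf{r}+{\mathbf{b}}_{1},\phantom{\rule{1em}{0ex}}{g}_{right}(\mathbf{t},\mathbf{r})={\mathbf{M}}_{3}\mathbf{t}+{\mathbf{M}}_{4}\mathbf{r}+{\mathbf{b}}_{2} \tag{31}$$
SME (bilinear) is formulated in Equation (32):
$${g}_{left}(\mathbf{h},\mathbf{r})=({\mathbf{M}}_{1}\mathbf{h})\circ ({\mathbf{M}}_{2}\mathbf{r})+{\mathbf{b}}_{1},\phantom{\rule{1em}{0ex}}{g}_{right}(\mathbf{t},\mathbf{r})=({\mathbf{M}}_{3}\mathbf{t})\circ ({\mathbf{M}}_{4}\mathbf{r})+{\mathbf{b}}_{2} \tag{32}$$
where ${\mathbf{M}}_{1},...,{\mathbf{M}}_{4}\in {\mathbb{R}}^{d\times d}$ are the weight matrices and ${\mathbf{b}}_{1},{\mathbf{b}}_{2}$ denote the bias vectors. Finally, ${g}_{left}(\mathbf{h},\mathbf{r})$ and ${g}_{right}(\mathbf{t},\mathbf{r})$ are concatenated to obtain the energy score ${f}_{r}(\mathbf{h},\mathbf{t})$ via a fully connected layer.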
NTN [20] proposes a neural tensor network to calculate the energy score ${f}_{r}(h,t)$. It replaces the standard linear layer of a traditional neural network with a bilinear tensor layer. As shown in Figure 4b, given an observed triplet $(h,r,t)$, the first layer embeds the entities into their embedding feature space. The second, nonlinear hidden layer takes three inputs: the head entity vector $\mathbf{h}$, the tail entity vector $\mathbf{t}$, and a relation-specific tensor ${\mathbf{T}}_{r}\in {\mathbb{R}}^{d\times d\times k}$. The entity vectors $\mathbf{h}$, $\mathbf{t}$ are also mapped to a high-level representation via the projection matrices ${\mathbf{M}}_{r1},{\mathbf{M}}_{r2}\in {\mathbb{R}}^{k\times d}$, respectively. These three elements are then fed to the nonlinear layer to combine semantic information. Finally, the energy score is obtained through a relation-specific linear output layer. The score function is defined as follows:
$${f}_{r}(h,t)={\mathbf{r}}^{\top}f\left({\mathbf{h}}^{\top}{\mathbf{T}}_{r}\mathbf{t}+{\mathbf{M}}_{r1}\mathbf{h}+{\mathbf{M}}_{r2}\mathbf{t}+{\mathbf{b}}_{r}\right)$$
where $f\left(x\right)=tanhx$ indicates an activation function, $\mathbf{r}\in {\mathbb{R}}^{k}$ is the weight vector of the relation-specific output layer, and ${\mathbf{b}}_{r}\in {\mathbb{R}}^{k}$ denotes a bias, which belongs to a standard neural network layer. A simpler version of this model, called the single layer model (SLM), is proposed in the same paper, as shown in Figure 4c. This is a special case of NTN, in which ${\mathbf{T}}_{r}=0$. The scoring function is simplified to the following form:
$${f}_{r}(h,t)={\mathbf{r}}^{\top}f\left({\mathbf{M}}_{r1}\mathbf{h}+{\mathbf{M}}_{r2}\mathbf{t}+{\mathbf{b}}_{r}\right)$$
NTN requires a relation-specific tensor ${\mathbf{T}}_{r}$ for each relation, so the number of parameters in this model is huge and it is impractical to apply it to large-scale KGs. MLP [75] provides a lightweight architecture in which all relations share the same parameters. The entities and relation in a triplet fact $(h,r,t)$ are simultaneously projected into the embedding space in the input layer, and a nonlinear hidden layer then combines them into a higher-level representation that is used to score the plausibility. The scoring function ${f}_{r}(h,t)$ is formulated as follows:
$${f}_{r}(h,t)={\mathbf{m}}^{\top}f\left({\mathbf{M}}_{1}\mathbf{h}+{\mathbf{M}}_{2}\mathbf{r}+{\mathbf{M}}_{3}\mathbf{t}\right)$$
Here, $f\left(x\right)=tanhx$ is an activation function, and ${\mathbf{M}}_{1},{\mathbf{M}}_{2},{\mathbf{M}}_{3}\in {\mathbb{R}}^{d\times d}$ represent the mapping matrices that project the embedding vectors $\mathbf{h},\mathbf{r}$, and $\mathbf{t}$ to the second layer; $\mathbf{m}\in {\mathbb{R}}^{d}$ are the second layer parameters.
Since the emergence of deep learning technology, more and more studies have been proposed for deep neural networks (DNNs) [
71,
76,
77]. NAM [
78] establishes a deep neural network structure to represent a knowledge graph. As illustrated in
Figure 4d, given a triplet
$(h,r,t)$, NAM firstly embeds each element in the triplet to embedding feature space. Then, the head entity vector
$\mathbf{h}$ and the relation vector
$\mathbf{r}$ are concatenated to a single vector
${\mathbf{z}}^{0}=[\mathbf{h};\mathbf{r}]$ as the input to a network of $L+1$ layers, the first L of which are rectified linear (ReLU) [79] hidden layers:
$${\mathbf{z}}^{\ell}=\mathrm{ReLU}\left({\mathbf{M}}^{\ell}{\mathbf{z}}^{\ell -1}+{\mathbf{b}}^{\ell}\right),\phantom{\rule{1em}{0ex}}\ell =1,2,...,L$$
where ${\mathbf{M}}^{\ell}$ is the weight matrix and ${\mathbf{b}}^{\ell}$ is the bias for layer ℓ. Finally, a score is calculated by combining the output of the last hidden layer with the tail entity vector $\mathbf{t}$:
$${f}_{r}(h,t)=\sigma \left({\mathbf{z}}^{L}\cdot \mathbf{t}\right)$$
where $\sigma (\cdot )$ is a sigmoid activation function. A more complicated model, called relation-modulated neural networks (RMNN), is proposed in the same paper. Figure 4e shows an illustration of this model. Compared with NAM, it adds a knowledge-specific connection (i.e., the relation embedding $\mathbf{r}$) to all hidden layers in the neural network. The layers are defined as follows:
$${\mathbf{z}}^{\ell}=\mathrm{ReLU}\left({\mathbf{M}}^{\ell}{\mathbf{z}}^{\ell -1}+{\mathbf{B}}^{\ell}\mathbf{r}\right),\phantom{\rule{1em}{0ex}}\ell =1,2,...,L$$
where ${\mathbf{M}}^{\ell}$ and ${\mathbf{B}}^{\ell}$ denote the weight and bias matrices for layer ℓ, respectively. After the feedforward process, RMNN yields a final score from the last hidden layer’s output and the concatenation of the tail entity vector $\mathbf{t}$ and the relation vector $\mathbf{r}$.
ConvKB [80] captures latent semantic information in the triplets by introducing a convolutional neural network (CNN) for knowledge graph embedding. In the model, each triplet fact $(h,r,t)$ is represented as a three-row matrix, in which each element is transformed into a row vector. The matrix is fed to a convolution layer to yield multiple feature maps. These feature maps are then concatenated and projected to a score that is used to estimate the authenticity of the triplet via the dot product operation with the weight vector.
More specifically, as illustrated in
Figure 4f,
$\gamma =3$. First, the embedding vectors
$\mathbf{h}$,
$\mathbf{r}$, and
$\mathbf{t}\in {\mathbb{R}}^{d}$ are viewed as a matrix
$\mathbf{A}=[\mathbf{h};\mathbf{r};\mathbf{t}]\in {\mathbb{R}}^{3\times d}$, and
${\mathbf{A}}_{i:}\in {\mathbb{R}}^{3\times 1}$ indicates the
ith column of matrix
$\mathbf{A}$. After that, the filter
$\mathbf{m}\in {\mathbb{R}}^{3\times 1}$ is slid over the input matrix to explore the local features and obtain a feature map $\mathbf{a}=[{a}_{1},{a}_{2},...,{a}_{d}]\in {\mathbb{R}}^{d}$, such that:
$${a}_{i}=g\left(\mathbf{m}\cdot {\mathbf{A}}_{i:}+b\right)$$
where $g(\cdot )$ signifies the ReLU activation function and b indicates the bias. In this instance, there are three feature maps corresponding to three filters. Finally, these feature maps are concatenated into a representation vector $\in {\mathbb{R}}^{3d}$, which is combined with a weight vector $\mathbf{w}\in {\mathbb{R}}^{3d}$ by a dot product operation. The scoring function is defined as follows:
$${f}_{r}(h,t)=\mathrm{C}\left(g\left(\mathbf{A}\ast \mathrm{\Omega}\right)\right)\cdot \mathbf{w}$$
Here, $\mathrm{\Omega}$ is the set of filters, and $\mathbf{A}\ast \mathrm{\Omega}$ denotes that a convolution operation is applied to matrix $\mathbf{A}$ via the filters in the set $\mathrm{\Omega}$. $\mathrm{C}$ is the concatenation operator. It is worth mentioning that $\mathrm{\Omega}$ and $\mathbf{w}$ are shared parameters; they are generalized for all entities and relations.
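The ConvKB pipeline (slide each filter over the columns, apply ReLU, concatenate, take a dot product) can be sketched in plain Python. The function signature, the two hand-picked filters, and the toy vectors are assumptions made for illustration.

```python
def relu(x):
    return max(0.0, x)

def convkb_score(h, r, t, filters, biases, w):
    """Concatenate the feature maps of all filters and project them with w.

    Each filter m holds 3 weights and is applied to every column (h_i, r_i, t_i).
    """
    features = []
    for m, b in zip(filters, biases):
        features.extend(relu(m[0] * hi + m[1] * ri + m[2] * ti + b)
                        for hi, ri, ti in zip(h, r, t))
    return sum(f * wi for f, wi in zip(features, w))

# Hypothetical example with d = 2 and two filters, so w has 2 * 2 entries.
h, r, t = [1.0, 0.0], [1.0, 2.0], [1.0, 1.0]
filters = [[1.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
biases = [0.0, 0.0]
w = [0.5, 0.5, 1.0, -1.0]
print(convkb_score(h, r, t, filters, biases, w))
```

Because the filters and the weight vector do not depend on the triplet, the same parameters are reused for every entity and relation.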
In recent years, graph neural networks (GNNs) have attracted more attention due to their strong ability to represent graph structure. R-GCN [81] is an improved model that provides relation-specific transformations to represent knowledge graphs. The forward propagation is formulated as follows:
$${x}_{i}^{(l+1)}=\sigma \left(\sum _{r\in R}\sum _{j\in {N}_{i}^{r}}\frac{1}{{c}_{i,r}}{W}_{r}^{\left(l\right)}{x}_{j}^{\left(l\right)}+{W}_{o}^{\left(l\right)}{x}_{i}^{\left(l\right)}\right)$$
where ${x}_{i}^{\left(l\right)}\in {\mathbb{R}}^{{d}^{\left(l\right)}}$ signifies the hidden state of the entity i in the lth layer, $\sigma (\cdot )$ is an activation function, ${N}_{i}^{r}$ indicates the set of neighbors connected to entity i by relation $r\in R$, ${W}_{r}^{\left(l\right)}$ and ${W}_{o}^{\left(l\right)}$ are the weight matrices, and ${c}_{i,r}$ denotes a normalization constant such as ${c}_{i,r}=|{N}_{i}^{r}|$.
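A single propagation step of this kind can be sketched in plain Python. The data layout (dictionaries keyed by entity and by entity-relation pair), the function names, and the use of ReLU as the activation are assumptions of this illustration.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rgcn_layer(x, neighbors, W_rel, W_self):
    """One relational graph convolution step.

    x:         dict entity -> hidden vector
    neighbors: dict (entity, relation) -> list of neighbor entities
    W_rel:     dict relation -> weight matrix; W_self: self-connection matrix
    """
    out = {}
    for i, xi in x.items():
        acc = matvec(W_self, xi)
        for rel, W_r in W_rel.items():
            nbrs = neighbors.get((i, rel), [])
            for j in nbrs:
                msg = matvec(W_r, x[j])
                # Normalize by c_{i,r} = |N_i^r|.
                acc = [a + m / len(nbrs) for a, m in zip(acc, msg)]
        out[i] = [max(0.0, a) for a in acc]   # ReLU (an assumption)
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
x = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [0.0, 4.0]}
neighbors = {("a", "r"): ["b", "c"]}
print(rgcn_layer(x, neighbors, {"r": identity}, identity))
```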
Inspired by burgeoning generative adversarial networks (GANs), Cai et al. [
82] proposed a generative adversarial learning framework to improve the performance of the existing knowledge graph representation models and named it KBGAN. KBGAN’s innovative idea is to apply a KG embedding model as the generator to obtain plausible negative samples, and leverage the positive samples and generated negative samples to train the discriminator, which is the embedding model we desire.
A simple overview introducing the framework is shown in
Figure 5. There is a ground truth triplet
$(Microsoft,LocatedIn,Redmond)$, which is corrupted by removing its tail entity, giving $(Microsoft,LocatedIn,?)$. The corrupted triplet is fed as an input into a generator (G) that produces a probability distribution over the candidate negative triplets. Afterwards, the triplet with the highest probability, $(Microsoft,LocatedIn,SanFrancisco)$, is sampled as the output of the generator. The discriminator (D) takes the generated negative triplet and the original true triplet as input to train the model, and computes their score d, which indicates the plausibility of the triplet. The two dotted lines in the figure denote the error feedback in the generator and discriminator. Note that each triplet fact-based representation learning model mentioned above can be employed as the generator or discriminator in the KBGAN framework to improve the embedding performance.
Table 4 illustrates the scoring function and memory complexity of each neural network-based model.
The above three types of triplet fact-based models focus on modifying the scoring function ${f}_{r}(h,t)$, and their other training procedures are roughly uniform. The detailed optimization process is illustrated in Algorithm 1. First, all entity and relation embedding vectors are randomly initialized. In each iteration, a subset of triplets ${S}_{batch}$ is constructed by sampling from the training set S, and it is fed into the model as the minibatch. For each triplet in ${S}_{batch}$, a corrupted triplet is generated by replacing the head or tail entity with another one. After that, each original triplet in ${S}_{batch}$ is paired with its corrupted triplet, and the pairs are collected into a batch training set ${T}_{batch}$. Finally, the parameters are updated by an optimization method.
Here, the set of corrupted triplets ${S}^{\prime}$ is generated according to ${S}^{\prime}_{(h,r,t)}=\{({h}^{\prime},r,t)\mid {h}^{\prime}\in E\}\cup \{(h,r,{t}^{\prime})\mid {t}^{\prime}\in E\}$, and there are two alternative versions of the loss function. The margin-based loss function is defined as follows:
$$\mathcal{L}=\sum _{(h,r,t)\in S}\sum _{({h}^{\prime},r,{t}^{\prime})\in {S}^{\prime}}\max \left(0,\;{f}_{r}(h,t)+\gamma -{f}_{r}({h}^{\prime},{t}^{\prime})\right)$$
and the logistic loss function is:
$$\mathcal{L}=\sum _{(h,r,t)\in S\cup {S}^{\prime}}\log \left(1+\exp \left(-{y}_{hrt}\cdot {f}_{r}(h,t)\right)\right)$$
where $\gamma >0$ is a margin hyperparameter and ${y}_{hrt}\in \{-1,1\}$ is a label that indicates the category (negative or positive) to which a given training triplet $(h,r,t)$ belongs. Both functions can be minimized by stochastic gradient descent (SGD) [83] or Adam [84].
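The two loss functions can be written per training example as follows. This is a minimal sketch: the function names are hypothetical, and the sign conventions are assumptions (the margin loss treats the score as a distance where smaller is better, whereas the logistic loss is small when a positively labeled triplet has a large score).

```python
import math

def margin_loss(f_pos, f_neg, gamma=1.0):
    """Hinge ranking loss for one (correct, corrupted) pair of distance scores."""
    return max(0.0, f_pos + gamma - f_neg)

def logistic_loss(score, y):
    """log(1 + exp(-y * score)) for a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * score))

print(margin_loss(0.2, 2.0), margin_loss(1.0, 1.5))
print(logistic_loss(5.0, 1), logistic_loss(5.0, -1))
```

The margin loss vanishes once the corrupted triplet is at least $\gamma $ worse than the correct one, so well-separated pairs stop contributing gradients.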
Algorithm 1 Learning triplet fact-based models.
Input: The training set $S=\{(h,r,t)\}$, entity set E, relation set R, embedding dimension k
Output: Entity and relation embeddings
 1: initialize the entity embeddings $\mathbf{e}$ and relation embeddings $\mathbf{r}$
 2: loop
 3:   ${S}_{batch}\leftarrow sample(S,b)$ // sample a subset from S with minibatch size b
 4:   ${T}_{batch}\leftarrow \varnothing $ // initialize the set of pairs of triplets
 5:   for $(h,r,t)\in {S}_{batch}$ do
 6:     $({h}^{\prime},r,{t}^{\prime})\leftarrow sample\left({S}^{\prime}\right)$ // sample a corrupted triplet
 7:     ${T}_{batch}\leftarrow {T}_{batch}\cup \{\left((h,r,t),({h}^{\prime},r,{t}^{\prime})\right)\}$
 8:   end for
 9:   Update embeddings by minimizing the loss function
10: end loop
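Algorithm 1 can be sketched in plain Python for one concrete choice of model. The sketch below assumes the TransE score with a squared ${\ell}_{2}$ distance, the margin-based loss, and plain SGD with hand-derived gradients; all function names, the initialization range, and the hyperparameter defaults are assumptions for illustration, and corrupted triplets are not filtered against the training set.

```python
import random

def score(E, R, h, r, t):
    """Squared l2 TransE distance (one possible choice of f_r)."""
    return sum((a + b - c) ** 2 for a, b, c in zip(E[h], R[r], E[t]))

def sgd_step(E, R, pair, gamma=1.0, lr=0.01):
    """One SGD update on max(0, f(pos) + gamma - f(neg)); returns the loss before the update."""
    (h, r, t), (h2, _, t2) = pair
    loss = max(0.0, score(E, R, h, r, t) + gamma - score(E, R, h2, r, t2))
    if loss > 0.0:
        d_pos = [a + b - c for a, b, c in zip(E[h], R[r], E[t])]
        d_neg = [a + b - c for a, b, c in zip(E[h2], R[r], E[t2])]
        for i, (p, n) in enumerate(zip(d_pos, d_neg)):
            # Gradient of the squared distance: +2p for h and r, -2p for t;
            # the corrupted triplet enters the loss with the opposite sign.
            E[h][i] -= lr * 2 * p
            E[t][i] += lr * 2 * p
            E[h2][i] += lr * 2 * n
            E[t2][i] -= lr * 2 * n
            R[r][i] -= lr * 2 * (p - n)
    return loss

def train(S, entities, relations, k=4, epochs=200, b=2, seed=0):
    rng = random.Random(seed)
    E = {e: [rng.uniform(-0.5, 0.5) for _ in range(k)] for e in entities}   # line 1
    R = {r: [rng.uniform(-0.5, 0.5) for _ in range(k)] for r in relations}
    for _ in range(epochs):                                                  # line 2
        S_batch = rng.sample(S, b)                                           # line 3
        T_batch = []                                                         # line 4
        for (h, r, t) in S_batch:                                            # lines 5-8
            if rng.random() < 0.5:
                corrupted = (rng.choice(entities), r, t)
            else:
                corrupted = (h, r, rng.choice(entities))
            T_batch.append(((h, r, t), corrupted))
        for pair in T_batch:                                                 # line 9
            sgd_step(E, R, pair)
    return E, R

S = [("a", "r", "b"), ("b", "r", "c")]
E, R = train(S, ["a", "b", "c"], ["r"])
print({e: [round(v, 3) for v in vec] for e, vec in E.items()})
```

Swapping in another scoring function only requires changing `score` and the corresponding gradient in `sgd_step`; the sampling loop stays the same, which is the uniformity the algorithm is meant to convey.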
