A Survey on Knowledge Graph Embedding: Approaches, Applications and Benchmarks

Abstract: A knowledge graph (KG), also known as a knowledge base, is a particular kind of network structure in which nodes represent entities and edges represent relations. However, with the explosion of network volume, data sparsity has made large-scale KGs increasingly difficult to compute over and manage. To alleviate this issue, knowledge graph embedding has been proposed to embed the entities and relations of a KG into a low-dimensional, dense, and continuous feature space, endowing the resulting model with the ability to perform knowledge inference and fusion. In recent years, many researchers have devoted considerable attention to this approach, and in this paper we systematically introduce the existing state-of-the-art approaches and a variety of applications that benefit from them. In addition, we discuss future prospects for the development of these techniques and for application trends. Specifically, we first introduce the embedding models that leverage only the observed triplets in the KG. We illustrate their overall frameworks and specific ideas and compare the advantages and disadvantages of such approaches. Next, we introduce advanced models that utilize additional semantic information to improve the performance of the original methods. We divide this additional information into two categories: textual descriptions and relation paths. The extension approaches in each category are described following the same classification criteria as those defined for the triplet fact-based models. We then describe two experiments comparing the performance of the listed methods and mention some broader tasks such as question answering and recommender systems. Finally, we collect several hurdles that need to be overcome and provide a few future research directions for knowledge graph embedding.


Introduction
Numerous large-scale knowledge graphs, such as SUMO [1], YAGO [2], Freebase [3], Wikidata [4], and DBpedia [5], have been released in recent years. These KGs have become a significant resource for many natural language processing (NLP) applications, from named entity recognition [6,7] and entity disambiguation [8,9] to question answering [10,11] and information extraction [12,13]. In addition, as an applied technology, a knowledge graph also supports specific applications in many industries. For instance, it can provide visual knowledge representation for drug analysis and disease diagnosis in the field of medicine [14,15]; in the field of e-commerce, it can be used to construct a product knowledge graph. However, large-scale KGs suffer from data sparsity and computational inefficiency. To tackle these challenges, knowledge graph embedding has been proposed and has attracted much attention, as it can map a knowledge graph into a dense, low-dimensional feature space [19][20][21][22][23][24][25]; it can then efficiently calculate the semantic relations between entities in this low-dimensional space and effectively mitigate the problems of computational complexity and data sparsity. This method can further be used to explore new knowledge from existing facts (link prediction [19,23]), disambiguate entities (entity resolution [22,24]), extract relations (relation classification [26,27]), etc. The embedding procedure is described as follows. Given a KG, the entities and relations are first randomly represented in a low-dimensional vector space, and an evaluation function is defined to measure the plausibility of each fact triplet. At each iteration, the embedding vectors of entities and relations are then updated by maximizing the global plausibility of the facts with some optimization algorithm. Even though there is a large body of successful research on modeling relational facts, most of it can only train an embedding model on a dataset of observed triplets.
Thereupon, an increasing number of studies focus on learning more generalizable KG embedding models by absorbing additional information, such as entity types [28,29], relation paths [30][31][32], and textual descriptions [33][34][35].
Generally, knowledge graph embedding utilizes distributed representation technology to alleviate the issues of data sparsity and computational inefficiency. This approach has three crucial advantages.

•
The data sparsity problem is effectively mitigated, because all elements in a KG, including entities and relations, are embedded into a continuous low-dimensional feature space.

•
Compared with the traditional one-hot representation, KG embedding employs a distributed representation to transform the original KG, which effectively improves the efficiency of semantic computing.

•
Representation learning uses a unified feature space to connect heterogeneous objects to each other, thereby achieving fusion and calculation between different types of information.
In this paper, we provide a detailed analysis of the current KG embedding technologies and applications. We systematically describe how the existing techniques address data sparsity and computation inefficiency problems, including the thoughts and technical solutions offered by the respective researchers. Furthermore, we introduce a wide variety of applications that benefit from KG embedding. Although a few surveys about KG representation learning have been published [36,37], we focus on a different aspect compared with these articles. Cai et al. [36] performed a survey of graph embedding, including homogeneous graphs [38][39][40], heterogeneous graphs [41][42][43], graphs with auxiliary information [28,44,45], and graphs constructed from non-relational data [46][47][48]. Compared with their work, we focus more specifically on KG embedding, which falls under heterogeneous graphs. In contrast to the survey completed by Wang et al. [37], we describe various applications to which KG embedding applies and compare the performance of the methods in these applications.
The rest of this article is organized as follows. In Section 2, we introduce the basic symbols and the formal problem definition of knowledge graph embedding and discuss embedding techniques, illustrating the general framework and training process of the models. In Section 3, we explore the applications supported by KG embedding and compare the performance of the above representation learning models within the same applications. Finally, we present our conclusions in Section 4 and look forward to future research directions.

Knowledge Graph Embedding Models
In this section, we first declare some notations and their corresponding explanations that will be used in the rest of this paper. Afterward, we give a general definition of the knowledge graph representation learning problem. Detailed explanations of the notations are given in Table 1. Table 1. Detailed explanation of notations.

Notations Explanations
h, r, t      Head entity h, tail entity t, and relation r
h, r, t      The embedding vectors corresponding to h, r, t
x_i          The i-th element of vector x
A            A numerical matrix
A_ij         The element in the i-th row and j-th column of matrix A
d            The dimensionality of entities in the embedding space
k            The dimensionality of relations in the embedding space

Notation and Problem Definition
Problem 1. Knowledge graph embedding: Given a KG composed of a collection of triplet facts Ω = {< h, r, t >} and a pre-defined dimension of the embedding space d (to simplify the problem, we place entities and relations in a uniform embedding space, i.e., d = k), KG embedding aims to represent each entity h ∈ E and relation r ∈ R in a d-dimensional continuous vector space, where E and R denote the sets of entities and relations, respectively. In other words, a KG is represented as a set of d-dimensional vectors that capture the information of the graph, in order to simplify computations on the KG.
Knowledge graph embedding aims to map a KG into a dense, low-dimensional feature space that preserves as much structural and property information of the graph as possible and aids in calculations over the entities and relations. In recent years, it has become a research hotspot, and many researchers have put forward a variety of models. The differences between the various embedding algorithms relate to three aspects: (i) how they represent entities and relations, or in other words, how they define the representation forms of entities and relations; (ii) how they define the scoring function; and (iii) how they optimize the ranking criterion that maximizes the global plausibility of the existing triplets. Different models offer different insights and approaches with respect to these aspects.
We have broadly classified the existing methods into two categories: triplet fact-based representation learning models and description-based representation learning models. In this section, we first clarify the ideas behind the algorithms in these two types of graph embedding models, as well as the procedures by which they solve the representation problem. After that, the training procedures for these models are discussed in detail. It is worth noting that, due to space limitations, we cannot enumerate all relevant knowledge graph embedding methods. Therefore, we only describe some representative, highly cited algorithms with available implementations.

Triplet Fact-Based Representation Learning Models
Triplet fact-based embedding models treat the knowledge graph as a set of triplets containing all observed facts. In this section, we introduce three groups of embedding models: translation-based models, tensor factorization-based models, and neural network-based models.

Translation-Based Models
Since Mikolov et al. [49,50] proposed a word embedding algorithm and its toolkit word2vec, distributed representation learning has attracted more and more attention. Using that model, the authors found interesting translation invariance phenomena in the word vector space, for example:

w(King) − w(Man) ≈ w(Queen) − w(Woman),

where w(·) is the vector of a word transformed by the word2vec model. This result means that the word representation model captures the same implicit semantic relationship between the words "King" ("Man") and "Queen" ("Woman"). Mikolov et al. proved experimentally that the property of translation invariance exists widely in the semantic and syntactic relations of vocabulary. Inspired by word2vec, Bordes et al. [19] introduced the idea of translation invariance into the knowledge graph embedding field and proposed the TransE embedding model. TransE represents all entities and relations in a uniform continuous, low-dimensional feature space R^d, in which relations can be regarded as connection vectors between entities. Let E and R indicate the sets of entities and relations, respectively. For each triplet (h, r, t), the head entity h, tail entity t, and relation r are mapped to the embedding vectors h, t, and r. As illustrated in Figure 2a, for each triplet (h, r, t), TransE follows a geometric principle:

h + r ≈ t when (h, r, t) holds.

The authenticity of a given triplet (h, r, t) is computed via a score function, defined as the distance between h + r and t under an L1-norm or L2-norm constraint:

f_r(h, t) = ||h + r − t||_{L1/L2}.

In order to learn an effective embedding model that can discriminate the authenticity of triplets, TransE minimizes a margin-based hinge ranking loss over the training process:

L = Σ_{(h,r,t) ∈ S} Σ_{(h′,r,t′) ∈ S′} max(0, γ + f_r(h, t) − f_r(h′, t′)),
where S and S′ denote the sets of correct triplets and corrupted triplets, respectively, and γ indicates the margin hyperparameter. In training, TransE stochastically replaces the head or tail entity in each triplet with other candidate entities to generate the corrupted triplet set S′, as shown in Equation (5):

S′ = {(h′, r, t) | h′ ∈ E} ∪ {(h, r, t′) | t′ ∈ E}.
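As a minimal sketch of the scoring and sampling steps above (our own toy code, not the authors' released implementation; `transe_score`, `corrupt`, and `margin_loss` are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def transe_score(h, r, t, norm=1):
    """Distance ||h + r - t|| under the L1 (norm=1) or L2 (norm=2) norm.
    A lower score means the triplet is more plausible."""
    return np.linalg.norm(h + r - t, ord=norm)

def corrupt(triplet, entity_vecs):
    """Replace the head or tail with a random entity to form a corrupted triplet."""
    h, r, t = triplet
    if rng.random() < 0.5:
        return entity_vecs[rng.integers(len(entity_vecs))], r, t
    return h, r, entity_vecs[rng.integers(len(entity_vecs))]

def margin_loss(pos, neg, gamma=1.0):
    """Margin-based hinge ranking loss for one positive/negative pair."""
    return max(0.0, gamma + transe_score(*pos) - transe_score(*neg))

# A triplet that satisfies h + r = t exactly scores zero:
h = np.array([0.1, 0.2]); r = np.array([0.3, -0.2]); t = h + r
print(transe_score(h, r, t))  # 0.0
```

Note that the loss is zero whenever the corrupted triplet already scores worse than the true one by more than the margin γ.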
Although TransE achieved a great advancement in large-scale knowledge graph embedding, it still has difficulty in dealing with complex relations, such as 1−N, N−1, and N−N [51,52]. For instance, consider a 1−N relation in which the head entity has multiple corresponding tail entities, i.e., ∀ i ∈ {1, 2, ..., n}, (h, r, t_i) ∈ S. According to the principle h + r ≈ t_i that TransE follows, all embedding vectors of the tail entities should be approximately similar, i.e., t_1 ≈ t_2 ≈ ... ≈ t_n. More concretely, take the two triplet facts (Elon_Musk, Founder_of, SpaceX) and (Elon_Musk, Founder_of, Tesla), in which Founder_of is a 1−N relation as mentioned above. Following TransE, the embedding vectors of SpaceX and Tesla should be very similar in the feature space. However, this result is clearly irrational because SpaceX and Tesla are two companies in entirely different fields; they merely share Elon_Musk as their founder. In addition, other complex relations such as N−1 and N−N raise the same problem.
To handle this issue with complex relations, TransH [51] extended the original TransE model; it enables each entity to have different embedding representations when the entity is involved in diverse relations. In other words, TransH allows each relation to hold its own relation-specific hyperplane, so an entity has different embedding vectors on different relation hyperplanes. As shown in Figure 2b, for a relation r, TransH employs the relation-specific translation vector d_r and the normal vector of the hyperplane w_r to represent it. For each triplet fact (h, r, t), the embedding vectors h and t are first projected onto the relation-specific hyperplane along the direction of the normal vector w_r (with ||w_r||_2 = 1). The projections h⊥ and t⊥ are given by:

h⊥ = h − w_r^T h w_r,  t⊥ = t − w_r^T t w_r.
Afterwards, h⊥ and t⊥ are connected by the relation-specific translation vector d_r. Similar to TransE, a small score is expected when (h, r, t) holds. The score function is formulated as follows:

f_r(h, t) = ||h⊥ + d_r − t⊥||_2^2.

Here, ||·||_2^2 is the squared Euclidean distance. By utilizing relation-specific hyperplanes, TransH can project an entity to different feature vectors depending on different relations, and thereby mitigates the issue of complex relations.
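The hyperplane projection can be sketched in a few lines (a toy illustration, assuming w_r is normalized to unit length; the function names are ours):

```python
import numpy as np

def project_to_hyperplane(e, w_r):
    """Project entity vector e onto the hyperplane with unit normal w_r:
    e_perp = e - (w_r . e) w_r."""
    w = w_r / np.linalg.norm(w_r)   # TransH constrains w_r to unit norm
    return e - (w @ e) * w

def transh_score(h, t, w_r, d_r):
    """Squared Euclidean distance between the translated projections."""
    h_p = project_to_hyperplane(h, w_r)
    t_p = project_to_hyperplane(t, w_r)
    return np.sum((h_p + d_r - t_p) ** 2)

h = np.array([1.0, 2.0, 3.0])
w_r = np.array([0.0, 0.0, 1.0])
h_perp = project_to_hyperplane(h, w_r)
print(h_perp)  # [1. 2. 0.] -- the component along the normal is removed
```

The projected vector is orthogonal to w_r, which is what lets one entity take different effective positions under different relations.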
Following this idea, TransR [52] extended the original TransH algorithm. Although TransH enables each entity to obtain a different representation corresponding to each of its relations, the entities and relations in this model are still represented in the same feature space R^d. In fact, an entity may carry various semantic meanings, and different relations may focus on different aspects of entities; therefore, embedding entities and relations in the same semantic space may make the model insufficient for graph embedding.
TransR expands the concept of relation-specific hyperplanes proposed by TransH to relation-specific spaces. In TransR, for each triplet (h, r, t), entities are embedded as h and t in an entity vector space R^d, and the relation is represented as a translation vector r in a relation-specific space R^k. As illustrated in Figure 2c, TransR projects h and t from the entity space into the relation space. This operation can render those entities (denoted as colored triangles) that are similar to the head or tail entities (denoted as colored circles) in the entity space distinctly separated in the relation space.
More specifically, for each relation r, TransR defines a projection matrix M_r ∈ R^{k×d} to transform the entity vectors into the relation-specific space. The projected entity vectors are

h⊥ = M_r h,  t⊥ = M_r t,

and the scoring function is similar to that of TransH:

f_r(h, t) = ||h⊥ + r − t⊥||_2^2.

Compared with TransE and TransH, TransR has made significant progress in performance. However, it also has some deficiencies: (i) for a relation r, the head and tail entities share the same projection matrix M_r, whereas it is intuitive that the types or attributes of head and tail entities may be essentially different; for instance, in the triplet (Elon_Musk, Founder_of, SpaceX), Elon_Musk is a person and SpaceX is a company, two different types of entities; (ii) the projection from the entity space to the relation-specific space is an interactive process between entities and relations, so it cannot capture integrated information when the projection matrix is generated from relations alone; (iii) owing to the use of the projection matrix, TransR requires a large amount of computing resources; its memory complexity is O(N_e d + N_r dk), compared with O(N_e d + N_r k) for TransE and TransH.
To eliminate the above drawbacks, an improved method, TransD [22], was proposed. It optimizes TransR by using two vectors for each entity-relation pair to construct dynamic mapping matrices that substitute for the projection matrix in TransR; an illustration is given in Figure 2d. Specifically, given a triplet (h, r, t), each entity and relation is represented by two embedding vectors. The first vector represents the meaning of the entity/relation, denoted as h, t ∈ R^d and r ∈ R^k, and the second vector (defined as h_p, t_p ∈ R^d and r_p ∈ R^k) is used to form two dynamic projection matrices M_rh, M_rt ∈ R^{k×d}. These two matrices are calculated as:

M_rh = r_p h_p^T + I^{k×d},  M_rt = r_p t_p^T + I^{k×d},

where I^{k×d} is an identity matrix. Therefore, the projection matrices involve both the entities and the relation, and the projected vectors of h and t are defined as:

h⊥ = M_rh h,  t⊥ = M_rt t.

Finally, the score function is the same as that of TransR in Equation (9). By constructing the dynamic mapping matrices from two projection vectors, TransD effectively reduces the memory complexity to O(N_e d + N_r k).
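The construction of the dynamic mapping matrices can be sketched as follows (toy dimensions d = 4, k = 3; `mapping_matrix` is an illustrative name of our own):

```python
import numpy as np

d, k = 4, 3  # entity and relation embedding dimensions

def mapping_matrix(r_p, e_p):
    """Dynamic projection matrix M = r_p e_p^T + I, built from two vectors."""
    return np.outer(r_p, e_p) + np.eye(k, d)   # k x d

rng = np.random.default_rng(0)
h, h_p = rng.normal(size=d), rng.normal(size=d)
t, t_p = rng.normal(size=d), rng.normal(size=d)
r_p = rng.normal(size=k)

M_rh = mapping_matrix(r_p, h_p)       # specific to the (r, h) pair
M_rt = mapping_matrix(r_p, t_p)       # specific to the (r, t) pair
h_perp, t_perp = M_rh @ h, M_rt @ t   # projected vectors in R^k
print(M_rh.shape, h_perp.shape)       # (3, 4) (3,)
```

Only the vectors are stored per entity and relation; the matrices are formed on the fly, which is why the memory cost stays at O(N_e d + N_r k).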
All the methods described above, including TransE, TransH, TransR, and TransD, ignore two properties of existing KGs: heterogeneity (some relations have many connections with entities while others have few), which causes underfitting on complex relations or overfitting on simple relations, and imbalance (for a relation, the numbers of head and tail entities may differ greatly), which indicates that the model should treat head and tail entities differently. TranSparse [24] overcomes the heterogeneity and imbalance by providing two model versions: TranSparse (share) and TranSparse (separate).
TranSparse (share) leverages adaptive sparse matrices M_r(θ_r) to replace the dense projection matrices for each relation r. The sparsity degree θ_r is linked to the number of entity pairs connected by relation r; it is defined as follows:

θ_r = 1 − (1 − θ_min) N_r / N_r*,

where θ_min (0 ≤ θ_min ≤ 1) is a hyperparameter, N_r denotes the number of entity pairs connected by the relation, and N_r* represents the maximum of these counts. The projected vectors are then formed by:

h⊥ = M_r(θ_r) h,  t⊥ = M_r(θ_r) t.

TranSparse (separate) employs two separate sparse mapping matrices, M_rh(θ_rh) and M_rt(θ_rt), for each relation, where M_rh(θ_rh) projects the head entities and the other projects the tail entities. The sparsity degrees and projected vectors are extended as follows:

θ_rl = 1 − (1 − θ_min) N_rl / N_r*l* (l ∈ {h, t}),  h⊥ = M_rh(θ_rh) h,  t⊥ = M_rt(θ_rt) t.

A simpler version of this method, called STransE, was proposed by Nguyen et al. [53]. In that approach, the sparse projection matrices M_rh(θ_rh) and M_rt(θ_rt) are replaced by dense mapping matrices M_rh and M_rt, so that the projected vectors become:

h⊥ = M_rh h,  t⊥ = M_rt t.

The methods introduced so far merely modify the definition of the projection vectors or matrices and do not consider other aspects of TransE that could be optimized. TransA [54] boosts the performance of the embedding model from another viewpoint by modifying the distance measure of the score function. It introduces an adaptive Mahalanobis distance as a better indicator to replace the traditional Euclidean distance, because the Mahalanobis distance shows better adaptability and flexibility [55]. Given a triplet (h, r, t) and a symmetric non-negative weight matrix M_r associated with the relation r, the score function of TransA is formulated as:

f_r(h, t) = (|h + r − t|)^T M_r (|h + r − t|).

As shown in Figure 2e, the two arrows represent the same relation HasPart, and (Room, HasPart, Wall) and (Sleeping, HasPart, Dreaming) are true facts. If we use the isotropic Euclidean distance, which traditional models utilize to distinguish the authenticity of a triplet, it could yield erroneous triplets such as (Room, HasPart, Goniff).
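The sparsity schedule above can be checked numerically (our own helper, assuming the definition θ_r = 1 − (1 − θ_min) N_r / N_r*):

```python
def sparse_degree(n_r, n_r_max, theta_min=0.0):
    """Fraction of zero entries in the projection matrix of a relation that
    connects n_r entity pairs, where n_r_max is the largest such count."""
    return 1.0 - (1.0 - theta_min) * n_r / n_r_max

# The most frequent relation gets the densest matrix; rare relations get
# sparser matrices, so their few training pairs fit fewer parameters.
print(sparse_degree(1000, 1000))  # 0.0  (fully dense)
print(sparse_degree(10, 1000))    # 0.99 (almost entirely sparse)
```

This is exactly the mechanism that addresses heterogeneity: the effective capacity of each projection matrix scales with the amount of training data its relation sees.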
Fortunately, TransA is capable of discovering the true triplet by introducing the adaptive Mahalanobis distance, because the true one has shorter distances along the x or y directions.
The above methods embed entities and relations into a real-valued point space. KG2E [56] proposed a novel approach that introduces uncertainty to construct a probabilistic knowledge graph embedding model. KG2E takes advantage of multi-dimensional Gaussian distributions to embed entities and relations: each entity and relation is represented by a Gaussian distribution, in which the mean gives the position of the entity or relation in the semantic feature space and the covariance signifies its uncertainty, i.e.,

h ∼ N(μ_h, Σ_h),  r ∼ N(μ_r, Σ_r),  t ∼ N(μ_t, Σ_t).

Here, μ_h, μ_r, μ_t ∈ R^d are the mean vectors of h, r, and t, respectively, and Σ_h, Σ_r, Σ_t ∈ R^{d×d} indicate the covariance matrices corresponding to the entities and relations. KG2E also utilizes the distance between h − t and r as the metric to distinguish the authenticity of triplets, where the traditional h − t is transformed into the distribution N(μ_h − μ_t, Σ_h + Σ_t). This model employs two approaches to estimate the similarity between two probability distributions: Kullback-Leibler divergence [57] and expected likelihood [58]. Figure 2f displays an illustration of KG2E. Each entity is represented as a circle without an underline and each relation as a circle with an underline. Circles with the same color indicate an observed triplet, where the head entity of all triplets is Hillary Clinton. The area of a circle denotes the uncertainty of the entity or relation. As we can see in Figure 2f, there are three triplets, and the uncertainty of the relation "spouse" is lower than that of the others.
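For diagonal covariances, the KL-divergence criterion has a simple closed form; the sketch below (our own code, with illustrative values) scores a triplet by comparing the distribution of h − t with that of r, as described above:

```python
import numpy as np

def kl_gaussian(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ) in closed form."""
    d = len(mu0)
    return 0.5 * (np.sum(var0 / var1)
                  + np.sum((mu1 - mu0) ** 2 / var1)
                  - d
                  + np.sum(np.log(var1) - np.log(var0)))

mu_h, var_h = np.array([0.5, 0.1]), np.array([0.2, 0.2])
mu_t, var_t = np.array([0.2, 0.0]), np.array([0.1, 0.1])
mu_r, var_r = np.array([0.3, 0.1]), np.array([0.4, 0.4])
# Compare N(mu_h - mu_t, var_h + var_t) against N(mu_r, var_r):
score = kl_gaussian(mu_h - mu_t, var_h + var_t, mu_r, var_r)
print(score)  # lower divergence -> more plausible triplet
```

The covariances add (rather than subtract) when forming h − t because the variance of a difference of independent Gaussians is the sum of their variances.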
TransG [59] addresses the new situation of multiple relation semantics, that is, when a relation is associated with different entity pairs, it may carry multiple meanings. It also uses Gaussian distributions to embed entities, but it differs significantly from KG2E because it leverages a Gaussian mixture model to represent relations:

r = Σ_{m=1}^{M_r} π_{r,m} μ_{r,m},  μ_{r,m} ∼ N(μ_t − μ_h, (σ_h^2 + σ_t^2) I),

where π_{r,m} is the weight of the m-th distinct semantic component and I indicates an identity matrix. As shown in Figure 2g, dots are correct entities related to the relation "Has part" and triangles represent corrupted entities.
In the traditional models (left), the corrupted entities cannot be distinguished from the correct ones because all semantics are confounded within the single representation of the relation "Has part." In contrast, TransG (right) can identify the incorrect entities by utilizing multiple semantic components. In summary, TransE only has the ability to handle simple relations and is incompetent for complex ones. The extensions of TransE, including TransH, TransR, TransD, TransA, TransG, and so forth, offer thoughtful and insightful models that address the issue of complex relations. Extensive experiments on public benchmark datasets generated from WordNet [60] and Freebase show that these modified models achieve significant improvements over the baseline and verify the feasibility and validity of these methods. A comparison of these models in terms of their scoring functions and memory complexity is shown in Table 2. Table 2. Comparison of translation-based models in terms of scoring functions and memory complexity. In KG2E, μ = μ_h − μ_r − μ_t and Σ = Σ_h + Σ_r + Σ_t; m indicates the number of semantic components for each relation in TransG.


Tensor Factorization-Based Models
Tensor factorization is another category of effective methods for KG representation learning. The core idea of these methods is described as follows. First, the triplet facts in the KG are transformed into a 3D binary tensor X. As illustrated in Figure 3, given a tensor X ∈ R^{n×n×m}, where n and m denote the numbers of entities and relations, respectively, each slice X_k (k = 1, 2, ..., m) corresponds to a relation type R_k, and X_ijk = 1 indicates that the triplet (i-th entity, k-th relation, j-th entity) exists in the graph; otherwise, X_ijk = 0 indicates a non-existent or unknown triplet. After that, the embedding matrices associated with the embedding vectors of the entities and relations are calculated to represent X as factors of a tensor. Finally, the low-dimensional representation of each entity and relation is generated from these embedding matrices. RESCAL [61] applies a tensor to express the inherent structure of a KG and uses rank-d factorization to obtain its latent semantics. The principle that this method follows is formulated as:

X_k ≈ A R_k A^T, k = 1, 2, ..., m,  (21)

where A ∈ R^{n×d} is a matrix that captures the latent semantic representations of entities and R_k ∈ R^{d×d} is a matrix that models the pairwise interactions in the k-th relation. According to this principle, the scoring function f_r(h, t) for a triplet (h, r, t) is defined as:

f_r(h, t) = h^T M_r t.

Here, h, t ∈ R^d are the embedding vectors of the entities in the graph and the matrix M_r ∈ R^{d×d} represents the latent semantics of the relation. It is worth noting that h and t, the embedding vectors of the i-th and j-th entities, are actually given by the values of the i-th and j-th rows of matrix A in Equation (21). A more complex version of f_r(h, t) was proposed by García-Durán et al. [62], which extends RESCAL by introducing well-controlled two-way interactions into the scoring function.
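The tensor construction described above can be illustrated with the toy triplets used earlier in this section (a sketch, not tied to any particular library):

```python
import numpy as np

entities = ["Elon_Musk", "SpaceX", "Tesla"]
relations = ["Founder_of"]
triplets = [("Elon_Musk", "Founder_of", "SpaceX"),
            ("Elon_Musk", "Founder_of", "Tesla")]

e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: k for k, r in enumerate(relations)}

n, m = len(entities), len(relations)
X = np.zeros((n, n, m))   # X[i, j, k] = 1 iff (entity_i, relation_k, entity_j) is observed
for h, r, t in triplets:
    X[e_idx[h], e_idx[t], r_idx[r]] = 1.0

print(X[:, :, 0])   # the binary slice X_k for the relation "Founder_of"
```

Factorization methods then approximate each such slice with low-rank matrices rather than storing the sparse tensor explicitly.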
However, this method requires O(N_e d + N_r k^2) (d = k) parameters. To reduce the computational complexity of RESCAL, DistMult [63] restricts M_r to be a diagonal matrix, i.e., M_r = diag(r), r ∈ R^d. The scoring function becomes:

f_r(h, t) = h^T diag(r) t.

DistMult not only reduces the complexity to O(N_e d + N_r k) (d = k); the experimental results also indicate that this simple formulation achieves a remarkable improvement over the others.
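The relationship between the two scoring functions can be checked directly (a toy sketch with our own function names):

```python
import numpy as np

def rescal_score(h, M_r, t):
    """RESCAL: bilinear form h^T M_r t with a full d x d relation matrix."""
    return h @ M_r @ t

def distmult_score(h, r, t):
    """DistMult restricts M_r to diag(r), so the score reduces to sum(h * r * t)."""
    return np.sum(h * r * t)

h = np.array([1.0, 2.0]); t = np.array([0.5, -1.0]); r = np.array([0.3, 0.7])
# DistMult is RESCAL with a diagonal relation matrix ...
assert np.isclose(rescal_score(h, np.diag(r), t), distmult_score(h, r, t))
# ... and is therefore symmetric in h and t for every relation:
assert np.isclose(distmult_score(h, r, t), distmult_score(t, r, h))
```

The forced symmetry demonstrated in the last line is exactly the limitation that the models discussed next set out to remove.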
HolE [64] introduced a circular correlation operation [65], denoted as ⋆ : R^d × R^d → R^d, between head and tail entities to capture compositional representations of entity pairs. The operation is formulated as follows:

[h ⋆ t]_i = Σ_{k=0}^{d−1} h_k t_{(k+i) mod d}.

The circular correlation operation has the significant advantage of reducing the complexity of the composite representation compared to the tensor product. Moreover, its computation can be accelerated via:

h ⋆ t = F^{−1}( conj(F(h)) ⊙ F(t) ),

where F(·) indicates the fast Fourier transform (FFT) [66], F^{−1}(·) denotes its inverse, and conj(a) denotes the complex conjugate of a. Thus, the scoring function of HolE is defined as:

f_r(h, t) = r^T (h ⋆ t).

It is noteworthy that circular correlation is not commutative, i.e., h ⋆ t ≠ t ⋆ h; this property is indispensable for modeling asymmetric relations in the semantic representation space.
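Both the direct definition and the FFT acceleration can be verified in a few lines (our own sketch):

```python
import numpy as np

def circular_correlation(a, b):
    """[a * b]_i = sum_k a_k b_{(k+i) mod d}, computed directly in O(d^2)."""
    d = len(a)
    return np.array([sum(a[k] * b[(k + i) % d] for k in range(d))
                     for i in range(d)])

def circular_correlation_fft(a, b):
    """Same operation in O(d log d) via the identity F^-1( conj(F(a)) . F(b) )."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

a = np.array([1.0, 2.0, 3.0]); b = np.array([0.5, -1.0, 2.0])
assert np.allclose(circular_correlation(a, b), circular_correlation_fft(a, b))
# Circular correlation is not commutative, which is what lets HolE
# assign different scores to (h, r, t) and (t, r, h):
print(circular_correlation(a, b), circular_correlation(b, a))
```

Note that the first component (the plain dot product) is always shared; the asymmetry appears in the remaining components.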
The original DistMult model is symmetric in the head and tail entities for every relation; ComplEx [67] leverages complex-valued embeddings to extend DistMult to asymmetric relations. The embeddings of entities and relations lie in the complex space C^d instead of the real space R^d in which DistMult embeds them. The scoring function is modified to:

f_r(h, t) = Re( h^T diag(r) conj(t) ),

where Re(·) denotes the real part of a complex value and conj(t) represents the complex conjugate of t. With this scoring function, triplets with asymmetric relations obtain different scores depending on the order of the entities. To settle the independence issue of entity embedding vectors in Canonical Polyadic (CP) decomposition, SimplE [68] proposes a simple enhancement of CP that introduces the reverse of each relation and computes the average CP score of (h, r, t) and (t, r^{-1}, h):

f_r(h, t) = (1/2) ( Σ(h • r • t) + Σ(t • r^{-1} • h) ),

where • is the element-wise multiplication and r^{-1} indicates the embedding vector of the reverse relation. Inspired by Euler's identity e^{iθ} = cos θ + i sin θ, RotatE [69] introduces a rotational Hadamard product; it regards the relation as a rotation from the head entity to the tail entity in complex space. The score function is defined as follows:

f_r(h, t) = ||h ◦ r − t||, with |r_i| = 1 for each component.

QuatE [70] extends the complex space into a four-dimensional hypercomplex space, h, t, r ∈ H^d, and utilizes the Hamilton product to capture latent inter-dependencies between entities and relations:

f_r(h, t) = (h ⊗ r') · t,

where ⊗ denotes the Hamilton product and r' is the normalized relation quaternion. Table 3 summarizes the scoring functions and memory complexities of all tensor factorization-based models. Table 3. Comparison of tensor factorization-based models in terms of scoring functions and memory complexity.
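The ComplEx and RotatE scores can be sketched with NumPy's complex arithmetic (toy values; `complex_score` and `rotate_distance` are our own names):

```python
import numpy as np

def complex_score(h, r, t):
    """ComplEx: Re( sum_i h_i r_i conj(t_i) ) with complex-valued embeddings."""
    return np.real(np.sum(h * r * np.conj(t)))

def rotate_distance(h, r_phase, t):
    """RotatE: the relation acts as an element-wise rotation e^{i theta};
    a lower distance ||h o r - t|| means a more plausible triplet."""
    r = np.exp(1j * r_phase)   # unit modulus |r_i| = 1 by construction
    return np.linalg.norm(h * r - t)

h = np.array([1 + 1j, 0.5 - 0.2j])
t = np.array([0.3 + 0.4j, -1 + 0j])
r = np.array([0.2 + 0.9j, 0.7 - 0.1j])
# Unlike DistMult, ComplEx can score (h, r, t) and (t, r, h) differently:
print(complex_score(h, r, t), complex_score(t, r, h))

# A rotation that carries h exactly onto the tail gives distance zero:
theta = np.angle(t) - np.angle(h)
t_rotated = h * np.exp(1j * theta)
assert rotate_distance(h, theta, t_rotated) < 1e-12
```

Parameterizing the relation by phases alone is what enforces RotatE's unit-modulus constraint without any explicit projection step.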

Neural Network-Based Models
Deep learning is a very popular and extensively used tool in many different fields [71][72][73][74]. Deep models have strong representation and generalization capabilities and can express complicated nonlinear projections. In recent years, embedding a knowledge graph into a continuous feature space with a neural network has become a hot topic. SME [21] defines a semantic matching energy function that measures the confidence of each observed fact (h, r, t) using neural networks. As shown in Figure 4a, in SME, each entity and relation is first embedded into its feature space. Then, to capture the intrinsic connections between entities and relations, two projection matrices are applied. Finally, the semantic matching energy associated with each triplet is computed by a fully connected layer. More specifically, given a triplet (h, r, t) in Figure 4a, SME maps entities and relations to the semantic feature space through the input layer. The head entity vector h is then combined with the relation vector r to acquire latent connections, i.e., g_left(h, r). Likewise, g_right(t, r) is generated from the tail entity and relation vectors t and r. Bordes et al. provided two types of g(·) functions, so there are two versions of SME. SME (linear) is formulated in Equation (31):

g_left(h, r) = M_1 h + M_2 r + b.

SME (bilinear) is formulated in Equation (32):

g_left(h, r) = (M_1 h) • (M_2 r) + b,

where M_1, M_2 ∈ R^{d×d} are weight matrices, • is the element-wise product, and b denotes the bias vector; g_right(t, r) is defined analogously. Finally, g_left(h, r) and g_right(t, r) are combined to obtain the energy score f_r(h, t) via a fully connected layer.
NTN [20] proposes a neural tensor network to calculate the energy score f_r(h, t). It replaces the standard linear layer of a traditional neural network with a bilinear tensor layer. As shown in Figure 4b, given an observed triplet (h, r, t), the first layer embeds the entities into their feature space. The second, nonlinear hidden layer takes three inputs: the head entity vector h, the tail entity vector t, and a relation-specific tensor T_r ∈ R^{d×d×k}. The entity vectors h and t are also mapped to a higher-level representation via the projection matrices M_r1, M_r2 ∈ R^{k×d}, respectively. These elements are then fed into the nonlinear layer to combine the semantic information. Finally, the energy score is obtained through a relation-specific linear output layer u_r. The score function is defined as follows:

f_r(h, t) = u_r^T f( h^T T_r t + M_r1 h + M_r2 t + b_r ),

where f(x) = tanh x indicates the activation function and b_r ∈ R^k denotes a bias belonging to a standard neural network layer. Meanwhile, a simpler version of this model, called the single layer model (SLM), is proposed in the same paper, as shown in Figure 4c. It is a special case of NTN in which T_r = 0. The scoring function simplifies to:

f_r(h, t) = u_r^T f( M_r1 h + M_r2 t + b_r ).

NTN requires a relation-specific tensor T_r for each relation, so the number of parameters in this model is huge, making it impractical for large-scale KGs. MLP [75] provides a lightweight architecture in which all relations share the same parameters. The entities and relation in a triplet fact (h, r, t) are synchronously projected into the embedding space in the input layer and combined into a higher representation that scores the plausibility through a nonlinear hidden layer.
The scoring function f_r(h, t) is formulated as follows:

f_r(h, t) = m^T f( M_1 h + M_2 r + M_3 t ).

Here, f(x) = tanh x is the activation function, M_1, M_2, M_3 ∈ R^{d×d} represent the mapping matrices that project the embedding vectors h, r, and t to the second layer, and m ∈ R^d holds the second-layer parameters.
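This shared-parameter architecture can be sketched with random toy weights (our own names; not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# M1, M2, M3 and m are shared by ALL relations -- the "lightweight" design of MLP.
M1 = rng.normal(size=(d, d))
M2 = rng.normal(size=(d, d))
M3 = rng.normal(size=(d, d))
m = rng.normal(size=d)

def mlp_score(h, r, t):
    """f_r(h, t) = m^T tanh(M1 h + M2 r + M3 t): one shared nonlinear hidden layer."""
    return m @ np.tanh(M1 @ h + M2 @ r + M3 @ t)

h, r, t = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
print(mlp_score(h, r, t))
```

Because tanh is bounded, the score magnitude can never exceed the L1 norm of m, regardless of the input embeddings.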
Since the emergence of deep learning technology, more and more studies have applied deep neural networks (DNNs) [71,76,77]. NAM [78] establishes a deep neural network structure to represent a knowledge graph. As illustrated in Figure 4d, the concatenation of the head entity vector h and the relation vector r is passed through L hidden layers:

z^ℓ = σ(M^ℓ z^{ℓ−1} + b^ℓ), ℓ = 1, ..., L,

where M^ℓ is the weight matrix and b^ℓ is the bias for layer ℓ. Finally, a score is calculated by combining the output of the last hidden layer with the tail entity vector t:

f_r(h, t) = σ(t^T z^L),

where σ(·) is the sigmoid activation function. A more complicated model, called relation-modulated neural networks (RMNN), is proposed in the same paper; Figure 4e shows an illustration of this model. Compared with NAM, it feeds a knowledge-specific connection (i.e., the relation embedding r) into every hidden layer of the network. The layers are defined as follows:

z^ℓ = σ(M^ℓ z^{ℓ−1} + B^ℓ r), ℓ = 1, ..., L,

where M^ℓ and B^ℓ denote the weight and bias matrices for layer ℓ, respectively. After the feed-forward process, RMNN yields a final score from the last hidden layer's output and the concatenation of the tail entity vector t and relation vector r:

f_r(h, t) = σ([t; r]^T z^L).

ConvKB [80] captures latent semantic information in the triplets by introducing a convolutional neural network (CNN) for knowledge graph embedding. In this model, each triplet fact (h, r, t) is represented as a three-row matrix in which each element is transformed into a row vector. The matrix is fed to a convolution layer to yield multiple feature maps. These feature maps are then concatenated and projected to a score that estimates the authenticity of the triplet via a dot product with a weight vector.
More specifically, as illustrated in Figure 4f, the number of filters is γ = 3. First, the embedding vectors h, r, and t ∈ R^d are stacked as a matrix A = [h; r; t] ∈ R^{3×d}, and A_{:,i} ∈ R^{3×1} denotes the i-th column of A. A filter m ∈ R^{3×1} is then slid over the columns of the input matrix to explore local features and obtain a feature map a = [a_1, a_2, ..., a_d] ∈ R^d, such that:

a_i = g(m^T A_{:,i} + b),

where g(·) is the ReLU activation function and b is a bias term. In this instance, there are three feature maps corresponding to the three filters. Finally, these feature maps are concatenated into a representation vector in R^{3d} and combined with a weight vector w ∈ R^{3d} by a dot product operation. The scoring function is defined as follows:

f_r(h, t) = C(g(A ∗ Ω)) · w.

Here, Ω is the set of filters, A ∗ Ω denotes applying a convolution operation to the matrix A with the filters in Ω, and C is the concatenation operator. It is worth mentioning that Ω and w are shared parameters, generalized across all entities and relations.
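The convolution over the 3×d triplet matrix can be sketched with plain matrix operations, since each 3×1 filter applied column-wise is just a matrix product. A toy numpy sketch with assumed random parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_filters = 5, 3
h, r, t = (rng.normal(size=d) for _ in range(3))

A = np.stack([h, r, t])                     # 3 x d input matrix, rows h, r, t
filters = rng.normal(size=(n_filters, 3))   # each row is one 3x1 filter
bias = rng.normal(size=n_filters)
w = rng.normal(size=n_filters * d)          # shared weight vector in R^{3d}

def convkb_score(A):
    # slide each filter across the d columns, apply ReLU, concatenate, dot with w
    feature_maps = np.maximum(0.0, filters @ A + bias[:, None])  # n_filters x d
    return feature_maps.reshape(-1) @ w

score = convkb_score(A)
```

Because the filters span all three rows at once, each feature-map entry mixes the i-th components of h, r, and t, which is the "local feature" the model exploits.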
In recent years, graph neural networks (GNNs) have attracted more attention due to their great ability to represent graph structure. R-GCN [81] is an improved model that applies relation-specific transformations to represent knowledge graphs. The forward propagation is formulated as follows:

h_i^{(l+1)} = σ( Σ_{r∈R} Σ_{j∈N_i^r} (1/c_{i,r}) W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} ),

where N_i^r denotes the neighbors of entity i under relation r, W_r^{(l)} and W_0^{(l)} are the weight matrices, and c_{i,r} is a normalization constant such as c_{i,r} = |N_i^r|. Inspired by the burgeoning generative adversarial networks (GANs), Cai et al. [82] proposed a generative adversarial learning framework, named KBGAN, to improve the performance of existing knowledge graph representation models. KBGAN's innovative idea is to use one KG embedding model as the generator to obtain plausible negative samples, and to train the discriminator, which is the embedding model we actually desire, on the positive samples together with the generated negatives.
A simple overview of the framework is shown in Figure 5. A ground-truth triplet (Microsoft, LocatedIn, Redmond) is corrupted by discarding its tail entity, giving (Microsoft, LocatedIn, ?). The corrupted triplet is fed into the generator (G), which produces a probability distribution over candidate negative triplets. The triplet with the highest probability, (Microsoft, LocatedIn, San Francisco), is then sampled as the generator's output. The discriminator (D) takes the generated negative triplet and the original true triplet as input to train the model, and computes a score d that indicates the plausibility of each triplet. The two dotted lines in the figure denote the error feedback in the generator and discriminator. One more point to note is that any of the triplet fact-based representation learning models mentioned above can serve as the generator or discriminator in the KBGAN framework to improve the embedding performance. Table 4 illustrates the scoring function and memory complexity of each neural network-based model. Table 4. Comparison of neural network-based models in terms of scoring functions and memory complexity.

Model | Scoring Function f_r(h, t) | Memory Complexity
The above three types of triplet fact-based models focus on modifying the scoring function f_r(h, t); their remaining training procedures are roughly uniform. The detailed optimization process is illustrated in Algorithm 1. First, all entity and relation embedding vectors are randomly initialized. In each iteration, a subset of triplets S_batch is sampled from the training set S and fed into the model as a minibatch. For each triplet in S_batch, a corrupted triplet is generated by replacing the head or tail entity with another entity. The original triplets and all generated corrupted triplets are then combined into a batch training set T_batch. Finally, the parameters are updated using some optimization method.
Here, the set of corrupted triplets S' is generated according to

S'(h, r, t) = {(h', r, t) | h' ∈ E} ∪ {(h, r, t') | t' ∈ E},

and there are two alternative versions of the loss function. The margin-based loss function is defined as follows:

L = Σ_{(h,r,t)∈S} Σ_{(h',r,t')∈S'} [γ + f_r(h, t) − f_r(h', t')]_+, (44)

and the logistic loss function is:

L = Σ_{(h,r,t)} log(1 + exp(−y_hrt · f_r(h, t))),

where γ > 0 is a margin hyperparameter, [x]_+ = max(0, x), and y_hrt ∈ {−1, 1} is a label indicating whether a given training triplet (h, r, t) is positive or negative. Both losses can be minimized with stochastic gradient descent (SGD) [83] or Adam [84]. The core loop of Algorithm 1 proceeds as follows:

4: T_batch ← ∅ // initialize the set of pairs of triplets
5: for (h, r, t) ∈ S_batch do
6:   (h', r, t') ← sample(S') // sample a corrupted triplet
7:   T_batch ← T_batch ∪ {((h, r, t), (h', r, t'))}
8: end for
9: Update embeddings by minimizing the loss function
10: end loop
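The generic training loop, negative sampling, and margin-based loss can be made concrete with a tiny TransE-style sketch. This is an illustrative toy with assumed sizes and learning rate, not a faithful reimplementation of any cited system; a real implementation would also filter out corrupted triplets that happen to be true:

```python
import numpy as np

rng = np.random.default_rng(2)
n_ent, n_rel, d = 5, 2, 8
E = rng.normal(scale=0.1, size=(n_ent, d))   # entity embeddings
R = rng.normal(scale=0.1, size=(n_rel, d))   # relation embeddings
train = [(0, 0, 1), (1, 1, 2), (3, 0, 4)]    # toy (h, r, t) index triplets
gamma, lr = 1.0, 0.05

def f(h, r, t):
    """TransE energy ||h + r - t||: lower means more plausible."""
    return np.linalg.norm(E[h] + R[r] - E[t])

for _ in range(200):
    for (h, r, t) in train:
        # corrupt head or tail with a random entity (negative sampling)
        if rng.random() < 0.5:
            neg = (int(rng.integers(n_ent)), r, t)
        else:
            neg = (h, r, int(rng.integers(n_ent)))
        if gamma + f(h, r, t) - f(*neg) > 0:  # margin violated: take an SGD step
            d_pos = E[h] + R[r] - E[t]
            g = d_pos / (np.linalg.norm(d_pos) + 1e-9)
            E[h] -= lr * g; R[r] -= lr * g; E[t] += lr * g  # pull positive closer
            d_neg = E[neg[0]] + R[neg[1]] - E[neg[2]]
            g = d_neg / (np.linalg.norm(d_neg) + 1e-9)
            E[neg[0]] += lr * g; R[neg[1]] += lr * g; E[neg[2]] -= lr * g  # push negative apart
```

The hinge `[γ + f_pos − f_neg]_+` stops contributing once a negative is already at least γ worse than its positive, which is what keeps the unconstrained embeddings from growing without bound.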

Description-Based Representation Learning Models
The models introduced above embed knowledge graphs into a specific feature space based only on triplets (h, r, t). In fact, a large amount of additional information associated with knowledge graphs can be absorbed to further refine an embedding model's performance, such as textual description information and relation path information.

Textual Description-Based Models
Textual information falls into two categories: entity type information and descriptive information. The entity type specifies the semantic class to which an entity belongs; for example, Elon Musk belongs to the Person category and Tesla to the Company category. An entity description is a paragraph that describes the attributes and characteristics of an entity, for example, "Tesla, Inc. is an American electric vehicle and clean energy company with headquarters in Palo Alto, California".
The textual description-based model is an extension of the traditional triplet-based model that integrates additional text information to improve its performance. We introduce these models in the same order as before: extensions of translation-based, tensor factorization-based, and neural network-based methods.
The extensions of translation-based methods. Xie et al. [85] proposed a novel type-embodied knowledge representation learning method (TKRL) that exploits hierarchical entity types.
It suggests that an entity may have multiple hierarchical types, and that different hierarchical types should correspond to different type-specific projection matrices.
In TKRL, for each fact (h, r, t), the entity embedding vectors h and t are first projected using type-specific projection matrices. Let h⊥ and t⊥ denote the projected vectors:

h⊥ = M_rh h, t⊥ = M_rt t,

where M_rh and M_rt indicate the projection matrices related to h and t. With a translation r between the two mapped entities, the scoring function takes the general form of translation-based methods:

f_r(h, t) = ||h⊥ + r − t⊥||.

To capture multi-category semantic information in entities, the projection matrix M_rh (M_rt is built analogously) is generated as the weighted sum of all possible type matrices:

M_rh = Σ_{i=1}^{n} α_i M_{c_i} / Σ_{i=1}^{n} α_i, c_i ∈ C_rh,

where n is the number of types to which the head entity belongs, c_i indicates the i-th type, M_{c_i} represents the projection matrix of c_i, α_i signifies the corresponding weight, and C_rh denotes the collection of types that the head entity can take under a given relation r. To further mine the latent information stored in hierarchical categories, the matrix M_{c_i} is built by one of two operations, recursive multiplication or weighted summation over the levels of the hierarchy:

M_{c_i} = Π_{j=1}^{m} M_{c_i^{(j)}} or M_{c_i} = Σ_{j=1}^{m} β_j M_{c_i^{(j)}},

where m is the number of subcategories under the parent category in the hierarchical structure of c_i. TKRL introduces entity types to enhance embedding models; the other aspect of textual information is entity descriptions. Wang et al. [86] proposed a text-enhanced knowledge graph embedding model, named TEKE, which is also an extension of translation-based models.
Given a knowledge graph, TEKE first builds an entity description text corpus by using an entity linking tool, i.e., AIDA [87], to annotate all entities, and then constructs a co-occurrence network that records the co-occurrence frequencies between entities and words. The paper argues that the rich textual information of adjacent words can effectively represent a knowledge graph. Therefore, given an entity e, the model defines its valid textual context n(e) as its neighbors in this network, and the pairwise textual context n(h, t) = n(h) ∩ n(t) as the common neighbors of the head and tail entity for each relation r.
Then, the pointwise textual context embedding vector n(e) is obtained through a word embedding toolkit, and the pairwise textual context vector n(h, t) is obtained in a similar way. Finally, these feature vectors, which capture the textual information, are folded into the entity and relation representations, e.g., for TransE:

ĥ = n(h)A + h, r̂ = n(h, t)B + r, t̂ = n(t)A + t, (50)

where A and B denote the weight matrices and h, r, and t act as the bias vectors. The score function f_r(h, t) is formulated as follows:

f_r(h, t) = ||ĥ + r̂ − t̂||.

The extensions of tensor factorization-based methods. Apart from taking advantage of triplet information, Krompaß et al. [88] proposed an improved representation learning model for tensor factorization-based methods. It integrates prior type-constraint knowledge with the triplet facts of original models, such as RESCAL, and achieves impressive performance in link prediction tasks.
Entities in large-scale KGs generally have one or more predefined types. The types of the head and tail entities for a specific relation are constrained, which is referred to as type-constraints. For instance, the relation MarryTo is only reasonable when both the head and tail entities belong to Person. In this model, head_k is an indicator vector denoting the head entity types that satisfy the type-constraints of relation k; likewise, tail_k is an indicator vector for the tail entity constraints of relation k.
The main difference between this model and RESCAL is that the novel model indexes only those latent entity embeddings related to the relation type k, instead of indexing the whole matrix A, as shown in Equation (21), in RESCAL.
Here, A_{[head_k,:]} denotes indexing the head_k rows of the matrix A, and A_{[tail_k,:]} denotes indexing the tail_k rows. As a result of this simplification, the model has a shorter iteration time and is more suitable for large-scale KGs.
The extensions of neural network-based methods. Detailed description information associated with entities and relations exists in most practical large-scale knowledge graphs. For instance, the entity Elon Musk has the particular description: Elon Musk is an entrepreneur, investor, and engineer. He holds South African, Canadian, and U.S. citizenship and is the founder, CEO, and lead designer of SpaceX. Xie et al. [89] provided a description-embodied knowledge representation learning method (DKRL), which integrates textual information into the representation model. It uses two embedding models to encode the semantic descriptions of entities and thereby enhance representation learning.
In this model, each head and tail entity has two vector representations: a structure-based embedding vector h_s/t_s ∈ R^d, which represents its name or index information, and a description-based vector h_d/t_d ∈ R^d, which captures the descriptive text information of the entity. The relation is also embedded into the R^d feature space. DKRL introduces two types of encoder to build the description-based embeddings: a continuous bag-of-words (CBOW) method and a CNN method. The score function is expressed as a modified version of TransE:

f_r(h, t) = ||h_s + r − t_s|| + ||h_s + r − t_d|| + ||h_d + r − t_s|| + ||h_d + r − t_d||.

Another text-enhanced knowledge graph embedding framework [35] has been proposed that discovers latent semantic information from additional text to refine the performance of original embedding models. It can represent a specific entity or relation with different feature vectors depending on the textual description at hand. Figure 6 shows the overall process of this model. First, by employing an entity linking tool [86], the entity text descriptions at the top of the figure are obtained. To find accurate relation-mention sentences for a specific fact (h, r, t), the candidates must contain both marked entities h and t and one or more hyponym/synonym words corresponding to relation r; this step is called mention extraction. The obtained entity descriptions and mentions are then fed into a two-layer bidirectional recurrent neural network (Bi-RNN) to yield high-level text representations. After that, a mutual attention layer [90], which has achieved success in various tasks, refines these two representations. Finally, the structure-based representations generated by previous models are combined with the learned textual representations, e.g.,

h_final = α h_s + (1 − α) h_d, r_final = α r_s + (1 − α) r_d, t_final = α t_s + (1 − α) t_d,
where α ∈ [0, 1] denotes the weight factor, h s , r s and t s ∈ R d are the embedding vectors learned from the structural information, h d , r d and t d ∈ R d indicate the distributional representations of textual descriptions, and h f inal , r f inal , and t f inal ∈ R d are the text-enhanced representations forming the final output of this model.
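The weight factor α blending structural and descriptive vectors amounts to a convex combination per entity and relation. A minimal sketch, assuming a fixed scalar α (the cited framework may learn the weighting rather than fix it):

```python
import numpy as np

rng = np.random.default_rng(4)
d, alpha = 6, 0.6
h_s = rng.normal(size=d)   # structure-based representation of the head entity
h_d = rng.normal(size=d)   # description-based (textual) representation

# final text-enhanced representation: convex combination with alpha in [0, 1]
h_final = alpha * h_s + (1 - alpha) * h_d
```

With α = 1 the model falls back to the purely structural embedding; with α = 0 it relies entirely on the textual description.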

Relation Path-Based Models
Another category of additional information that can refine the performance of an embedding model is multi-step relation paths, which reveal composite semantic relations between entities. Although some existing studies, such as TransE, have been very successful in modeling triplet facts, they focus only on the one-step relation between two entities. Unlike a one-step relation, a multi-step relation comprises a sequence of relations r_1 → r_2 → ... → r_l and carries additional inference information. As an example, the relational path "Elon Musk −BornIn→ Pretoria −StateIn→ South Africa" implies the relation Nationality between Elon Musk and South Africa, i.e., (Elon Musk, Nationality, South Africa).
Lin et al. [30] first considered introducing multi-step relation paths into knowledge graph representation learning, proposing a path-based embedding model named PtransE. For each triplet (h, r, t), r denotes the direct relation connecting h and t, and P(h, t) = {p_1, p_2, ..., p_N} is the collection of multi-step relation paths between h and t, where p = r_1 → r_2 → ... → r_l is one path instance. The score function F_r(h, t) is defined as:

F_r(h, t) = f_r(h, t) + f_P(h, t),

where f_r(h, t) indicates the normal triplet-based energy score function, equal to Equation (3), and f_P(h, t) reflects the plausibility of (h, t) under the multi-step relation paths:

f_P(h, t) = (1/Z) Σ_{p∈P(h,t)} R(p|h, t) f_p(h, t).

Here, R(p|h, t) indicates the confidence level of path p, acquired via a network-based resource allocation algorithm [91], Z = Σ_{p∈P(h,t)} R(p|h, t) is a normalization factor, and f_p(h, t) signifies the score function for (h, p, t). The remaining problem is how to compose the various relations along a multi-step path into a uniform embedding vector p, as illustrated in Figure 7.
For this challenge, PtransE applies three representative semantic composition operations to fold the relation sequence r_1 → r_2 → ... → r_l into an embedding vector p: addition (ADD), multiplication (MUL), and a recurrent neural network (RNN):

ADD: p = r_1 + r_2 + ... + r_l,
MUL: p = r_1 ∘ r_2 ∘ ... ∘ r_l,
RNN: c_i = f(W [c_{i−1}; r_i]),

where r_i denotes the embedding vector of relation r_i, ∘ is the element-wise product, c_i represents the combined relation vector after the i-th relation (with p = c_l), W indicates a composition matrix, and [c_{i−1}; r_i] signifies the concatenation of c_{i−1} and r_i.
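The three composition operations are easy to state in numpy. This is a toy sketch with random relation vectors and an assumed random composition matrix W, just to make the shapes concrete:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
path = [rng.normal(size=d) for _ in range(3)]   # embeddings of r1 -> r2 -> r3
W = rng.normal(size=(d, 2 * d))                 # RNN composition matrix

p_add = np.sum(path, axis=0)                    # ADD: element-wise sum
p_mul = np.prod(path, axis=0)                   # MUL: element-wise product

c = path[0]                                     # RNN: fold relations left to right
for r_i in path[1:]:
    c = np.tanh(W @ np.concatenate([c, r_i]))   # c_i = f(W [c_{i-1}; r_i])
p_rnn = c
```

ADD and MUL are order-insensitive and parameter-free, while the RNN composition is order-sensitive and learns W, which is why only the RNN variant is grouped with the neural network extensions below.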
The ADD and MUL versions of PtransE are extensions of translation-based methods; the RNN version of PtransE and the similar approach proposed by Neelakantan et al. [92] can be regarded as the neural network extension to multi-step relation paths. Next, we introduce an extension based on tensor factorization. Guu et al. [93] extended RESCAL by importing multi-step relation paths as additional information to enhance its performance. It leverages matrix-multiplication composition to combine the semantic inference information of the relations along a path. The evolved scoring function f_p(h, t) for (h, p, t) is defined as follows:

f_p(h, t) = h^T M_{r_1} M_{r_2} ... M_{r_l} t,

where the matrices M_{r_i} encode the latent semantics of the relations on the path. The rest of the training process is the same as in previous works, and the model achieved better performance in answering path queries.
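The multiplicative path score above is a chain of matrix products between the head and tail vectors. A toy numpy sketch with assumed random relation matrices:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
h, t = rng.normal(size=d), rng.normal(size=d)
M_path = [rng.normal(size=(d, d)) for _ in range(2)]   # matrices of r1, r2 on the path

# compose the relation matrices by multiplication, then score bilinearly
M = np.linalg.multi_dot(M_path)
score = h @ M @ t
```

Composing a two-step path this way gives exactly the RESCAL-style bilinear score one would get if the composed relation had its own matrix M = M_{r_1} M_{r_2}.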

Other Models
Apart from the above methods that use additional information such as textual descriptions and relation paths (Section 2.3), some studies introduce other kinds of information into the triplet facts to extend the traditional methods.
References [94,95] suggest that it is necessary to pay attention to the temporal aspects of triplets. For example, (Steve Jobs, DiedIn, California) happened in 2011; for any date after 2011, it is improper to derive relations such as WorkAt with Steve Jobs as the head entity. These models embed both the observed triplets and their temporal order information. Feng et al. [96] proposed a graph-aware knowledge embedding method that treats a KG as a directed graph, applying the structural information of the KG to generate the representations of entities and relations. For more details about graph embedding, please refer to [36].

Applications Based on Knowledge Graph Embedding
After introducing the existing knowledge graph embedding methods, in this section we explore diverse tasks and applications that benefit from them. Two types of tasks are employed in most proposed methods to evaluate the performance of embedding models: link prediction and triplet classification. In addition, we introduce broader domains in which the KG embedding technique can be employed and make contributions, including intelligent question answering systems, recommender systems, and so forth.

Link Prediction
Link prediction is a common application of knowledge graph representation; the goal is to predict the missing entity given a concrete entity and relation. More exactly, it intends to predict h given (r, t) or t given (h, r), denoted as (?, r, t) and (h, r, ?), respectively. For instance, (?, FounderOf, SpaceX) asks who the founder of SpaceX is, and (SpaceX, LocatedIn, ?) inquires about the location of SpaceX.

Benchmark Datasets
Bordes et al. released two datasets for KG embedding experiments: FB15K [19], extracted from Freebase, and WN18 [21], generated from WordNet (the datasets are available from https://www.hds.utc.fr/everest/doku.php?id=en:transe). These two datasets are the most widely used benchmarks for this application; detailed statistics are shown in Table 5. Given a test triplet (h, r, t), the true tail entity is replaced by every entity e in the candidate set, and the plausibility of each corrupted triplet (h, r, e) is calculated with the score function f_r(h, e). Sorting all of these scores then yields the rank of the correct entity for (h, r, ?). The same procedure applies to the missing-head case, i.e., (?, r, t). Furthermore, two metrics are used to evaluate the performance of models: the averaged rank of the correct entities (Mean Rank) and the proportion of correct entities ranked in the top 10 (HITS@10).
However, not every corrupted triplet is incorrect; some generated triplets also exist in the knowledge graph, and their ranks may be ahead of the original one. To eliminate this problem, all corrupted triplets that appear in the training, validation, or test sets are filtered out. The unfiltered setting is named "Raw" and the filtered one "Filter". It is worth noting that a higher HITS@10 and a lower Mean Rank denote better performance.
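The filtered ranking protocol can be sketched end to end on a toy KG. The scorer below is a stand-in (a hypothetical function where lower means more plausible), not any particular embedding model; the point is the candidate filtering and the rank/metric computation:

```python
import numpy as np

# toy KG: triplets are (h, r, t) entity/relation indices
all_true = {(0, 0, 1), (2, 0, 1), (0, 0, 3)}
test = [(0, 0, 1)]
n_ent = 4

def score(h, r, t):
    """Stand-in scorer: lower = more plausible (true triplets score 0)."""
    return 0.0 if (h, r, t) in all_true else 1.0 + h + t

ranks = []
for (h, r, t) in test:
    # score every candidate tail, filtering out *other* known true triplets
    cand = [(score(h, r, e), e) for e in range(n_ent)
            if e == t or (h, r, e) not in all_true]
    cand.sort()  # ascending: best-scoring candidate first
    ranks.append([e for _, e in cand].index(t) + 1)

mean_rank = float(np.mean(ranks))
hits10 = float(np.mean([rk <= 10 for rk in ranks]))
```

Note that the true answer itself (e == t) must survive the filter; only the other known-true corruptions (here (0, 0, 3)) are removed, which is exactly the difference between the "Raw" and "Filter" settings.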

Overall Experimental Results
The detailed results of the link prediction experiment are shown in Table 6. It can be observed that: • Overall, knowledge graph embedding approaches have made impressive progress over the years. For instance, HITS@10 (%) on WN18 has improved from the initial 52.8% yielded by RESCAL to the 96.4% obtained by R-GCN.
• R-GCN achieves the best performance on the WN18 dataset, but it is not among the best models on the other dataset. The reason is that R-GCN has to collect all information about the neighbors connected to a specific entity by one or more relations. WN18 has only 18 relation types, so this is easy to compute and generalize; FB15K, however, has 1345 relation types, and the computational complexity grows sharply for R-GCN, which is why its performance declines. • QuatE is superior to all existing methods on the FB15K dataset and is also the second-best performer on WN18. This demonstrates that capturing the hidden inter-dependency between entities and relations in a four-dimensional space benefits knowledge graph representation.

• Compared with the triplet-based models, the description-based models do not yield higher performance in this task. This reveals that the external textual information is not yet fully utilized and exploited; researchers could take advantage of this external information to improve performance in the future.

• In the past two years, the performance of models has not improved much on these two datasets. The most likely reason is that existing methods have already approached the upper bound of achievable performance, so this field needs new evaluation metrics or benchmark datasets to move forward.

Triplet Classification
Triplet classification, proposed by Socher et al. [20], is a binary classification task that estimates the authenticity of a given triplet (h, r, t).

Benchmark Datasets
Similar to link prediction, this application also has two benchmark datasets, named WN11 and FB13, extracted from WordNet and Freebase, respectively (the datasets are available from http://www.socher.org/index.php). Detailed statistics of the two datasets are shown in Table 5.

Evaluation Protocol
Given a test triplet (h, r, t), a score is calculated via the score function f_r(h, t). If this score is above a relation-specific threshold σ_r, the corresponding triplet is classified as positive; otherwise, it is classified as negative. The value of σ_r is determined by maximizing the classification accuracy on the validation dataset.
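Choosing σ_r by maximizing validation accuracy can be sketched in a few lines. The scores and labels below are made-up toy values (here higher score = more plausible, matching the "above the threshold is positive" rule in the text):

```python
import numpy as np

# validation scores for one relation: (score, label), label 1 = positive triplet
valid = [(0.9, 1), (0.7, 1), (0.4, 0), (0.2, 0)]

def accuracy(th):
    # classify as positive when the score exceeds the threshold th
    return float(np.mean([(s > th) == bool(y) for s, y in valid]))

# relation-specific threshold: the candidate value maximizing validation accuracy
sigma_r = max((s for s, _ in valid), key=accuracy)
```

A real implementation would sweep a denser grid of candidate thresholds, but since accuracy only changes at the observed scores, checking those values is sufficient here.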

Overall Experimental Results
Detailed results of the triplet classification experiment are shown in Table 7. It can be observed that:

• In summary, these knowledge graph representation learning models have achieved a greater improvement on the WN11 dataset than on FB13, because FB13 has twice as many training samples as WN11 while the two datasets have a similar number of relations. FB13 thus provides more data to train embedding models, which improves their generalization ability and narrows the performance gap between them.

• Among the triplet-based models, TransG outperforms all existing methods on these benchmark datasets, revealing that modeling multiple semantics for each relation refines model performance. • Similar to the previous task, the description-based models do not yield impressive improvements in triplet classification either. In recent years especially, few articles have utilized additional textual or path information to improve model performance; there is still considerable room for improvement through additional information in knowledge graph embedding.

Other Applications
Apart from the aforementioned applications that are appropriate for KG embeddings, there are other wider fields to which KG representation learning could be applied and play a significant role.
Question answering (QA) systems are a perennial topic in artificial intelligence; the goal is to teach machines to understand questions posed in natural language and return precise answers. In the past few years, QA systems based on KGs have received much attention, and several studies have been proposed in this direction [97,98]. The core idea of these methods is to embed both the KG and the question into a low-dimensional vector space so that the embedding of a question lies as close as possible to that of its answer. This technology can also be used in recommender systems [99,100], which advise users on items they may want to purchase or hold, and in other promising application domains.
However, the applications based on KG embedding are still in their initial stages; in particular, there are few related studies in external applications, such as question answering, and the domains in which researchers are concerned are very limited. Thus, this direction holds great potential for future research.

Conclusions and Future Prospects
In this paper, we provide a systematic review of KG representation learning. We introduce the existing models in concise words and describe several tasks that utilize KG embedding. More particularly, we first introduce the embedding models that only apply triplet facts as inputs, and further mention the advanced approaches that leverage additional information to enhance the performance of the original models. After that, we introduce a variety of applications including link prediction and triplet classification, etc. However, the studies on KG embedding are far from mature, and extensive efforts are still required in this field.
To the best of our knowledge, there are three research directions that can be extended: (i) Although the models that utilize additional semantic information are more effective than the triplet fact-based models, the types of information they can incorporate are quite limited. Other multivariate information, such as hierarchical descriptions of entities/relations in the KG, textual information from the Internet, and even information extracted from other KGs, could also be applied to refine the representation performance of embedding models. (ii) The capability of knowledge inference is also a significant aspect of knowledge graph embedding models. For instance, when there is no direct relationship between a head entity h and a tail entity t in the original KG, we can still explore the inherent connection between the two entities with a KG embedding model; this would greatly benefit question answering systems and other applications. Nonetheless, when the relation paths between entities become longer, existing models cannot effectively handle complex multi-step relation path problems. Using deep learning technology to integrate all relevant semantic information into a uniform representation is one possible solution to this issue. (iii) Although KG representation learning currently applies to relatively few domains/applications, new technologies such as transfer learning [101] could let existing models be applied to new fields with minor adjustments; we believe this technique will expand to more domains and bring great improvements over traditional methods. We hope that this brief outlook provides new ideas and insights for researchers.

Conflicts of Interest:
The authors declare no conflict of interest.