On Training Knowledge Graph Embedding Models

: Knowledge graph embedding (KGE) models have become popular means for making discoveries in knowledge graphs (e


Introduction
The recent advent of knowledge graph embedding (KGE) models has allowed for scalable and efficient manipulation of large knowledge graphs (KGs) such as RDF Graphs, improving the results of a wide range of tasks such as link prediction [1][2][3], entity resolution [4,5] and entity classification [6]. KGE models operate by learning embeddings in a low-dimensional continuous space from the relational information contained in the KG while preserving its inherent structure. Specifically, their objective is to rank knowledge facts-relational triples (s, p, o) connecting subject and object entities s and o by a relation type p-based on their relevance. Various interactions between their entity and relation embeddings are used for computing the knowledge fact ranking. These interactions are typically reflected in a model-specific scoring function.
For instance, TransE [1] uses a scoring function defined as the distance between the o embedding and the translation of the embedding associated with s by the relation type p embedding. DistMult [7], ComplEx [8] and HolE [9] use multiplicative composition of the entity embeddings and the relation type embeddings. This leads to a better reflection of the relational semantics and leads to state-of-the-art performance results (refer to [10] for a review). Although there is a growing body of literature proposing different KG models (mostly focusing on the design of new scoring functions), other parts of the knowledge graph embedding learning process, e.g., loss functions, and negative sampling strategies have not received much attention to date [11].
This has already been shown to influence the behaviour of the KGE models. For instance, Ref. [12] observed that, despite the different motivations behind HolE and CompleEx models, they have equivalent scoring functions. However, their performance still differs. Trouillon et al. [13] have concluded that this difference is caused by the fact that HolE uses a max-margin loss while ComplEx a log-likelihood loss. This shows that cost functions are important for thorough understanding, and even improvement of the performance of different KGE models.

1.
We provide a comprehensive analysis of training loss functions as used in several state-of-the-art KGE models in Section 3. We also preform an empirical evaluation of different KGE models with different loss functions and we show the effect of these losses on the KGE models predictive accuracy.

2.
We study negative sampling strategies and we examine their effects on the accuracy and scalability of KGE models. 3.
We study the effects of changes in the different hyperparameters and their effects on the accuracy and scalability of KGE models during the training process.

Background
In this section, we discuss the concepts and notation used throughout the study. We discuss knowledge graph embedding models, learn to rank training objectives and the different metrics for evaluating knowledge graph embedding models in the task of link prediction. We use the following notation in the rest of the paper. Let X = {x 1 , x 2 , ..., x n } be a set of objects (triples) to be ranked. Let l : X → N be a labelling function where l(x) is the label of object x (e.g., true/false or an arbitrary integer in case of multi-label problems). By default, we assume a binary labelling for triples, l : X → {0, 1}, where 0 and 1 represent false and true labels, respectively.
Let F = { f 1 , f 2 , ..., f n } be the set of possible scoring functions. Given a KGE model, f : X → R is its scoring function, where f aims to score triples in X such that positive triples are scored higher than negative ones, formally, ∀x i , x j ∈ X l(x i ) > l(x j ) =⇒ f (x i ) > f (x j ). Finally, R(x i , f ) denotes the rank position of element x i according to scoring function f , or the position of score f (x i ) in a descending order of all scores f (x) for all x ∈ X.

Loss Functions in Learning to Rank
In learning to rank, models typically use loss functions to learn the optimal scoring function parameters from training data. The learning process then involves minimising a loss function defined on the basis of the objects, their labels, and the scoring function. In learning to rank models, the objective is to score a set of labelled objects such that: for each two objects with different label values, the object with greater label value also has greater model score. Next, we present the several approaches proposed in learning to rank to learn optimal scoring functions.

•
Pointwise approach. The loss function is defined in terms of the difference between the element's predicted score and its actual label value. The formula is as follows: where φ is a transformation function, e.g., the square function φ(x) = x 2 as in Bayes optimal subset ranking [19] and RESCAL KGE model [4].

•
Pairwise approach. The loss is defined as the summation of the differences between the predicted score of an element and the scores of other elements with a smaller labels' value. The formula is as follows: where the function φ can be the hinge function as in Translating Embeddings model [1] or the exponential function as in RankBoost [20]. • Listwise approach. The loss is defined as a comparison between the rank permutation probabilities of model scores and values of actual labels [21]. Let φ(x) be an increasing and strictly positive function. We define the probability of an object being ranked on the top (a.k.a. top one probability), given the scores of all the objects as: where f (x i ) is the score of object i, i = 1, 2, . . . , n. The listwise loss can then be defined as: where L m is a model-dependent loss and l(x) is a labelling function which returns the true label of x. Possible examples include cross-entropy in ListNet [21] or likelihood loss as in ListMLE [22].

Knowledge Graph Embedding Process
Knowledge graph embedding models learn low rank vector representation i.e., embeddings for graph entities and relations. In the link prediction task, they learn embeddings in order to rank knowledge graph facts according to their factuality. The process of learning these embeddings consists of different phases as shown in Figure 1. First, they initialise the embeddings of both relations and entities using random noise. These embeddings are then used to score a set of true and false facts, where the scores of facts are generated by computing the interaction between their subject, predicate and object embeddings using a model dependent scoring function. Finally, embeddings are updated through a gradient descent routine which minimises a training loss that usually represents a min-max loss over the scored facts. The objective is then to maximise the scores of true facts and minimise the scores of other facts.

Negative Sampling
In learning to rank approaches, models use a ranking loss, e.g., pointwise or pairwise loss, to rank a set of true and negative instances [23], where negative instances are generated by corrupting true training facts with a given ratio of negative to positive instances [1]. This corruption happens by changing either the subject or object of the true triple instance. In this configuration, the ratio of negative to positive instances is traditionally learnt using a grid search, where models compromise between the accuracy achieved by increasing the ratio and the runtime required for training.
On the other hand, multi-class based models train to rank positive triples against all their possible corruptions as a multi-class problem, where the range of classes is the set of all entities. For example, training on a triple (s, p, o) is achieved by learning the right classes "s" and "o" for the pairs (?, p, o) and (s, p, ?), respectively, where the set of possible classes is E of size N e . Despite the enhancement of predictions' accuracy achieved by such approaches [14,15], their negative sampling procedure is exhaustive and requires a high space complexity due to the usage of the whole entity vocabulary per each triple.

Embedding Interactions
After KGE models produce negative samples from input triples, they generate scores for both the true and negative (corrupted) triples. These scores are generated using embedding interaction function, i.e., scoring functions. First, the model looks up the embeddings of the subject, predicate and object of triples. Then, the model uses an embedding interaction function to learn a score for each triple using its embeddings.
The embedding interaction functions are model-dependent and they operate using different approaches such as embedding translation [1], linear products [7] and convolutional filters [14]. For example, the TransE model uses a translation-based scoring function which encodes embedding interactions as a translation from the subject embedding vector to the object embedding vector through the predicate vector [1]. Such an approach allowed highly scalable knowledge graph embedding with linear time and space complexity. However, it suffered from limited ability to encode 1-to-many predicates in knowledge graphs due to dependence on direct additive translations [7]. Later models such as DistMult used a linear product-based scoring functions which allowed better encoding of 1-to-many predicates while preserving the linear time and space complexity. However, DistMult model's scoring function suffered of limited ability to preserve the predicate direction due to a dependence on a symmetric operation [7].
More recent approaches, such as ComplEx [8], ConvE [14], TriModel [17], among others, propose new scoring mechanisms which allow for encoding both 1-to-many relations and preserve the predicate directionality within linear time and space complexity. Since these scoring functions are well covered in previous studies, we will not discuss the details of their mechanisms-of-action in our study. For further information and technical details about these knowledge graph embedding scoring functions, we refer the readers to the reviews of Nickel et al. [9] and Wang et al. [10].
In the following, we provide a further discussion and definition of the embedding interaction functions of the models that we examine in our study, where we discuss them within their embedding strategy. •

Distance-based embeddings' interactions
The Translating Embedding model (TransE) [1] is one of the early models that use distance between embeddings to generate triple scores. It interprets triple's embeddings interactions as a linear translation of the subject to the object such that e s + w p = e o , and generates a score for a triple as follows: where true facts have zero score and false facts have higher scores. This approach provides scalable and efficient embeddings learning as it has linear time and space complexity. However, it fails to provide efficient representation for interactions in one-tomany, many-to-many and many-to-one predicates as its design assumes one object per each subject-predicate combination: •

Factorisation-based embedding interactions
Interactions based on embedding factorisation provide better representation for predicates with high cardinality. They have been adopted in models like DistMult [24] and ComplEx [8]. The DistMult model uses the bilinear product of embeddings of the subject, the predicate, and the object as their interaction, and its scoring function is defined as follows: where e s k is the k-th component of subject entity s embedding vector e s . DistMult achieved a significant improvement in accuracy in the task of link prediction over models like TransE. However, the symmetry of embedding scoring functions affects its predictive power on asymmetric predicates as it cannot capture the direction of the predicate. On the other hand, the ComplEx model uses embedding in a complex form to model data with asymmetry. It models embeddings interactions using the the product of complex embeddings, and its scores are defined as follows: where Re(x) represents the real part of complex number x and all embeddings are in complex form such that e, w ∈ C, e r and e i are respectively the real and imaginary parts of e, and e o is the complex conjugate of the object embeddings e o such that e o = e r o − ie i o and this introduces asymmetry to the scoring function. Using this notation, ComplEx can handle data with asymmetric predicates and keep scores in the real spaces; it only uses the real part of embeddings' product outcome. ComplEx preserves both linear time and linear space complexities as in TransE and DistMult; however, it surpasses their accuracy in the task of link prediction due to its ability to model a wider set of predicate types.

Ranking Evaluation Metrics
Learning to rank models are evaluated using different ranking measures including Mean Average Precision (MAP), Normalised Discounted Cumulative Gain (NDCG), and Mean Reciprocal Rank (MRR) [25]. Below, we discuss MAP and MRR based on a set of queries Q = {q 1 , q 2 , . . . , q n }. (Note that our experiments will also use Hits@k metric for the model comparison; our proposed cost function design is independent of that metric and therefore we do not provide detailed definitions here.)

•
Mean Average Precision (MAP). MAP is a ranking measure that evaluates the quality of a rank depending on the whole rank of its true (relevant) elements. First, we need to define Precision at position k (denoted as P@k): where x ∈ q and I(l, x) is an indicator function that is equal to 1 when x is a relevant element and 0 otherwise. The Average Precision (AP) is defined by: where n is the total number of objects associated with query q, and n 1 is the number of objects with label one. The MAP is then defined as the mean of AP over all queries Q: • Mean Reciprocal Rank (MRR). The Reciprocal Rank (RR) is a statistical measure used to evaluate the response of ranking models depending on the rank of the first correct answer. The MRR is the average of the reciprocal ranks of results for different queries in Q. Formally, MRR is defined as: where x i is the highest ranked relevant item for query q i . Values of RR and MRR have a maximum of 1 for queries with true items ranked first, and get closer to 0 when the first true item is ranked in lower positions. Therefore, we can define the MRR error as 1 − MRR. This error starts from 0 when the first true item is ranked first, and increases towards 1 for less successful rankings. • Hits@k. This metric represents the number of correct elements predicted among the top-k elements in a rank, where we use Hits@1, Hits@3 and Hits@10. This metric indicates that the model's probability of ranking a relevant (true) fact in the top-k element scores in the rank.

Experimental Evaluation
For this study, we have selected three highly studied models in the state of the art and selected five commonly used datasets for link prediction. Our goal here is to learn knowledge graph embeddings for each dataset applying different losses and analysing their performance. We describe the setup of our experiments using the TransE [1], DistMult [7], and ComplEx [8] KGE models. Then, we present the characteristics of the benchmarking datasets, and also provide the relevant implementation details. Table 1 contains statistics of the five benchmarking datasets used in our experiments (All the benchmarking datasets can be downloaded using the following URL: https: //doi.org/10.6084/m9.figshare.14213894, accessed on 20 March 2021). Three of them, namely, WN18RR, FB15k-237, and YAGO10 are among the most frequently used datasets for benchmarking KGE models. In addition, here we consider the NELL239 and PSE, which contain a higher number of entities and triples for studying the scalability of KGE models. Benchmarking Datasets. In our experiments, we use five knowledge graph benchmarking datasets: • NELL239: a subsets of the NELL dataset [26,27] which contains general knowledge about people, places, teams, universities, among other entity types [17]. • WN18: a subset of the WordNet dataset [28] containing lexical information of the English language [1,14]. • FB15k-237: a subset of the Freebase dataset [29] that contains information about general human knowledge [30]. • YAGO10: a subset of the YAGO3 dataset [31] containing information mostly about people and their citizenship, gender, and professions knowledge [32]. • PSE: a poly-pharmacy side-effects dataset [33] containing facts about drug combinations and their related side-effects. The dataset was introduced by Zitnik et al. [33] to study modelling poly-pharmacy side-effects using knowledge graph embedding models. Since the dataset is significantly larger than the available standard benchmark we use it to study the effects of hyperparameters and accuracy of the knowledge graph embedding models.
Evaluation Protocol. We evaluate the KGE models using a unified protocol that assesses their performance in the task of link prediction. Let X be the set of facts, i.e., triples; Θ E be the embeddings of entities E, and Θ R be the embeddings of relations R.
The KGE evaluation protocol works in three steps: (1) Corruption: For each x = (s, p, o) ∈ X, x is corrupted 2|E| − 1 times by replacing its subject and object entities with all the other entities in E. The corrupted triples can be defined as: where s s and o o. These corruptions effectively provide negative examples for the supervised training and testing process due to the Local Closed World Assumption [34], frequently adopted for knowledge graph mining tasks.
(2) Scoring: Both original triples and corrupted instances are evaluated using a modeldependent scoring function. This process involves looking up embeddings of entities and relations, and computing scores depending on these embeddings. (3) Evaluation: Each triple and its corresponding corruption triples are evaluated using the RR ranking metric as a separate query, where the original triples represent true objects and their corruptions false ones. It is possible that corruptions of triples may contain positive instances that exist among training or validation triples. In our experiments, we alleviate this problem by filtering out positive instances in the triple corruptions. Therefore, MRR and Hits@k are computed using the knowledge graph original triples and non-positive corruptions only [1].

Loss Functions in KGE Models
Generally, KGE models are cast as learning to rank problems. They employ multiple training loss functions that comply with the ranking loss approaches. In the state-of-the-art KGE models, loss functions have been designed according to various pointwise and pairwise approaches that we review next.

KGE Pointwise Losses
First, we discuss current pointwise loss functions for KGE models including SE, hinge, and logistic losses. We then propose a new pointwise loss function, namely, the Pointwise Square Loss (PSL) that combines the square growth of SE and the configurable margin of hinge loss.

•
Pointwise square error loss (SE). It is a pointwise ranking loss function used in RESCAL [4]. It models training losses with the objective of minimising the squared difference between model predicted scores for triples and their true labels: The optimal score for true and false facts is 1 and 0, respectively. A nice to have characteristic of the SE loss is that it does not require configurable training parameters, shrinking the search space of hyperparameters compared to other losses (e.g., the margin parameter of the hinge loss). • Pointwise hinge loss. Hinge loss can be interpreted as a pointwise loss, where the objective is to generally minimise the scores of negative facts and maximise the scores of positive facts to a specific configurable value. This approach is used in HolE [9], and it is defined as: where Pt is an abbreviation for pointwise to clarify the type of the loss, l(x) = 1 if x is true and −1 otherwise, and [x] + denotes the max(x, 0) function. This effectively generates two different loss slopes for positive and negative scores as shown in Figure 2. Thus, the objective resembles a pointwise loss that minimises negative scores to reach −λ, and maximises positives scores to reach λ. • Pointwise logistic loss. The ComplEx [8] model uses a logistic loss, which is a smoother version of pointwise hinge loss without the configurable margin parameter. Logistic loss uses a logistic function to minimise the negative triples score and maximise the positive triples score. This is similar to hinge loss, but uses a smoother linear loss slope defined as: where l(x) is the true label of fact x, which is equal to 1 for positive facts and equal to −1 otherwise.

KGE Pairwise Losses
Here, we discuss established pairwise loss functions in KGE model which are summarised in Figure 3.
• Pairwise hinge loss. Hinge loss is a linear learning to rank loss that can be implemented in both a pointwise or pairwise loss settings. In both, the TransE [1] and DistMult [7] models the hinge loss is used in its pairwise form and defined as follows: where the term Pr is an abbreviation for pairwise, X + is the set of true facts, X − is the set of false facts, and λ is a configurable margin that separates positive from negative facts.
In this case, the objective is to minimise the marginal difference (difference of scores with the added margin) between the scores of negative and positive instances. This approach optimises towards having embeddings that satisfy Figure 3.
• Pairwise logistic loss. Logistic loss can also be interpreted as pairwise margin based loss following the same approach as in hinge loss. The loss is then defined as: where the objective is to minimise the marginal difference between negative and positive scores with a smoother linear slope than hinge loss as shown in Figure 3.

KGE Multi-Class Losses
In the following, we discuss KGE loss functions that are used to cast the KGE process into a multi-class classification problem.

•
Binary cross entropy loss (BCE). The authors of the ConvE model [14] proposed a new binary cross entropy multi-class loss to model the training error of KGE models in link prediction. In this setting, the whole vocabulary of entities is used to train each positive fact such that for a triple (s, p, o), all facts (s, p, o ) with o ∈ E and o o are considered false. The BCE loss can be defined as follows: where l spo is the true label of triplet (s, p, o). Despite the extra computational cost of this approach, it allowed ConvE to generalise over a larger sample of negative instances, therefore surpassing other approaches in accuracy [14].

•
Negative-log softmax loss (NLS). In a recent work, Lacroix et al. [15] introduced a soft-max regression loss to model training error of the ComplEx model as a multi-class problem. In this approach, the objective for each (s, p, o) triple is to minimise the following loss: where φ spo is the model score for the triple (s, p, o), and s ∈ E, s s, o ∈ E and o o. This resembles a log-loss of the soft-max value of the positive triple compared to all possible object and subject corruptions where the objective is to maximise positive fact scores and minimise all other scores. This approach achieved significant improvement to the prediction accuracy of ComplEx model over all benchmark datasets when used with the 3-nuclear norm regularisation of embeddings [15].

Effects of Training Objectives on Accuracy
We performed an experimental evaluation for the effect of loss function on the accuracy of KGE models in the link prediction task in terms of MRR and Hits@10. For simplicity, we have only experimented with three KGE models: TransE, DistMult and ComplEx. These models are used as examples where we assess their performance on different benchmarks under different loss function configurations. Table 2 summarises the outcome of our experiments, comparing the accuracy of the models on both ranking and multi-class loss approaches in terms of MRR and Hits@10. Interestingly, in the context of the ranking loss configuration, we can observe that the default loss functions of the examined models do not always yield the best results. On the contrary, The DistMult and ComplEx models which by default use the pointwise hinge and logistic losses, respectively, obtain their best result using the pointwise square error loss with all the examined benchmarks. This shows that changing the default loss function of these models can help enhance their predictive accuracy. The results of the TransE model also show that its default loss configuration achieves the best results on 4 out of 6 examined evaluation metrics. On the other hand, the pairwise logistic loss configuration of TransE achieves best results in 3 out of 6 of the examined evaluation metrics. Given that the pairwise logistic loss does not require parameters-unlike the margin-based hinge loss-the pairwise logistic loss can be a preferred configuration for TransE as it can significantly reduce the grid search time.
Our results also show that the multi-class loss versions of the CP, DistMult, ComplEx models have significantly better results than their ranking based losses. For example, on the NELL239 dataset, the best performing ranking-based approach, the complex model with the pointwise square error loss, achieves 0.35 and 0.52 scores in terms of MRR and Hits@10, respectively, compared to its NLS loss version which achieves 0.40 and 0.58 scores, respectively. The results also show that the best multi-class loss results are obtained using the negative log soft-max loss. It is noteworthy that both multi-class based versions of the DistMult and ComplEx achieve their best results using the negative soft-max loss.

Effects of Training Objectives on Scalability
We have shown that different training objectives yield significantly different results for the same KGE models. Our experimental results also suggested that the multi-class loss functions achieve the best results in terms of MRR and Hits@10 on all the investigated datasets. However, this approach uses a 1-vs.-all negative sampling strategy which is time-consuming due to its higher time and space complexity compared to usual 1-vs.-n sampling. In the following, we compare the ranking losses and multi-class losses in terms of the runtime required for training a KGE on different dataset size to study the scalability of both approaches. Table 2. Link prediction results for KGE models with different loss functions on standard benchmarking datasets. The abbreviations MC, Pr, Pt stand for multi-class, pairwise and pointwise, respectively. The * mark is used to denote the model's original loss function as first proposed by the authors. In the ranking losses, the best results are computed per model and highlighted using bold font, and underlined values represent the best result in each respective loss approach. We execute an experiment using the YAGO10 benchmark, where we train KGE models on different percentages of the dataset and study the relation between the growth of the dataset size and the required training runtime. Moreover, we compare the training runtime of different KGE models with the negative soft-max loss (NLS) and the pointwise square error loss as representatives of their respective loss class. Figure 4 shows the outcomes of our experiments, presenting a series of plots which describe the relation between the growth of the dataset size and the growth of training runtime of both ranking and multi-class loss approaches. As expected, the results show that the multi-class losses have significantly higher training runtime than ranking losses for all the KGE models. The runtime of the ranking loss functions appears to be constant; however, it is growing with a constant increase related to the growth of the training data. On the other hand, the multi-class loss functions have a linear growth which correlates to the growth of the training data. This shows the significant difference between both training loss approaches in terms of the scalability of the training process.

Model
The results also show that the multi-class losses have different growth slopes than the DistMult, ComplEx and TriModel approaches.

KGE Training Hyperparameters
In this section, we discuss the effects of training hyperparameters on both the accuracy and scalability of knowledge graph embedding models. We first discuss the effects in terms of scalability and we then discuss the implications of changes of hyperparameters on the accuracy of KGE models in the task of link prediction.

Training Hyperparameters Effects on KGE Scalability
Knowledge graph embedding models are famous for their high quality predictions with high scalability [34]. Most of the KGE methods employ linear transformations such as vector translations and vector diagonal products to learn interactions between embeddings, therefore, they operate within linear time and space complexity [8,15,17].
Despite the high scalability of the training process KGE models, they require the hyperparameters' training routine which is time consuming due to the large hyperparameters' search space. Traditionally, the hyperparameters search is executed using grid search to find the best hyperparameters for each model on each new dataset. The training hyperparameters of KGE models include embedding size, negative samples per positive, batch size, etc. Kadlec et al. [16] have shown that minor changes to these hyperparameters can yield significantly different results in terms of the models' resulting accuracy. The changes in these hyperparameters also have an impact on their runtime where changes in hyperparameters such as the embedding size and the number of sampled negatives can affect the memory space used during training.
We performed an experimental evaluation for four different KGE models: TransE, DistMult, Complex and TriModel, where we examine the effects of changes of the training hyperparameters and data size on their training runtime. We use the PSE benchmarking dataset [33]-our largest benchmarking dataset-to show the effect of hyperparameters on training runtimes of KGE models. Figure 5 shows the outcome results of our experiments across the different investigated training hyperparameters. In these experiments, we use a set of fixed hyperparameters (embedding size = 150, batch size = 2048, negative samples = 2, training iterations = 500) and we use a set of values for each of the investigated hyperparameter in their respective experiment to show the relation between the changes of the hyperparameters and the KGE model's accuracy and runtime. The plot "A" shows the relation between the training runtime and the size of the processed data. The plot shows that all the four investigated models have a linear relation between their training runtime and the data size. We can also observe that the models have a consistent growth in terms of their runtime across all the data sizes. The DistMult model consistency achieves the smallest runtime followed by the TransE, TriModel and ComplEx models. Plot "B" shows the relationship between the training runtime and the model embedding size. The plot shows that all the investigated models have a linear growth of their training runtime corresponding to the growth of the embeddings size. However, the growth rate of the TransE and DistMult models is considerably smaller than the growth of both the ComplEx and TriModel models. This occurs as both the TransE and DistMult models use a single vector to represent each of their embeddings while the ComplEx and TriModel models use two and three vectors, respectively. Despite the better scalability of both the TransE and DistMult models, the ComplEx and TriModel models generally achieve better predictive accuracy [17].
The plot "C" shows the relation between the runtime of KGE models and the number of negative samples they use during training. The plot shows that there is a positive linear relation between training runtime and the number of negative samples-where all the KGE models have similar results across all the investigated sampling sizes. The TriModel, however, consistently have the highest runtime compared to other models.
Plot "D" shows the effects of the size of the batch on the training runtime. The plot shows an exponential decay of the training runtime with the linear growth of the data batch size. The KGE models process all the training data for each training iteration (i.e., epoch), where the data are divided into batches for scalability and generalisation purposes. Therefore, the increase of the training data batch sizes leads to a decrease of the number of model executions for each training iteration. Despite the high scalability that can be achieved with large batch sizes, the best predictive accuracy is often achieved using small data batch sizes. Usually, the most efficient training data batch size is chosen during a hyperparameter grid search along with other hyperparameters such as the embedding size and the number of negative samples.
The runtime of the models reported in Figure 5 looks very similar or almost identical for all the models as they all operate using the same training procedure, where they differ only on the way they represent the embeddings and compute their interactions. This is significantly shown in the plot (B) in Figure 5, which shows a significant difference in terms of runtime between models that use single embedding vectors such as TransE and DistMult and other models which use multi-vector embeddings such as ComplEx and TriModel.
The reported predictive accuracy of the models in terms of the MRR scores in Figure 6 looks similar as the models parameters were not properly tuned. The reported models are also known to produce very similar or almost identical results on the WN18RR benchmark [17]. The purpose of the figure is also to study the change of the models predictive accuracy in relation to the change of the values of the models' hyperparameters and not the best possible model accuracy.

Analysis of the Predictive Scalability Experiments
Our experiments on the effects of training parameters of KGE models on the models' scalability suggest the following: • The results confirm that KGE models have linear time complexity as shown in Figure 5 where the models' runtime grows linearly corresponding to increase of both data size and embedding vector sizes.

•
The results also confirm that models such as the TriModel and Complex model which have more than one embedding vector for each entity, and relations require more training time compared to models with only one embedding vector.

•
The results also show that the training batch size has a significant effect on the training runtime, therefore, using larger batch sizes are suggested to significantly decrease the training runtime of the training of KGE models.

Training Hyperparameters' Effects on KGE Accuracy
In the following, we study the relation between changes in different training hyperparameters and the accuracy of KGE models. We perform an experimental evaluation where we examine the changes of the accuracy of KGE models in terms of MRR compared to the changes of the model's training hyperparameters. Figure 6 shows the outcome of our experiments where the plots present the changes on MRR corresponding to changes of the training hyperparameters of different KGE models on the NELL239, WN18RR, FB15k-237 and YAGO10 datasets.
The datasets reported in Figure 6 are sorted in a descending order from top to down in terms of the number of facts contained in the dataset (dataset size). The first row of plots corresponds to the experiments which are done on the NELL239 dataset (the smallest dataset); these results show that changing the number of training iteration (epochs) and the embedding size has a significant effect on models' accuracy. On the other hand, the negative samples and batch size hyperparameters have less effect on the models' MRR score where the change in the batch size and the number of negative samples insignificantly affects the MRR score-except by the TransE model which has a positive MRR score relation with the number of negative samples.
The results corresponding to the WN18RR dataset also show that the epoch count and embedding size have a significant effect of the model's accuracy. We can also observe that both hyperparameters have a positive relation with the MRR score of KGE models compared to the variable relation on the NELL dataset. The results also show that the MRR score of KGE models stabilises after 250 training iterations with an embedding size of 50-increases above these values do not have significant effects on the MRR score of models. On the other hand, the changes of the batch size and number of negative samples follow a similar pattern as in the NELL239 dataset: the models' MRR scores have no significant relation with the changes of both hyperparameters. However, an exception was observed for the TransE model, which has a positive MRR score relation with the number of negative samples.
The results of the YAGO10 dataset (the largest dataset in this experiment) show a positive relation between the number of training iterations and the MRR score of the TransE, TriModel and ComplEx models. Conversely, the DistMult model has a negative relation with the the number of training iterations. These results show that all models have a positive relation between their MRR scores and the size of the embeddings. For the batch size and negative samples' hyperparameters, we can observe that there is a lower correlation to the MRR scores as in the previous dataset; however, there is a noticeable low positive correlation between the number of negative samples and the MRR scores of the TransE, TriModel and ComplEx models.
Finally, based on the observations previously presented, we can provide a list of practical suggestions for anyone experimenting with KGE models:

•
Changes on the embedding vectors' size have the biggest effect on the predictive accuracy of KGE models. Thus, we suggest to carefully select this parameter by searching through a larger search space, which can help with ensuring that the models can reach their best representations of the knowledge graph.

•
The increased number of training iterations can sometimes have a negative effect on the outcome predictive accuracy. Thus, we suggest using early stopping techniques to decide when to stop model training before accuracy starts to decrease. • Both the number of negative samples and batch sizes showed a small effect on the predictive accuracy of KGE models. Thus, this parameter can either be assigned fixed values or be found using a small search spaces to help decrease the hyperparameters search space, and thus the hyperparameters' tuning runtime.

Discussion
In this section, we discuss the compromise between scalability and accuracy in the training of KGE models. We also discuss the properties of some datasets and their relation with KGE interaction functions. We finally discuss the compatibility between specific KGE scoring and loss functions.

The Compromise between Scalability and Accuracy
We have shown that KGE models achieve their best result in terms of accuracy using multi-class loss functions. However, these functions depend on the 1-vs.-all negative sampling which is time-consuming as we have shown in Section 3.5. On the other hand, KGE models with ranking-based loss function are significantly more scalable, but they have less accurate predictions compared to the multi-class losses. This variability between the capabilities of the two approaches results in a compromise between the scalability and accuracy of KGE models when choosing loss functions for KGE models. In our experiments, we found that the training runtime of multi-class loss functions is affected by the entity count in the dataset along with the dataset size where datasets with a higher number of entities require more training time than others even if they have the same size.
We ran all our experiments on GPU where we found out that the multi-class based models consume a large amount of the GPU memory. This, therefore, forced us to use small training batch sizes to fit to the GPU's memory especially on large datasets. The use of these smaller batches resulted in longer training runtime due to the increased number of training iterations over the batches. Wang et al. [10] provided a detailed study of the theoretical time-complexity of such approaches. On the other hand, ranking based loss functions have significantly lower GPU memory consumption compared to the multi-class loss functions. However, the memory consumption grows positively with relation to the number of used negative samples.
We thus suggest that KGE models with multi-class losses can be used comfortably used for training of small size (less than 5M facts) knowledge graphs. We also recommend using multiple GPUs when available for running the grid-search process of KGE models with multi-class objectives.

The Relationship between Datasets and Embedding Interaction Functions
From our experiments, it is noticeable that the tensor factorisation based methods such as the DistMult, TriModel and ComplEx models consistently have better accuracy than distance based models such as the TransE model on all benchmark in the ranking losses' configuration. However, we can also see that the TransE model significantly outperforms all other ranking based models on the FB15k-237 dataset.
A further study of Nguyen et al. [35] also shows that translation based methods achieve significantly higher accuracy than tensor factorisation based methods in terms of both MRR and Hits@10. We think that this can be due to specific properties of the dataset which is compatible with translation based embedding interaction approaches compared to tensor factorisation methods. We also intend to study this specific relation in future works where we intend to investigate different properties of knowledge graph and their possible relations to specific KGE embedding components.

Compatibility between Scoring and Loss Functions
In the ranking loss functions experiments, we can see that the TransE model achieves its best result using pairwise loss functions while its version with the pointwise loss function has significantly worse results. On the other hand, other tensor factorisation based approaches such as the DistMult and ComplEx models achieve their best results with their version which uses pointwise loss functions such as the pointwise squared error and logistic losses. The pairwise loss function versions of these models also have significantly worse results in terms of both the MRR and Hits@10 metrics on all benchmarks as shown in Table 2.

Limitations of Knowledge Graph Embedding Models
Despite the high accuracy and scalability of knowledge graph embedding models, they suffer from various limitations that we discuss in the following: • Lack of interpretability. In knowledge graph embedding models, the learning objective is to model nodes and edges of the graph using low-rank vector embeddings that preserve the graph's coherent structure. The embedding learning procedure operates mainly by transforming noise vectors to useful embeddings using gradient decent optimisation on a specific objective loss. These procedures, however, work as a black box that is hard to interpret compared to other association rule mining and graph traversal approaches that can be interpreted based on the features they use.

•
Sensitivity to data quality. KGE models generate vector representations of entities according to their prior knowledge. Therefore, the quality of this knowledge affects the quality of the generated embeddings. • Hyperparameter sensitivity. The outcome predictive accuracy of KGE embeddings is sensitive to their hyperparameters [16]. Therefore, minor changes in these parameters can have significant effects on the outcome predictive accuracy of KGE models. The process of finding the optimal parameters of KGE models is traditionally achieved through an exhausting brute-force parameter search. As a result, their training may require rather time-consuming grid search procedures to find the right parameters for each new dataset.

Future Directions
We intend to extend the scope of this study to examine a new set of models such as the convolution based models and graph neural network based KGE models to further validate the findings of our study. We also intend to examine re-evaluating state-of-the-art approaches with various loss function configurations to see how these loss objectives can affect their predictive accuracy on the standard evaluation benchmarks.

Conclusions
In this study, we have examined different approaches for defining training objectives in KGE models where we have studied their effects on the models' accuracy and scalability. We have then shown by experimental evaluation that most of the studied KGE models achieve better accuracy in the link prediction task using a different loss function than their originally published functions.
We have also studied the differences between multi-class based loss functions and ranking based loss functions and their associated negative sampling strategies with a focus on their relation to the scalability and accuracy of the KGE models. Finally, we have studied the hyperparameter tuning process in the KGE models' training process, and we examined the effect of different training hyperparameters on KGE models' accuracy and scalability.
Author Contributions: S.K.M. and E.M. theoretical study, analysis, findings and wrote the manuscript. V.N. fund acquisition, design review, manuscript review and supervise; All authors discussed the results and contributed to the final manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: Insight Centre for Data Analytics at the National University of Ireland Galway, Ireland is supported by the Science Foundation Ireland grant (12/RC/2289_P2).

Data Availability Statement:
All the benchmarking datasets used in this study can be downloaded using the following URL: https://doi.org/10.6084/m9.figshare.14213894, accessed on 20 March 2021.

Conflicts of Interest:
The authors declare no conflict of interest.