IMF: Interpretable Multi-Hop Forecasting on Temporal Knowledge Graphs

Temporal knowledge graphs (KGs) have recently attracted increasing attention. The temporal KG forecasting task, which plays a crucial role in applications such as event prediction, predicts future links based on historical facts. However, current studies pay scant attention to the following two aspects. First, the interpretability of current models is manifested in providing reasoning paths, which is an essential property of path-based models. However, the comparison of reasoning paths in these models operates in a black-box fashion. Moreover, contemporary models utilize separate networks to evaluate paths at different hops. Although the network for each hop has the same architecture, each network learns different parameters for better performance. Different parameters cause identical semantics to receive different scores, so models cannot measure identical semantics at different hops equally. Inspired by the observation that reasoning based on multi-hop paths is akin to answering questions step by step, this paper designs an Interpretable Multi-Hop Reasoning (IMR) framework based on consistent basic models for temporal KG forecasting. IMR transforms reasoning based on path searching into stepwise question answering. In addition, IMR develops three indicators according to the characteristics of temporal KGs and reasoning paths: the question matching degree, answer completion level, and path confidence. IMR can uniformly integrate paths of different hops according to the same criteria; it can provide reasoning paths similarly to other interpretable models and further explain the basis for path comparison. We instantiate the framework with common embedding models such as TransE, RotatE, and ComplEx. While being more explainable, these instantiated models achieve state-of-the-art performance against previous models on four benchmark datasets.


Introduction
Knowledge graphs (KGs) are collections of triples, such as Freebase [1] and YAGO [2]. Temporal KGs introduce a new dimension into static knowledge graphs [3], i.e., a timestamp for each triple to form a quadruple. Although there are billions of triples in temporal KGs, they are still incomplete. These incomplete knowledge bases will lead to limitations in practical applications. Since temporal KGs involve the time dimension, the completion of temporal KGs can be divided into interpolation and forecasting. The former utilizes the facts of all timestamps to predict the triples at a particular moment; the latter employs historical facts to predict future triples. Due to the importance of temporal KG forecasting in event prediction, it has attracted growing attention recently. This paper mainly focuses on temporal KG forecasting.
Most current research on temporal KG completion focuses on interpolation [4][5][6][7][8][9][10]. Recently, there have been attempts to investigate temporal KG forecasting [3,4,7,[11][12][13]. According to their interpretability, research on temporal KG forecasting can be divided into two categories. One type is the black-box model, which designs an unexplainable scoring function for quadruples' rationality. The other type is interpretable approaches. CyGNet [11] utilizes one-hop repetitive facts to realize prediction; its performance is limited by the lack of directly repetitive facts at historical moments. xERTR [7], CluSTeR [3], and TITer [14] are all path-based temporal KG forecasting models. xERTR [7] adopts inference subgraphs to aggregate local information around the question. CluSTeR [3] and TITer [14] employ reinforcement learning for path searching and improve performance through temporal reasoning.
Thus far, however, there has been little discussion on the following two aspects. Firstly, uniformly measuring the paths of different hops requires handling the same semantics equivalently at different hops. Current models utilize separate networks to evaluate paths at different hops. Although each hop's network has the same architecture, each network learns different parameters for better performance. Different parameters cause identical semantics to receive different scores, so current models cannot truly compare multi-hop paths according to the same criteria. For example, xERTR [7] simply gathers the scores of different paths for comparison, with combination weights fitted mainly to the training data. Secondly, although current models can provide reasoning paths, the comparison of paths operates in a black-box fashion. The interpretability of current models amounts to providing the reasoning paths, which is an essential property of path-based models. These models lack an explanation of the preference for various paths, i.e., they cannot provide the basis for path comparison.
In practice, forecasting based on path searching aims to find the appropriate multi-hop paths, the combination of whose relations is equivalent to the question's relation. As we observe, reasoning based on multi-hop paths is akin to stepwise question answering. Inspired by stepwise question answering, this paper designs a new Interpretable Multi-Hop Reasoning (IMR) framework based on consistent basic models, which can uniformly integrate the paths of different hops and perform more interpretable reasoning.
The primary pipeline of IMR is as follows. IMR first transforms reasoning based on path searching into stepwise question answering, building on basic KG embedding models [1,[15][16][17][18] and IRN [19]. During the stepwise question answering, the framework calculates the unanswered part of the question after each hop as the new question for the next hop, which is named the remainder of the question in this paper. Moreover, IMR designs three indicators based on the unanswered parts of questions and the inferred tails: the question matching degree, answer completion level, and path confidence. The question matching degree, i.e., the matching degree between the reasoning tails and the original questions, measures the rationality of the new quadruples. The answer completion level, i.e., the matching degree between the relations of paths and those of the questions, measures the answer's completeness. Path confidence, i.e., the difference between the same entities at different timestamps, measures the reliability of the reasoning paths. By combining these indicators, IMR simultaneously achieves unified scoring of multi-hop paths and more explainable reasoning.
The major contributions of this work are as follows. (1) A new Interpretable Multi-Hop Reasoning framework (IMR) is proposed in this paper, which provides a new framework for the specific design of forecasting models. Furthermore, IMR defines three indicators: the query matching degree, answer completion level, and path confidence. (2) Unlike other models that cannot measure the paths of different hops uniformly, IMR can measure the paths of different hops according to the same criteria and utilize multi-hop paths for inference. (3) IMR can provide reasoning paths similarly to other interpretable models and further explain the basis for path comparison. (4) Based on basic embedding models, IMR is instantiated as the specific model. Experiments on four benchmark datasets show that these instantiated models achieve state-of-the-art performance against previous models.

Related Work
Static KG reasoning. Knowledge graph reasoning based on representation learning has been widely investigated by scholars. These approaches to reasoning can be categorized into geometric models [1,17,[20][21][22], tensor decomposition models [15,16,18,23], and deep learning models [24][25][26]. In recent years, some scholars have attempted to introduce GCN into knowledge graph reasoning [27], which can improve the performance of basic models. Some other scholars focus on multi-hop reasoning with symbolic inference rules learned from relation paths [28,29]. The above methods are all designed for static KGs, making it challenging to deal with temporal KG reasoning.
Temporal KG reasoning. Temporal KGs import the time dimension into static KGs, which makes the facts of a specific timestamp extremely sparse. The temporal KG reasoning task can be divided into two categories: reasoning about historical facts [4][5][6][7][8]30], i.e., interpolation on temporal KGs, and reasoning about future facts [3,4,7,11], i.e., forecasting on temporal KGs. The former predicts the missing facts of a specific historical moment based on the facts of all moments, and the latter predicts future events based only on past facts. There are many studies on the task of temporal KG interpolation. However, these studies are all black-box models, which cannot explain their predictions. Most of the proposed models for temporal KG forecasting are also black-box models. BoxTE [31] utilizes box embeddings for temporal KG forecasting, which is expressive and possesses an inductive capacity. Recently, xERTR [7], CluSTeR [3], and TITer [14] were shown to explain predictions to some extent, as they can provide the reasoning paths for their predictions. However, none of these models can truly compare multi-hop paths under the same criteria; their integration of paths is closer to a weighted combination. xERTR and TITer combine the scores of paths with different hops via trained weights. Experiments show that CluSTeR performs worse on paths with multiple hops than on paths with only one hop.
Most current temporal KG forecasting models are black-box models. Only some models can provide reasoning paths for prediction. Moreover, none of them can explain how path comparisons work and none of them can integrate paths of different hops uniformly.

Preliminaries
The task of temporal KG forecasting. Suppose that E, R, and T represent the entity set, relation set, and timestamp set, respectively. A temporal KG is a collection of quadruples, which can be expressed as

K = {(e_s, r, e_o, t) | e_s, e_o ∈ E, r ∈ R, t ∈ T}, (1)

where (e_s, r, e_o, t) denotes a quadruple; e_s and e_o represent the subject and object, respectively; r represents the relation; and t represents the time at which the quadruple occurs. Suppose that the facts happening before a selected time t_k can be expressed as G_{t_k}. Temporal KG forecasting predicts future links based on past facts: it predicts e_o given a question (e_s, r_q, ?, t_q) and the previous facts G_{t_q}, where r_q and t_q denote the relation and timestamp of the question. Temporal KG forecasting involves ranking all entities of the specific moment and obtaining the preference for prediction.

Temporal KG forecasting based on paths. Knowledge graph embedding associates the entities e ∈ E and relations r ∈ R with vectors e, r. Different from static KGs, the entities in temporal KGs contain time information: an entity may have different attributes at different moments. To better characterize an entity in temporal KGs, we associate each entity e with a specific time label t_i ∈ T, so the entity can be depicted as e^{t_i} and its embedding denoted as e^{t_i}. The set of quadruples directly associated with e_s^{t_i}, which can be defined as the 1-hop paths associated with e_s^{t_i}, can be expressed as P_{(e_s,t_i)} = {(e_s, r, e_j, t_k) | (e_s, r, e_j, t_k) ∈ G_{t_i}}, where e_s, e_j ∈ E, r ∈ R, and t_k < t_i ∈ T. In this way, P_{(e_s,t_i)} represents all associated quadruples. The set of entities directly associated with e_s^{t_q} in the paths P_{(e_s,t_q)}, i.e., the 1-hop neighbors of e_s^{t_q}, can be denoted as N_{(e_s,t_q)} = {e_i^{t_h} | (e_s, r, e_i, t_h) ∈ P_{(e_s,t_q)}}, where e_s, e_i ∈ E, r ∈ R, and t_h < t_q ∈ T.
Given the question (e_s, r_q, ?, t_q), the forecasting task can be depicted as retrieving the entity e_o via path searching. For example, we search the paths with e_s as the starting point: (e_s, r_p^(1), e_1, t_1), (e_1, r_p^(2), e_2, t_2), ..., (e_{i-1}, r_p^(i), e_i, t_i), where r_p^(i) denotes the relation of the i-th hop. Thus, the candidate answers to the question are e_1, e_2, e_3, ..., e_i, with corresponding inference hops 1, 2, 3, ..., i, respectively. Moreover, e_s^(i) and r_q^(i) denote the remaining (or unanswered) subject and relation of the question after the i-th hop, which will be explained in Section 4.3.2.
Uniformly measuring paths of different hops. Uniformly measuring paths of different hops requires models to score paths of different hops according to the same criteria. For example, given the question (e_s, r_q, ?, t_q) and the searched 1-hop path (e_s, r_p, e_1, t_1), suppose the score obtained for this path is f. If we find no path during the first hop, the original question is left to the second hop to solve; thus, the remaining (unanswered) question for the second hop is still (e_s, r_q, ?, t_q). When the path searched at the second hop is also (e_s, r_p, e_1, t_1), the score for the searched path at the second hop should also be f. As this example shows, identical semantics should receive identical scores even at different hops. Moreover, the equal comparison of paths provides the basis for the interpretability of path comparison. This attribute constrains models to have an identical scoring mechanism at each hop, i.e., for neural-network-based models, the separate networks at each hop should share the same parameters. However, only IMR meets this attribute.
Fact matching based on TransE. This paper is the first study to design interpretable evaluation indicators from the perspective of actual semantics. We instantiate IMR to better illustrate the design pathway and thus choose the basic embedding model TransE as the basis of IMR. In TransE, relations are represented as translations in the embedding space. If the triple (e_s, r, e_o) holds in a static KG, TransE [1] assumes the following relationship:

e_s + r ≈ e_o,

where e_s, r, and e_o ∈ R^k, and k denotes the dimension of each vector. For each quadruple (e_s, r_q, e_o, t_q) in temporal KGs, the relation r_q can also be taken as the translation from the subject e_s to the object e_o, i.e., e_s^{t_q} + r_q = e_o^{t_q}. We suppose that when the distance d of a quadruple is smaller, the quadruple is better matched. The distance of the quadruple (e_s, r_q, e_o, t_q) can be expressed as

d = || e_s^{t_q} + r_q − e_o^{t_q} ||.

The relations in KG embedding models indicate the translations between entities, whose specific design determines the complexity of the indicators designed by IMR. The design route of IMR originates from the perspective of reasoning from actual semantics, which is not limited to specific basic models. The consistent basic model of IMR-TransE is TransE, i.e., all of IMR-TransE's specific formulas are based on TransE, which will not be repeated below. To limit the length of the paper, we move the details of IMR-RotatE and IMR-ComplEx to Appendix A.2.
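As an illustration, the TransE quadruple distance can be sketched in a few lines of NumPy; the function name `transe_distance` is ours, not the paper's.

```python
import numpy as np

def transe_distance(e_s, r_q, e_o, p=1):
    """TransE distance of a quadruple: ||e_s + r_q - e_o||_p.
    A smaller distance means the quadruple is better matched."""
    return np.linalg.norm(e_s + r_q - e_o, ord=p)
```

With the L1-norm (p = 1), a perfectly matched quadruple has distance 0.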

IMR: Interpretable Multi-Hop Reasoning
We introduce the Interpretable Multi-Hop Reasoning framework (IMR) in this section. We first provide an overview of IMR in Section 4.1. IMR comprises three modules: the path searching module, query updating module, and path scoring module. The path searching module searches related paths hop by hop from the subjects of questions, involving path sampling and entity clipping, whose motivation and design are presented in Section 4.2. The query updating module calculates the remaining questions hop-by-hop for each path, involving the update of the subject and relations, whose motivation and design are introduced in Section 4.3. The path scoring module designs three indicators: the question matching degree, answer completion level, and path confidence. This module combines three indicators to evaluate each path, whose motivation and design are presented in Section 4.4. We introduce training strategies and the regularizations on state continuity in Section 4.5. IMR conducts uniform path comparisons based on consistent basic models. To better illustrate this framework, we also include the corresponding instance model (IMR-TransE) in Sections 4.3-4.5. The detailed implementations of IMR-RotatE and IMR-ComplEx are included in Appendix A.2.

Framework Overview
We notice that predicting unknown facts based on paths is akin to answering questions, i.e., the question can be answered directly via finding triples with an equal relation or gradually by utilizing the multi-hop equivalent paths. Inspired by this observation, we take the task of link prediction as stepwise question answering. IMR primarily consists of searching for paths hop by hop, updating the remaining questions for each path, and filtering the best answers based on three indicators: the question matching degree, answer completion level, and path confidence.
We show a toy example in Figure 1. Given a question e s , r q , ?, t q and the previous facts G t q , the task of forecasting is predicting the missing object e o . The steps of IMR are as follows.
Step 1: Starting from the subject e_s, we first acquire the associated quadruples P_{(e_s,t_q)}, namely the 1-hop paths. We temporally bias the neighborhood sampling using an exponential distribution over the neighbors [7]; the distribution negatively correlates with the time difference between the node e_s and its neighbors N_{(e_s,t_q)}. Then, we calculate the remaining question (the remaining subject e_s^(1) and the remaining relation r_q^(1)) for each sampled path. Finally, IMR scores the 1-hop paths based on three indicators, which is discussed in Section 4.4.
Step 2: To prevent the path searching from exploding, the model samples the tails of 1-hop paths for the 2-hop path searching. As shown by the pink arrow in Figure 1, the tails of 1-hop paths are clipped according to the scores of the 1-hop paths. For the 2-hop paths searched from the clipped tails, IMR samples the paths negatively correlated with time distances. Then, IMR calculates the remaining question for each 2-hop path (the remaining subject e_s^(2) and the remaining relation r_q^(2)) and scores the 2-hop paths based on three indicators.
Step 3: Rank the scores of 1-hop and 2-hop paths to obtain the preference answer.

Path Searching Module
Inspired by the observation that reasoning based on multi-hop paths is akin to stepwise question answering, this module searches related paths hop by hop from the subjects of questions.
Path sampling. For the path searching from the starting subject e_s^{t_q}, the number of quadruples in P_{(e_s,t_q)} may be very large. To prevent the path searching from exploding, we sample a subset of the paths. In fact, the attributes of entities in temporal KGs may change over time: when t_1 is closer to t_q, the attributes of e_s^{t_1} should be more similar to those of e_s^{t_q}. We verify the correlation between attributes and the time distance in Appendix A.6. Therefore, we are more prone to sample nodes whose time is closer to t_q. In this paper, we employ the time-aware exponentially weighted sampling of xERTR [7], which temporally biases the neighborhood sampling using an exponential distribution over temporal distance.
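The time-aware sampling can be sketched as follows. This is a minimal NumPy version of exponentially weighted neighbor sampling with hypothetical names and a unit decay rate, not xERTR's exact implementation.

```python
import numpy as np

def sample_neighbors(timestamps, t_q, k, rng=None):
    """Sample k neighbor indices without replacement, biased toward
    timestamps close to the question time t_q (all timestamps < t_q).
    Sampling weight decays exponentially with temporal distance."""
    if rng is None:
        rng = np.random.default_rng(0)
    dt = t_q - np.asarray(timestamps, dtype=float)
    weights = np.exp(-dt)               # closer in time -> larger weight
    probs = weights / weights.sum()
    k = min(k, len(timestamps))
    return rng.choice(len(timestamps), size=k, replace=False, p=probs)
```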
Entity pruning. The search for next-hop paths starts from the tails of previous-hop paths, so the number of paths grows exponentially with the number of hops. To avoid the explosion of next-hop path searching, this paper proposes to select the top-K entities for the next-hop search based on the sorted scores of the previous hop.
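Entity pruning then reduces to a top-K selection over the scored tails. A minimal sketch, assuming lower score = better (since IMR's indicators are distances); the function name is ours:

```python
def prune_entities(entity_scores, k):
    """Select the top-K entities for the next-hop search.
    entity_scores: dict mapping entity -> path score, lower is better."""
    return sorted(entity_scores, key=entity_scores.get)[:k]
```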

Query Updating Module
Given a question (e_s, r_q, ?, t_q), there may be few relations directly equivalent to r_q in the temporal KG, so most questions must be inferred through multi-hop paths. In question answering, a complex question can be decomposed into multiple sub-questions, with one sub-question answered at each step. Thus, inference based on multi-hop paths is equivalent to answering complex questions step by step. Moreover, we need to remove the resolved part in order to focus on the remaining question. IMR proposes to update the question according to the last hop and focus on finding the unsolved parts. The query updating module mainly calculates the remaining questions, i.e., the unanswered questions.
The embedding of entities is first introduced in this subsection, followed by the query updating module of IMR-TransE.

Entity Representation
The attributes contained in the entities may change over time. This paper divides the entity embedding at each timestamp into a static representation and a dynamic representation:

e^{t_i} = act(MLP(e_sta || e_dy^{t_i})).

Here, the vector e_sta denotes the static embedding, which captures time-invariant features and global dependencies over the temporal KG. The vector e_dy represents the dynamic embedding of each entity, which changes over time. || denotes the operation of concatenation, MLP(·) denotes a multilayer perceptron (MLP), and act(·) denotes the activation function. We provide more details about e_sta and e_dy in Appendix A.3.
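A minimal sketch of this time-aware representation, with a single linear layer standing in for the MLP and tanh for act(·) (both choices are our assumptions):

```python
import numpy as np

def entity_embedding(e_sta, e_dy, W, b):
    """Time-aware entity embedding: act(MLP(e_sta || e_dy)).
    e_sta: static embedding; e_dy: dynamic embedding at one timestamp;
    W, b: parameters of a one-layer stand-in for the MLP."""
    x = np.concatenate([e_sta, e_dy])   # || is concatenation
    return np.tanh(W @ x + b)
```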

Question Updating
Each path contains a different set of relations. After each hop, the question needs to discard the processed semantics, i.e., to obtain the remaining subject and relation of the question.
Question updating for IMR-TransE. As shown in Figure 1, the subject and relation of the question after the i-th hop are updated based on Equation (5) as follows:

e_s^(i) = e_s^(i−1) + r_p^(i),  r_q^(i) = r_q^(i−1) − r_p^(i),

where the embeddings e_s^(i) and r_q^(i) represent the remaining subject and relation of the question after the i-hop path, respectively. Moreover, e_s^(0) = e_s and r_q^(0) = r_q; r_p^(i) denotes the relation of the i-th hop, and i is the number of hops of each path.
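Under TransE's translation semantics, the update can be sketched as follows: each hop "consumes" its relation from the remaining relation and translates the remaining subject along it. This is our reading of the update rule, not a verbatim copy of the paper's equation.

```python
import numpy as np

def update_question(e_s_prev, r_q_prev, r_p_i):
    """One step of query updating under TransE semantics.
    The remaining subject advances by the hop relation; the remaining
    relation has the hop relation subtracted."""
    e_s_i = e_s_prev + r_p_i
    r_q_i = r_q_prev - r_p_i
    return e_s_i, r_q_i
```

Note that e_s^(i) + r_q^(i) stays equal to e_s + r_q, i.e., the predicted answer position is preserved while the question is consumed hop by hop.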

Path Scoring Module
Figure 2. For the question (Sub, Rel, ?, Tq), we search the 2-hop path (Sub, R1, Obj1, T1), (Obj1, R2, Obj2, T2). The pink box indicates that the original question and the tail of the path are combined into a quadruple to measure the rationality of the searched tails, i.e., the question matching degree f_qmd. The purple box represents the comparison between the question's relation and the path relations to measure the semantic equivalence between the question and the path, i.e., the answer completion level f_ac. The green boxes compare the attributes of the same entities at different timestamps to measure the reliability of the search path, i.e., the path confidence f_pc.
We evaluate the path searching from three perspectives. First, the searched tails should match the original question, which means that the correct tails searched by paths and the question should satisfy the consistent basic embedding model. Second, the ideal path should search for semantics equivalent to the question's relation, not merely for the correct tails; it is necessary to ensure that the path is semantically equivalent to the relation of the question. Finally, considering the particularity of temporal KGs, the attributes of the same entity may change over time. The current sampling strategy for path searching is to sample adjacent-timestamp quadruples of the same entity. When the attribute value of the entity changes significantly over time, it is inappropriate to apply this sampling strategy for the next hop; we need to ensure that the same entity at different timestamps has similar properties within the same path. Accordingly, IMR develops three indicators to measure the rationality of the reasoning path: the question matching degree, answer completion level, and path confidence. Although current methods, such as models based on reinforcement learning, can have complicated designs, their score functions simply belong to a type of question matching degree. We provide a detailed analysis of the correlation between IMR and reinforcement-learning-based models in Appendix A.5.

Question Matching Degree
For the tails found by path searching, we need to measure the matching degree between the tails and the question, i.e., the question matching degree. In fact, the scoring function applied by some traditional reinforcement learning methods is a type of question matching degree. As shown in the yellow box in Figure 2, the question matching degree scores the quadruple constructed from the original question and the searched tail.
Question matching degree for IMR-TransE. The question matching degree f_qmd in IMR-TransE calculates the distance of the constructed quadruple based on TransE [1]. The better the entity matches the question, the smaller the distance of the quadruple will be. The calculation of f_qmd for the i-th hop path is as follows:

f_qmd^(i) = || e_s^{t_q} + r_q − e_p^{t_i,(i)} ||_p,

where the p-norm of a complex vector V is defined as ||V||_p = (Σ_i |V_i|^p)^{1/p}. We use the L1-norm for all indicators in the following.

Answer Completion Level
Among the paths that reach the correct tails, some are unrelated to the semantics of the question. Although these paths can infer the tail, they are invalid due to being semantically unrelated to the question. Therefore, IMR designs an indicator to measure the semantic relevance between the path and the question. The answer completion level f_ac indicates whether the combination of path relations can reflect the relation of the question in semantics. IMR takes the remaining relation of the question as the answer completion level, which is calculated based on the distance between the relations of the path r_p^(1), r_p^(2), ... and the relation r_q. The less of the question's relation that remains, the more complete the answer given by the combination of path relations.
Answer completion level for IMR-TransE. The calculation of f_ac for the i-th hop path in IMR-TransE is as follows:

f_ac^(i) = || r_q^(i) ||_p = || r_q − Σ_{j=1}^{i} r_p^(j) ||_p.

Path Confidence
Path searching is the process of searching for next-hop paths based on the tail of the previous hop. When searching for a path, the current sampling strategy is to sample adjacent-timestamp quadruples of the same entity. However, there are deviations between the same entities at different timestamps in temporal KGs. The premise of this sampling strategy is that path searching is valid only when an entity has similar attributes at different timestamps; when the entity's attributes change significantly over time, an effective next-hop search cannot be performed. The reasoning path is more reliable when the deviations between entities are smaller. IMR therefore designs the path confidence f_pc, i.e., the error between the subject of the updated question e_s^(i) and the tail e_p^{t_i,(i)} of the path with i hops.
Path confidence for IMR-TransE. The calculation of f_pc for the i-th hop path in IMR-TransE is as follows:

f_pc^(i) = || e_s^(i) − e_p^{t_i,(i)} ||_p,

where e_s^(i) represents the remaining subject of the question updated by paths of length i, and e_p^{t_i,(i)} represents the tail reasoned by the i-hop path.
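The three indicators can be computed together for one path. The sketch below follows the TransE instantiation described above, using the sum of hop relations as the consumed part; it is our illustration, not the paper's exact code.

```python
import numpy as np

def path_indicators(e_s, r_q, path_rels, e_tail, p=1):
    """Three IMR indicators for an i-hop path; lower is better for all.
    f_qmd: tail vs. original question  ||e_s + r_q - e_tail||_p
    f_ac:  remaining relation          ||r_q - sum(r_p)||_p
    f_pc:  updated subject vs. tail    ||e_s + sum(r_p) - e_tail||_p"""
    r_path = np.sum(path_rels, axis=0)
    f_qmd = np.linalg.norm(e_s + r_q - e_tail, ord=p)
    f_ac = np.linalg.norm(r_q - r_path, ord=p)
    f_pc = np.linalg.norm(e_s + r_path - e_tail, ord=p)
    return f_qmd, f_ac, f_pc
```

A path whose relations exactly compose the question's relation and land on the right tail scores 0 on all three indicators.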

Combination of Scores
IMR merges the indicators with positive weights to obtain the final score of each path, i.e., f = w_qmd · f_qmd + w_ac · f_ac + w_pc · f_pc, where w_qmd, w_ac, w_pc ∈ R^+.
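The weighted combination itself is a one-liner; the positivity check mirrors the constraint w ∈ R^+:

```python
def combine_indicators(f_qmd, f_ac, f_pc, w_qmd=1.0, w_ac=1.0, w_pc=1.0):
    """Final path score as a positive-weighted sum of the three indicators."""
    assert w_qmd > 0 and w_ac > 0 and w_pc > 0, "weights must be positive"
    return w_qmd * f_qmd + w_ac * f_ac + w_pc * f_pc
```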
Entity aggregation for IMR. Considering that the searched paths may lead to entities with different timestamps, IMR adopts a specific aggregation for the searched entities. First, the same entity with the same timestamp may be inferred by different paths, so IMR needs to combine the scores of each (entity, timestamp) pair. Considering that only one path matches the question best, IMR applies max aggregation over the various paths reaching the same entity with the same timestamp. Moreover, different paths may infer the same entity with different timestamps; IMR performs average aggregation on the scores of an entity across its timestamps. Finally, IMR obtains the score of each entity at the question timestamp.
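The two-stage aggregation (max over paths per timestamp, then mean over timestamps) can be sketched as follows, assuming here that higher scores are better (e.g., negated distances); the data layout is our own.

```python
from collections import defaultdict

def aggregate_entity_scores(path_scores):
    """path_scores: iterable of (entity, timestamp, score), higher = better.
    Step 1: max over paths reaching the same (entity, timestamp).
    Step 2: mean over timestamps of the same entity."""
    per_et = {}
    for e, t, s in path_scores:
        per_et[(e, t)] = max(s, per_et.get((e, t), float('-inf')))
    per_e = defaultdict(list)
    for (e, _t), s in per_et.items():
        per_e[e].append(s)
    return {e: sum(v) / len(v) for e, v in per_e.items()}
```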

Learning
We utilize binary cross-entropy as the loss function, which is defined as

L = − Σ_{q ∈ Q} Σ_{e_i ∈ ε_q^p} [ y_{e_i,q} log f_{e_i,q} + (1 − y_{e_i,q}) log(1 − f_{e_i,q}) ],

where ε_q^p represents the set of entities reasoned by the selected paths, y_{e_i,q} represents the binary label that indicates whether e_i is the answer for q, Q represents the training set, and f_{e_i,q} denotes the score obtained in Section 4.4.4 for each path. We jointly learn the embeddings and the other model parameters by end-to-end training.
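A minimal sketch of the loss over one question's candidate set, applying a sigmoid to turn raw scores into probabilities (the normalization is our assumption):

```python
import numpy as np

def bce_loss(scores, labels):
    """Binary cross-entropy over candidate entities.
    scores: raw scores f_{e_i,q}; labels: 0/1 answer indicators y_{e_i,q}."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))
    y = np.asarray(labels, dtype=float)
    eps = 1e-12                       # numerical safety for log
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
```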
Regularization. For the same entity with different timestamps, the closer its time distance is, the closer its dynamic embedding is [32]. IMR proposes the regularization on continuity for the dynamic vectors of entities.
The specific regularization for IMR is as follows:

L_reg = Σ_k Σ_j ( || e_k^{t_j} − e_k^{t_{j−1}} ||_p + || e_k^{t_j} − e_k^{t_{j+1}} ||_p ),

where e_k^{t_j} denotes the dynamic embedding of the k-th entity at the j-th timestamp, and e_k^{t_{j−1}} and e_k^{t_{j+1}} denote the dynamic embeddings at the previous and later timestamps relative to e_k^{t_j}, respectively. ||·||_p denotes the p-norm of a vector, and we take p as 1 in this paper.
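A sketch of the continuity penalty for a single entity, summing p-norm differences between adjacent timestamps (the function name is ours):

```python
import numpy as np

def continuity_reg(dyn_emb, p=1):
    """Temporal-continuity regularizer for one entity.
    dyn_emb: array of shape (num_timestamps, dim) holding the dynamic
    embeddings at consecutive timestamps; penalizes adjacent differences."""
    diffs = dyn_emb[1:] - dyn_emb[:-1]
    return float(np.sum(np.linalg.norm(diffs, ord=p, axis=1)))
```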

Datasets and Baselines
To evaluate the proposed model, we consider standard temporal KG datasets: the Integrated Crisis Early Warning System (ICEWS) [33], WIKI [34], and YAGO [35]. The ICEWS dataset contains information about political events with time annotations. We select two subsets of the ICEWS dataset, i.e., ICEWS14 and ICEWS18, containing event facts in 2014 and 2018, respectively. WIKI and YAGO are temporal KGs with year-level facts; YAGO fuses information from Wikipedia with WordNet [36]. Following the experimental settings of HyTE [37], we deal with year-level granularity by dropping the month and date information. We compare IMR and baseline methods by performing the temporal KG forecasting task on ICEWS14, ICEWS18, WIKI, and YAGO. Details of these datasets are listed in Table 1. We adopt the same dataset split strategy as in [38]. We compare the performance of IMR-TransE against the temporal KG reasoning models TTransE [34], TA-DistMult/TA-TransE [30], DE-SimplE [39], TNTComplEx [32], CyGNet [11], RE-Net [38], TANGO [40], TITer [14], and xERTR [7].
In the experiments, the widely used Mean Reciprocal Rank (MRR) and Hits@1/3/10 are employed as the metrics. The filtered setting for static KGs is not suitable for the forecasting task under the extrapolation setting, as mentioned in xERTR [7]. This paper adopts the time-aware filtering scheme, which only filters out genuine triples at the question time.

Main results. Tables 2 and 3 show the comparison between IMR-TransE, IMR-RotatE, IMR-ComplEx, and other baseline models on ICEWS, WIKI, and YAGO. Overall, the instantiated models of IMR outperform the baseline models on all metrics while being more interpretable, which convincingly verifies their effectiveness. Owing to the limited paper length, a detailed analysis of interpretability is provided in Appendix A.1. Compared to the best baseline (TITer), IMR-TransE obtains relative improvements of 3.3% and 2.5% in MRR and Hits@1, averaged over ICEWS, WIKI, and YAGO. Moreover, different IMR instantiations achieve the best performance on different datasets, owing to their basic models.

Comparison of multi-hop paths. Figure 3 shows the performance of IMR-TransE on ICEWS, WIKI, and YAGO as the maximum path length increases. Performance generally continues to rise with increasing path length. However, as the maximum path length increases, the performance on ICEWS18 hardly improves. Further analysis of ICEWS18 in [3] explains that there are no strong dependencies between the question's relation and multi-hop paths; thus, longer paths provide little gain for inference.
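Given the (time-aware) filtered rank of each true entity, the reported metrics reduce to the following computation:

```python
import numpy as np

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """MRR and Hits@k from the filtered ranks of the ground-truth entities."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))
    hits = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, hits
```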

Moreover, as the maximum path length increases, the number of inference paths grows exponentially, and the many invalid paths will suppress the performance of IMR-TransE. To ensure that performance does not decrease, we propose to control the sampling number of next-hop paths, limiting the total number of multi-hop paths and suppressing the impact of noisy samples. This paper sets the number of next-hop samples to 5. In summary, the experiments show that the unified indicators designed by IMR based on consistent basic models can uniformly measure the paths of different hops, allowing better reasoning over paths with different hops, which verifies the claim in Section 4.4. We present an additional ablation study on the three indicators of IMR-TransE in Appendix A.4.

Conclusions
We propose an Interpretable Multi-Hop Reasoning framework for temporal KG forecasting tasks. IMR transforms reasoning based on path searching into stepwise question answering based on consistent basic models. Moreover, IMR develops three indicators to measure the answer and reasoning paths, and this is the first study to develop interpretable evaluation indicators from the perspective of actual semantics for the temporal KG forecasting task. IMR can measure the paths of different hops according to the same criteria and be more explainable. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method. In the future, we plan to enhance the prediction by integrating different paths reaching the same tail, which will be more effective and interpretable. We will also continue to explore the models based on GAT [3] for temporal KG forecasting tasks.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A.1. Case Studies and Interpretability
For the question (John Kerry, Make a visit, ?, 2014-11-11), we extract some of the paths for the case study in Table A1. The lower the scores or indicators in Table A1, the better the performance of the path. We compare the paths based on the total score, analyze various aspects of the paths based on detailed indicators, and verify the interpretation of the model with actual semantics.
The first block of Table A1 selects reasoning paths with the same objects to analyze the answer completion level. First, we compare path 1-1 and path 1-2. The score of path 1-1 is lower than that of path 1-2. Analyzing the three indicators further, we find that the answer completion level of path 1-1 is smaller than that of path 1-2, which indicates that the relation of path 1-1 should be closer to the relation of the question. In fact, path 1-1 has the same relation as the question, so it is indeed closer to the question relation than path 1-2; the actual semantics thus verify the interpretation of the model. Comparing path 1-4 and path 1-5, we find that the total score of path 1-4 is lower than that of path 1-5, and the answer completion level of path 1-5 is higher than that of path 1-4. IMR thus indicates that the combination of reasoning relations of path 1-4 is better than that of path 1-5. In fact, neither path seems particularly appropriate to the question; nevertheless, the combination of relations in path 1-4 matches the question (John Kerry, Make a visit, ?, 2014-11-11) better than that of path 1-5. The second block of Table A1 selects paths with the same reasoning relations to verify the path confidence and the question matching degree. Comparing paths 2-1, 2-2, 2-3, and 2-4, we observe that the scores of the paths are increasing, and the path confidence of these four paths is also growing. In fact, the time distance between the paths and the question gradually increases, which means that the reliability of the paths gradually decreases; the reliability indicated by the path confidence is consistent with the actual reliability. Similarly, we find that the path confidence of path 2-5 is higher than that of path 2-6, indicating that path 2-5 is less reliable.
The actual situation is that the timestamp of path 2-5 (2014-11-03, compared with 2014-11-05) is farther from the timestamp of the question, which is consistent with the explanation. Comparing path 2-9 with paths 2-7 and 2-8, respectively, the model further infers that the path confidence and question matching degree of path 2-9 are better than those of the other two paths. The actual timestamp error with respect to the question satisfies path 2-7 > path 2-9 > path 2-8. This is because the question matching degree covers the path confidence: the path confidence also contains the error of the triple in the training dataset, and this triple error outweighs the error caused by the different timestamps, which makes path 2-9 more reliable than path 2-7. In general, the second set of experiments illustrates that the path confidence can effectively indicate the validity of each path.
In the third block of Table A1, we randomly select paths, explain them based on the three indicators, and verify the explanations against the actual situation. We first sort three paths by the answer completion level: path 3-1 < path 3-2 < path 3-3. Therefore, the semantic similarity between the relations of the three paths and the question should satisfy path 3-3 > path 3-2 > path 3-1. The actual semantic similarity between the relations of the paths and that of the question satisfies Make a visit > Express intent to meet or negotiate > Meet at a 'third' location, which is consistent with the interpretation of IMR. Sorting the three paths by path confidence gives path 3-1 < path 3-2 < path 3-3, so the reliability of the three inference paths should satisfy path 3-1 < path 3-2 < path 3-3. We observe that the time distance between the three paths and the question gradually increases, which verifies the explanation given by the path confidence. The analysis of paths 3-4 to 3-6 is similar. The case studies show that IMR can provide reasoning paths and offer a valid basis for path comparison.

Appendix A.2. Details on IMR-RotatE and IMR-ComplEx
Appendix A.2.1. IMR-RotatE
RotatE. RotatE [17] defines each relation as a rotation from head entities to tail entities in a complex vector space. Given a triple (h, r, t), we expect t = h • r, where h, r, t ∈ C^k are the embeddings, the modulus of each dimension of the relation satisfies |r_i| = 1, and • denotes the Hadamard product. The score function for (e_s, r_q, e_o, t_q) is
f(e_s, r_q, e_o, t_q) = ||e_s^{t_q} • r_q − e_o^{t_q}||,
where e_s^{t_q}, r_q, e_o^{t_q} ∈ C^k and |r_{q,i}| = 1.
Question updating for IMR-RotatE. The question is updated as in IMR-TransE, except that the translation operation of TransE is replaced by the corresponding rotation in the complex space.
Question matching degree for IMR-RotatE. The question matching degree f_qmd in IMR-RotatE calculates the distance of the constructed quadruple based on RotatE [17]. The better the entity matches the question, the smaller the distance of the quadruple will be. The calculation of f_qmd for an i-hop path is as follows.
where the p-norm of a complex vector V is defined as ||V||_p = (Σ_i |V_i|^p)^{1/p}. We use the L1 norm for all indicators in the following.
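For illustration, the RotatE-style quadruple distance with the L1 norm can be sketched with complex numpy arrays; the toy embeddings and the function name are ours, not the paper's implementation.

```python
import numpy as np

def rotate_distance(head, rel_phase, tail, p=1):
    """RotatE-style distance ||h • r - t||_p in complex space.
    head, tail: complex vectors; rel_phase: real phase angles, so the
    relation r = exp(i * rel_phase) has unit modulus in every dimension.
    """
    r = np.exp(1j * rel_phase)          # |r_i| = 1 for each dimension
    residual = head * r - tail          # Hadamard rotation, then difference
    return np.sum(np.abs(residual) ** p) ** (1.0 / p)

# A relation that exactly rotates the head onto the tail scores (near) zero.
h = np.array([1 + 0j, 0 + 1j])
phase = np.array([np.pi / 2, np.pi])    # rotate by 90 and 180 degrees
t = h * np.exp(1j * phase)
```

A perturbed tail yields a strictly positive distance, so smaller values indicate a better match, consistent with the indicator's definition.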
Answer completion level for IMR-RotatE. The calculation of f_ac for an i-hop path in IMR-RotatE is as follows.
Path confidence for IMR-RotatE. The calculation of f_pc for an i-hop path in IMR-RotatE is as follows.
where e_q(i) represents the remaining subject of the question updated by paths of length i, and e_{t_i}^p(i) represents the tail reasoned by the i-hop paths.

Appendix A.3. Entity Representation
We denote the static embedding of entity e_k by e_{sta-k} ∈ R^d, a vector independent of time. IMR-TransE adopts the static embedding of xERTR [41]. xERTR [41] proposes a generic time encoding to generate the time-variant part of entity representations, denoted as Φ(t).
Φ(t) = √(1/d) [cos(ω_1 t + φ_1), . . . , cos(ω_d t + φ_d)],
where ω_i and φ_i, i = 1, 2, . . . , d, denote the frequencies and phase shifts of the time encoding, respectively. Employing this time encoding, quadruples with the same subject, predicate, and object can have different attention scores; specifically, quadruples that occurred recently tend to have higher attention scores, which makes the embedding more interpretable and effective. However, the assumption that attribute deviation grows with time deviation is only a statistical regularity; it is the semantic attributes of entities that ultimately determine the reasoning. To avoid being affected only by time factors, we propose a new time-specific entity representation Ψ_k(t) ∈ R^d, i.e., each entity has a different representation at different timestamps. If each entity had a distinct representation at every moment, this would consume enormous resources. As most entities are observed only at limited timestamps, this paper characterizes entities at the timestamps that appear in the training dataset. For timestamps missing from the training dataset, IMR represents an entity with its embedding at the timestamp of its most recent occurrence in the training dataset. Moreover, we apply a regularization on time continuity to avoid over-fitting caused by too many parameters; this regularization assumes that temporally adjacent representations of an entity should be close, as described in Section 4.5. Finally, we combine Φ(t) and Ψ_k(t) to construct e_{dy-k}^t ∈ R^{2d}.
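A minimal sketch of this generic time encoding follows; the frequencies and phase shifts are learned parameters in the actual model, whereas here they are random placeholders.

```python
import numpy as np

def time_encoding(t, omega, phi):
    """Generic functional time encoding Phi(t) = sqrt(1/d) * cos(omega*t + phi).
    omega, phi: length-d vectors of frequencies and phase shifts
    (learned in the actual model; fixed here for illustration).
    """
    d = omega.shape[0]
    return np.sqrt(1.0 / d) * np.cos(omega * t + phi)

rng = np.random.default_rng(0)
d = 8
omega = rng.normal(size=d)
phi = rng.normal(size=d)

# Different timestamps map to different d-dimensional encodings.
phi_t1 = time_encoding(1.0, omega, phi)
phi_t2 = time_encoding(2.0, omega, phi)
```

Because each component is a bounded cosine, the encoding varies smoothly with t, which is what allows nearby timestamps to receive similar attention scores.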
In summary, the embedding of each entity e_k^t combines its static and dynamic parts, e_k^t = [e_{sta-k}; e_{dy-k}^t] ∈ R^{3d}. The entities' timestamps in actual datasets are sparse; e.g., ICEWS14 and YAGO have only 11 and 21 timestamps per entity on average, respectively. In view of the huge memory usage, we reduce the parameters with basis vectors in the actual implementation: the entities' dynamic embeddings are linear combinations of 50 shared vectors. Table A3 shows the memory usage in the ablation experiments on entity-time-specific embeddings.
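The basis-vector reduction can be sketched as follows; the sizes are illustrative, and in the actual model the coefficients would be indexed per entity and observed timestamp rather than per entity only.

```python
import numpy as np

n_entities, n_basis, dim = 1000, 50, 64

rng = np.random.default_rng(1)
# 50 shared basis vectors (learned parameters in the actual model).
basis = rng.normal(size=(n_basis, dim))
# Per-entity mixing coefficients over the shared basis.
coeffs = rng.normal(size=(n_entities, n_basis))

# Dynamic embeddings are linear combinations of the shared basis vectors.
dynamic_emb = coeffs @ basis            # shape: (n_entities, dim)

# Parameter count: coefficients plus basis, instead of a full embedding table.
params_basis = coeffs.size + basis.size
params_full = n_entities * dim
```

With these toy sizes the basis parameterization already stores fewer parameters than a full table, and the saving grows with the number of entity-timestamp pairs.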

Appendix A.4. Combination of Indicators
The three indicators measure different aspects of a path: the matching degree between answers and the question, the completeness of relational equivalence, and the reliability of the reasoning paths. We verify the contribution of each indicator through ablation experiments. As shown in Tables A4 and A5, the first block displays the performance with only one indicator, the second block presents the performance with combinations of two indicators, and the last block uses all three indicators. The bottom line shows the gap between the three-indicator combination and the best result. Since the distributions vary across datasets, there are certain differences in performance when employing a single indicator to rank paths. The performance improves significantly after combining the indicators in pairs, although a few differences remain. By combining all three indicators, IMR-TransE obtains the best inference performance on most datasets. In summary, the experiments illustrate that the combination of the three indicators designed by IMR-TransE can effectively measure the reasoning paths. The above experiments suggest that two indicators could suffice for IMR-TransE. However, IMR can be instantiated on other basic models; for example, the performance of IMR-RotatE with any two indicators varies considerably. Thus, we retain all three indicators for the best performance.
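For illustration, ranking paths by a combination of the three indicators might be sketched as follows; the equal weights are an assumption rather than the paper's tuned setup, and lower scores are better, as in Table A1.

```python
def total_path_score(f_qmd, f_ac, f_pc, weights=(1.0, 1.0, 1.0)):
    """Combine the three indicators into one path score (lower is better).
    Equal weights are an illustrative choice, not the paper's setting.
    """
    w_qmd, w_ac, w_pc = weights
    return w_qmd * f_qmd + w_ac * f_ac + w_pc * f_pc

def rank_paths(paths):
    """paths: list of (name, f_qmd, f_ac, f_pc) tuples.
    Returns path names ordered from best (lowest score) to worst.
    """
    ordered = sorted(paths, key=lambda p: total_path_score(*p[1:]))
    return [name for name, *_ in ordered]
```

Dropping one indicator corresponds to setting its weight to zero, which is exactly the single- and pairwise-indicator settings of the ablation.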

Appendix A.5. Correlation between IMR and Other Models
Correlation between IMR and PTransE. Both IMR and PTransE consider measuring the semantic equivalence between relations. PTransE resembles an ensemble, combining the scores of relations and triples from different models. In contrast, the indicators of IMR are based on a unified theoretical model (such as TransE or RotatE), which can effectively combine different paths; IMR can truly measure paths of different hops under the same criteria. Moreover, IMR further designs the path confidence for temporal attributes.
Correlation between IMR and reinforcement-learning-based models. First, reinforcement learning models are black-box models that cannot explain the basis of their judgments. Moreover, reinforcement learning relies on rewards, which are essentially a measure of the matching degree between tails and the question. This end-to-end design corresponds to the question matching degree in IMR, but in an unexplainable and complicated form.
Moreover, IMR is the first to design indicators from the perspective of actual semantics, so we select basic embedding models as the foundation of IMR to better illustrate the reasoning paths. The modeling of triples in TransE is elementary, so the formulas of the indicators are simple. Compared with complex greedy algorithms, the design of IMR may appear too simple; nevertheless, the simple IMR-TransE achieves better performance than reinforcement learning models such as TiTer, and instantiations of IMR with more complex indicators can perform even better.
Finally, the other indicators of IMR must be designed on consistent basic models (such as RotatE). Since current reinforcement learning models are commonly built on multilayer networks, the other two indicators cannot be designed on top of them.

Appendix A.6. Correlation between Path Confidence and Time Distance in IMR-TransE
The current sampling strategy assumes that the greater the time distance of the same entity, the greater the deviation of its semantic properties. Therefore, IMR adopts a time-aware negative sampling strategy to search for more effective paths. Path reliability is affected by semantic similarity, and the negative correlation with time distance is a general, statistical tendency. IMR proposes the path confidence to better measure the reliability of the searched paths. Here, we utilize the path confidence of the same path with different timestamps to analyze how semantic similarity changes over time. We randomly select 20 questions for the path search; for each question, we select the same path with ten different timestamps and calculate its path confidence. Figure A1 shows how the path confidence of each path changes with the time distance: as the time distance between the paths and the question increases, the path confidence score gradually increases, indicating that the confidence gradually decreases. The experiments show that the semantic deviation of the same entity increases with the time distance, which verifies the rationality of the time-aware negative exponential sampling.
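A minimal sketch of such time-aware negative exponential sampling follows; the decay rate lam and the sampling-with-replacement simplification are illustrative choices, not the paper's exact procedure.

```python
import math
import random

def sample_next_hop(candidates, t_query, lam=0.1, k=5, seed=0):
    """Sample up to k next-hop edges, weighting each candidate by
    exp(-lam * |t_query - t_edge|), so temporally closer edges are
    preferred. candidates: list of (edge, timestamp) pairs.
    Sampling is done with replacement for simplicity.
    """
    rng = random.Random(seed)
    weights = [math.exp(-lam * abs(t_query - t)) for _, t in candidates]
    k = min(k, len(candidates))
    return rng.choices([edge for edge, _ in candidates], weights=weights, k=k)

# Toy example: ten candidate edges at increasing time distance from t=100.
cands = [(f"e{i}", 100 - i) for i in range(10)]
picked = sample_next_hop(cands, t_query=100, k=5)
```

The exponential weight makes an edge one step away roughly e^(0.9) times more likely than one ten steps away at lam=0.1, matching the assumption that semantic deviation grows with time distance.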