A Two-Stage Framework for Directed Hypergraph Link Prediction

Abstract: Hypergraphs, as a special type of graph, can be leveraged to better model relationships among multiple entities. In this article, we focus on the task of hyperlink prediction in directed hypergraphs, which finds a wide spectrum of applications in knowledge graphs, chem-informatics, bio-informatics, etc. Existing methods handling the task overlook the order constraints of the hyperlink’s direction and fail to exploit features of all entities covered by a hyperlink. To make up for the deficiency, we present a performant pipelined model, i


Introduction
Link prediction helps enrich the relations in graph-structured data [1] and has aroused interest from both academia and industry. Existing research mainly focuses on simple graphs, where a link (also known as a relation) associates two nodes (also known as entities), while some real-world relations involve more than two entities, such as chemical reactions [2], co-authorship relations [3], and social networks [4]. Thus, the hyperlink is coined to model such relations, as shown in Figure 1, and a graph comprised of hyperlinks is defined as a hypergraph [5].
As the relations among entities are sophisticated, constructing a hypergraph is time-consuming and hence expensive, which makes its incompleteness more severe than that of a simple graph. To mitigate the problem, the hyperlink prediction task was introduced [6]. Similar to link prediction in simple graphs, the task tries to complete the missing hyperlinks in a given hypergraph. As shown in Figure 1, given several entities, e.g., NYC, New York City, The Big Apple, USA, and The United States, the target of hyperlink prediction is to determine whether there is a hyperlink and, if one exists, what it is (i.e., "Located In"). Furthermore, the directivity of hyperlinks also matters in some practical applications. Thus, the machine should also acquire the ability to predict the direction of the hyperlink to form the final answer, i.e., NYC, New York City, The Big Apple $\xrightarrow{\text{Located In}}$ USA, The United States.

Figure 1. Sketch of two types of hypergraphs. The diagram on the left represents an undirected hypergraph, while the diagram on the right represents a directed hypergraph. Each ellipse denotes a hyperlink; entities in the same ellipse share the same hyperlink. Arrows denote the direction of a hyperlink.

Example 1. Consider the bottom ellipse in green in
To approach this task, current studies mainly fall into two categories. (1) Translation-based models try to generalize the translation constraint in simple graphs to hypergraphs, e.g., m-TransH [7], RAE [8], and NHP [9]. m-TransH directly extends TransH [10] from binary relations to the n-ary case, and RAE further integrates m-TransH with a multi-layer perceptron (MLP) that accounts for the relatedness of entities. Since they use the sum after projection as the scoring function, a change of some entities in a hyperlink may not be clearly reflected in the score. (2) Neural-network-based models exploit structural information of hypergraphs, e.g., NaLP [11], HGNN [12], and HyperGCN [13]. These methods design graph neural networks (GNNs) to absorb neighbouring features to improve entities' representations. As GNNs usually incorporate a large number of parameters, sufficient learning relies on a large amount of training samples.
Although it has attracted attention, hyperlink prediction remains notoriously challenging, since existing studies neglect the core aspects of the task. First, an accurate record of facts in a hypergraph sometimes necessitates the direction of hyperlinks. For a directed hyperlink, the entities can be divided into two parts, head and tail, based on the hyperlink's direction. The order of the two parts matters; in contrast, the specific order within each part is insignificant. As shown in Figure 1, without the arrow, we cannot figure out how these entities construct the relation "Located In". In addition, NYC, New York City, and The Big Apple (the head) should come before USA and The United States (the tail), but the order inside the head or tail does not affect the determination. Nevertheless, existing methods mainly focus on undirected hyperlinks. The only exception, NHP, averages the entity embeddings generated by GCN [14] to calculate a score for inferring the hyperlink direction, which is too rudimentary to embody the direction's features. Second, as a hyperlink contains more than two entities, each entity contributes to the final existence prediction. In this light, a good representation model needs to consider the representations of all the individual entities involved in a hyperlink when making a determination. However, current embedding treatments tend to apply a simple sum or average strategy. This might be insensitive to the number of entities in a hyperlink, since a dominant entity could overwhelm the other entities' expressions. Last but not least, as annotating hyperlinks is sometimes complicated even for a human being, training data are scarce and can be insufficient to train a large number of learnable parameters well.
To address these challenges, we propose a simple yet effective model, a Two-stage Framework for Directed Hyperlink Prediction, namely, TF-DHP. The model is expected to consider each entity's contribution to the formation of hyperlinks equally and to emphasize not only the fixed order between the two parts but also the randomness inside each part. It conceives a pipeline of two tailored modules: a Tucker decomposition-based module for hyperlink prediction and a BiLSTM-based module for direction inference.
For predicting the existence of hyperlinks, we exploit Tucker decomposition to model hyperlinks, which, to the best of our knowledge, has so far been applied only to simple graphs [15] and not to hypergraphs. In particular, instead of applying third-order Tucker decomposition over simple graphs, we employ high-order Tucker decomposition for hypergraphs. It produces a core tensor, which represents the degree of interaction between entities. Then, we devise a scoring function by the mode product of the tensor with each entity representation, which evaluates the existence of hyperlinks. We theoretically show that the score is invariant to the order of the mode products with entities, even though each hyperlink has a direction. In addition, the tensors from Tucker decomposition are usually of very high order, which can bring about high computational complexity. To mitigate the issue, we further introduce Tensor Ring (TR) [16] decomposition to decompose higher-order tensors into mode products of several third-order tensors, which effectively reduces the computational cost.
For inferring directions, we first recall the example in Figure 1. Once USA and The United States are determined as the tail entities, the entities in the head are implied, and if one of the tail entities changes, the head entities are going to be different. Thus, it is important for the model to pass information between the two parts both forward and backward. This motivates us to design a model that works bidirectionally, and BiLSTM [17] is utilized as the base model. In addition, the position of entities within the head (or tail) part is insignificant, and hence it is necessary to train the model to attend only to the order of the two parts. To this end, we keep the order of the two parts but randomly shuffle entities within each part, which forces the model to ignore entity positions within the head (or tail) part while remaining attentive to the order between the two parts. In this way, the data scale is increased as a by-product, alleviating the lack of data.

Contribution.
In summary, we make the following contributions:
• For existence prediction, we propose, among the first, to generalize Tucker decomposition to a high dimension and introduce a tensor ring algorithm to reduce the model complexity. We theoretically prove that the mode product for scoring a hyperlink is invariant to the order of participating entities.
• For direction inference, we conceive a BiLSTM-based model that can take information into consideration both forward and backward with respect to a hyperlink. A data shuffling strategy is further incorporated to force the model to ignore entity positions within the head (or tail) part while remaining attentive to the order between the two parts.
• The modules constitute a new model, namely, TF-DHP, for predicting directed hyperlinks. Through experiments on several real-world datasets, we confirm the superiority of TF-DHP over state-of-the-art models.
Organization. The rest of the article is structured as follows. Section 2 introduces related work and Section 3 provides a detailed account of TF-DHP. Section 4 reports the experimental setup and analyses the experimental results. Section 5 concludes the paper.

Related Work
In this section, we are going to review related work in link prediction on simple graphs, undirected hypergraphs and directed hypergraphs.

Link Prediction on Simple Graph
Most link prediction methods on simple graphs can be divided into three categories: linear mathematics models, non-linear convolutional models, and random walk models.
Many linear mathematics approaches to link prediction have been proposed in recent years, such as RESCAL [18], DistMult [19], ComplEx [20], and SimplE [21]. RESCAL, which is based on tensor factorization, performs collective learning via the latent components of the model and provides an efficient algorithm to compute the factorization. DistMult is a special case of RESCAL with a diagonal matrix per relation, which reduces overfitting, while ComplEx extends DistMult to the complex domain. SimplE is based on Canonical Polyadic (CP) decomposition, in which subject and object entity embeddings for the same entity are independent. TuckER [15] is a straightforward but powerful model based on the Tucker decomposition; it treats the core tensor as the parameter tensor, and the scoring function is defined by taking the mode product between the entity embedding vectors, the relation embedding vector, and the core tensor. Because using high-order tensors to define parameters greatly reduces the information loss in the calculation process, TuckER has been shown to be the best-performing linear mathematics model for link prediction on simple graphs.
Typical works of non-linear convolutional models are ConvE [22] and HypER [23]. ConvE is a simple multi-layer convolutional architecture for link prediction and is defined by a single convolution layer, a projection layer to the embedding dimension, and an inner product layer. HypER's hypernetwork generates relation-specific filters, and thus extracts relation-specific features from the subject entity embedding. It necessitates no 2D reshaping and allows entity and relation to interact more completely, rather than only around the concatenation boundary.
LRW [24], MIRW [25], and MLRW [26] are random walk-based models for link prediction on complex networks. LRW performs a pure random walk and selects the destination entities at random. To improve on LRW, the concept of asymmetric mutual influence of entities is introduced: the walker selects the next entity according to its effect on the current entity and thus selects more efficient paths for the next step. Therefore, entities with a more significant structural similarity obtain a higher score in the resulting algorithm, MIRW. MLRW provides a framework to extend the local random walk method to multiplex networks, so that the intra-layer and inter-layer information present in the network can be exploited to increase the accuracy of link prediction.

Link Prediction on Undirected Hypergraph
Work on undirected hyperlink prediction can be divided into two categories, i.e., translation-based models and neural network-based models.
Representative translation-based approaches are m-TransH [7] and RAE [8]. m-TransH generalizes TransH [10] to the case of n-order relations; it projects entities onto the relation-specific hyperplane and defines the scoring function as the weighted sum of the projection results. RAE considers the possibility of common occurrence between entities in n-order relations, establishes a correlation model through an MLP, and reflects it in the scoring function. Since these models are extended from binary models, their restrictions on the representation of relations carry over to the representation of n-order relations.
NaLP [11], HGNN [12], HyperGCN [13], and Hyper-SAGNN [27] are representative neural network-based approaches. HGNN is a general hypergraph neural network framework based on the hypergraph convolution operation, which can incorporate multi-modal data and complicated data correlations. HyperGCN proposes a new method of training a GCN on hypergraphs using tools from the spectral theory of hypergraphs and applies the method to hypergraph-based semi-supervised learning (SSL) and combinatorial optimization on real-world hypergraphs. Hyper-SAGNN develops a new self-attention-based graph neural network applicable to homogeneous and heterogeneous hypergraphs with variable hyperlink sizes.

Link Prediction on Directed Hypergraphs
Research on link prediction on directed hypergraphs is not yet mature, and most methods predict the direction of a hyperlink after predicting the entities contained in it. The NHP [9] model sets up two scoring functions on top of GCN to predict hyperlinks and their directions; it divides a hyperlink into two sub-hyperlinks and uses their embedding vectors to compute the scoring function for the direction. However, as the embedding vectors of hyperlinks are averages of entity embedding vectors, information about entities and their positions is lost, which makes the performance of the model barely satisfactory.

Method
This section formalizes the task of directed hypergraph link prediction and presents the proposed method, including the framework and module details. The notations used in the text are defined in Table 1.

Symbol | Definition
$U^{(n)}$ | n-mode factor matrix, $U^{(n)} \in \mathbb{R}^{I_n \times J_n}$
$u^{(n)}_{j_n}$ | $j_n$-th column vector of $U^{(n)}$
$\phi(\cdot)$ | scoring function of the existence of hyperlinks
$r$ | relation embedding of hyperlink
$v_m$ | embedding of entities
$\times_n$ | tensor n-mode product
$Z_k(i_k)$ | $i_k$-th lateral slice matrix of TR origin tensor
$\mathrm{Trace}(\cdot)$ | matrix trace operator
$\oplus$ | concatenating operation for hidden layers
$\bullet$ | vector outer product

Task Description
A directed hypergraph is an ordered pair $H = (V, E)$, where $V = \{v_1, \ldots, v_l\}$ denotes a set of entities and $l$ is the number of entities. $E$ is a set of directed hyperlinks. Each element of $E$ can be divided into two components $(h, t)$, where $h$ (resp. $t$) serves as the head (resp. tail), with the direction being from the head to the tail.
The directed hyperlink prediction task aims to predict the missing hyperlinks, including their existence and associated direction, based on the relevance of the given entities. Take the relational knowledge in Figure 1 as an instance. The entities in each relation form $V$, and their corresponding relations form the directed hyperlinks $E$. Every sample in the dataset contains a variable number of entities. We have to determine whether they can support a relation and which component each entity belongs to.
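As a concrete illustration of this definition, the following minimal Python sketch stores a directed hyperlink as a pair of unordered sets, so that the order inside the head or tail is irrelevant while the order between them is not. The class name `DirectedHyperlink` and its fields are hypothetical and chosen only for this example; they are not part of TF-DHP.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DirectedHyperlink:
    """Hypothetical container: an unordered head set pointing to an unordered tail set."""
    relation: str
    head: FrozenSet[str]
    tail: FrozenSet[str]

    def entities(self) -> FrozenSet[str]:
        """All entities covered by the hyperlink, ignoring direction."""
        return self.head | self.tail


located_in = DirectedHyperlink(
    relation="Located In",
    head=frozenset({"NYC", "New York City", "The Big Apple"}),
    tail=frozenset({"USA", "The United States"}),
)

# Same head and tail, listed in a different order: still the same hyperlink.
same_link = DirectedHyperlink(
    relation="Located In",
    head=frozenset({"The Big Apple", "NYC", "New York City"}),
    tail=frozenset({"The United States", "USA"}),
)

# Swapping head and tail yields a different (reversed) hyperlink.
reversed_link = DirectedHyperlink("Located In", located_in.tail, located_in.head)
```

Using `frozenset` makes equality insensitive to the order within each part, which mirrors the property that only the head-to-tail order carries meaning.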

Framework
TF-DHP consists of a Tucker decomposition-based hypergraph link prediction model and a BiLSTM-based direction prediction model, which together predict directed hyperlinks among entity sets in a directed hypergraph. It is optimized by a ranking objective in which scores of existing hyperlinks are ranked higher than those of non-existing entity subsets and scores of positive directions are higher than those of negative directions. The framework is shown in Figure 2.
We generalize TuckER [15] to higher orders and regard it as a scoring function. After obtaining the embedding vectors of every entity in an entity set, we use the scoring function to evaluate whether the hyperlink exists. If the hyperlink does exist, we divide the entity set into two groups based on the direction label of each entity and then use the BiLSTM model [17] to evaluate the direction between the groups, which is defined as the direction of the hyperlink. Meanwhile, we also randomly shuffle the entities in each group to increase the training data, exploiting the fact that the order of entities within each group does not influence the direction.

Figure 2. A sketch of the TF-DHP directed hypergraph prediction model. The embeddings of the entity sets to be predicted are fed into the Tucker-decomposition-based layer to calculate the score. The training target is to make the score of existing hyperlinks larger than the score of entity sets without hyperlinks. Then, the embeddings of entities in the existing hyperlinks are sent to the BiLSTM layer to calculate the direction score. The training target is to make the score of the positive direction larger than the score of the negative direction.

Tucker Decomposition-Based Hyperlink Prediction Module
To predict hyperlinks over an entity set, we propose a Tucker decomposition-based scoring function and provide a mathematical proof of its invariance to the order of inputs.

Tucker Decomposition-Based Scoring Function
Tucker decomposition is a tensor decomposition algorithm that decomposes a higher-order tensor into a core tensor and several factor matrices. The core tensor reflects the degree of interaction between the different factor matrices. The formal expression is as follows:

$$\mathcal{X} \approx \omega \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_{k-1} U^{(k-1)},$$

where $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_{k-1}}$ denotes the original tensor, $\omega \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_{k-1}}$ denotes the core tensor, $J_1, J_2, \ldots, J_{k-1}$ are much smaller than $I_1, I_2, \ldots, I_{k-1}$, $k-1$ is the order of $\mathcal{X}$, $U^{(1)}, \ldots, U^{(k-1)}$ denote the factor matrices, and $\times_n$ denotes the tensor product along the $n$-th mode. The dimensions of the core tensor are smaller than those of the original tensor in each mode, so the core tensor can be regarded as a dimensionality-reduced version of the original tensor.
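To make the mode product concrete, the following NumPy sketch rebuilds a third-order tensor from a small core tensor and one factor matrix per mode. The helper names `mode_product` and `tucker_reconstruct` and all sizes are assumptions made for illustration; the sketch only shows the standard definition of the n-mode product, not the authors' implementation.

```python
import numpy as np


def mode_product(tensor, matrix, mode):
    """n-mode product: contract axis `mode` of `tensor` (size J) with `matrix` of shape (I, J)."""
    moved = np.moveaxis(tensor, mode, -1)   # bring the contracted axis last: (..., J)
    result = moved @ matrix.T               # (..., J) @ (J, I) -> (..., I)
    return np.moveaxis(result, -1, mode)    # put the new axis back in its original place


def tucker_reconstruct(core, factors):
    """Apply one factor matrix per mode of the core tensor."""
    out = core
    for mode, factor in enumerate(factors):
        out = mode_product(out, factor, mode)
    return out


rng = np.random.default_rng(0)
core = rng.normal(size=(2, 3, 2))                        # J_1 x J_2 x J_3
factors = [rng.normal(size=(i, j)) for i, j in zip((5, 6, 4), core.shape)]
X = tucker_reconstruct(core, factors)                    # I_1 x I_2 x I_3
print(X.shape)
```

Here the core holds 12 parameters while the reconstructed tensor has 120 entries, which is the dimensionality reduction the paragraph above describes.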
Based on the Tucker decomposition of the representation tensor, we design a scoring function to score each hyperlink. Specifically, if a hyperlink contains $m$ entities, we first select the corresponding entity and relation embeddings. Then, a parameter tensor is designed as the core tensor containing learnable parameters shared by entities and relations [15]. Our goal is to optimize these parameters to fully exploit the relevance among entities and the associated relations based on their embeddings. The scoring function can be expressed as below:

$$\phi(r, v_1, \ldots, v_m) = \mathcal{Z} \times_1 r \times_2 v_1 \times_3 v_2 \cdots \times_{m+1} v_m, \qquad (3)$$

where $m$ changes with the number of entities contained in the hyperlink, and the order of the tensor $\mathcal{Z}$ is equal to one plus the number of entities. $r$ denotes the relation embedding of the hyperlink to be predicted, and $v_1, v_2, \ldots, v_m$ are the embeddings of the entities contained in the hyperlink. Since the tensor product of a tensor with a vector reduces the dimension of the corresponding mode to 1, we can repeat the process $m + 1$ times to obtain a real number, which is regarded as the score of the hyperlink. As every entity in the hyperlink and the relation embedding are computed simultaneously, Equation (3) reduces information loss. Nevertheless, the computational complexity becomes enormous as the number of entities increases because of the inner computation of the high-order tensor product. To address the issue, we use the TR [16] decomposition algorithm. It represents a high-order tensor by a sequence of third-order tensors multiplied circularly, mathematically:

$$\mathcal{T}(i_1, i_2, \ldots, i_d) = \mathrm{Tr}\{Z_1(i_1) Z_2(i_2) \cdots Z_d(i_d)\},$$

where $\mathcal{T}$ denotes the original tensor of size $n_1 \times n_2 \times \cdots \times n_d$, $Z_k$ denotes the $k$-th third-order tensor, of dimensions $r_k \times n_k \times r_{k+1}$, $Z_k(i_k)$ denotes the $i_k$-th lateral slice matrix along the second mode of $Z_k$, and $\mathrm{Tr}$ denotes the trace of the product of matrices. The tensor ring decomposition makes the third dimension of the last decomposed tensor the same as the first dimension of the first decomposed tensor.
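The TR definition can be checked numerically. The sketch below, under assumed sizes and ranks, computes a single entry of a third-order tensor as the trace of a product of lateral slices and compares it with the fully materialised tensor; `tr_element` is a hypothetical helper name.

```python
import numpy as np


def tr_element(cores, index):
    """T(i_1, ..., i_d) = Tr(Z_1(i_1) ... Z_d(i_d)); cores[k] has shape (r_k, n_k, r_{k+1})."""
    product = np.eye(cores[0].shape[0])
    for core, i in zip(cores, index):
        product = product @ core[:, i, :]   # multiply by the i-th lateral slice
    return float(np.trace(product))


rng = np.random.default_rng(0)
ranks = [2, 3, 2]   # r_1, r_2, r_3; the ring closes with r_4 = r_1
sizes = [4, 5, 3]   # n_1, n_2, n_3
cores = [rng.normal(size=(ranks[k], sizes[k], ranks[(k + 1) % 3])) for k in range(3)]

# Materialise the full tensor once (only sensible for tiny sizes) for comparison.
full = np.einsum("aib,bjc,cka->ijk", cores[0], cores[1], cores[2])

value = tr_element(cores, (1, 2, 0))
# Circularly shifting the cores together with their indices leaves the trace unchanged.
shifted = tr_element(cores[1:] + cores[:1], (2, 0, 1))
print(value, shifted, full[1, 2, 0])
```

The three cores hold 24 + 30 + 12 = 66 parameters, compared with 60 entries of the full tensor in this toy case; the saving only becomes substantial when the order or the mode sizes grow.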
The advantage is that a circular shift of the decomposed tensors does not change the result, because of the matrix trace operation. Tensor ring decomposition dramatically reduces the computational load of the model when the tensor order is large by decomposing higher-order tensors into products of third-order tensors. The computational complexity grows sharply as the order of the core tensor grows, so we apply the TR decomposition to the core tensor to decompose the high-order tensor into several third-order tensors multiplied circularly. Based on the definition of TR decomposition, every single parameter in the core tensor can be computed as the trace of a product of matrices. It can be expressed in tensor form [16] as:

$$\mathcal{Z} = \sum_{\alpha_1, \ldots, \alpha_n = 1}^{r_1, \ldots, r_n} z_1(\alpha_1, \alpha_2) \bullet z_2(\alpha_2, \alpha_3) \bullet \cdots \bullet z_n(\alpha_n, \alpha_1),$$

where $z_k(\alpha_k, \alpha_{k+1})$ denotes the vector at the corresponding index of the $k$-th tensor, the symbol $\bullet$ denotes the outer product of vectors, and $r_1, \ldots, r_n$ correspond to the dimensions of the first and third modes of the tensors. We use the simplified form $\mathcal{Z} = \mathrm{Tr}(Z_1, Z_2, \ldots, Z_n)$ to represent the decomposition of the core tensor. Combining with Equation (3), we can rewrite the scoring function as:

$$\phi(r, v_1, \ldots, v_m) = \mathrm{Tr}(Z_1, Z_2, \ldots, Z_n) \times_1 r \times_2 v_1 \times_3 v_2 \cdots \times_{m+1} v_m.$$

This scoring function not only considers all the entity and relation information contained in a hyperlink but also keeps the model complexity within an acceptable range. As shown in Table 2, the scoring function above has fewer parameters than NaLP and is less prone to overfitting on datasets that are not large enough, as shown concretely in Figure 3.

Table 2. Scoring functions of several models for undirected hypergraph link prediction tasks, with the significant terms of their model complexity. $n_e$ and $n_r$ are the numbers of entities and relations, while $d_e$ and $d_r$ are the dimensionalities of entity and relation embeddings, respectively. $n$ is the number of entities in a hyperlink and $d_{max}$ is the maximum size of the TR latent tensors. maxmin is the element-wise difference of the maximum and minimum values of the vectors.

Model | Scoring Function | Model Complexity
RAE | $\lVert \sum_{j=1}^{n} a_j (e_{i_j} - w_{i_r}^{T} e_{i_j} w_{i_r}) + r_{i_r} \rVert_p$ | $O(n_e d_e + n_r d_r)$
NaLP | $\ldots; e_{i_n}]]))))$ | $O(n_e d_e + n n_r d_r)$

This model is based on the Tucker decomposition scoring function, and because the order of the core tensor must be fixed in advance, the model cannot process hyperlinks with different numbers of nodes at the same time. For datasets with such hyperlinks, we need to group them by size before predicting, which increases the workload to a certain extent.
As the order of the core tensors increases, the number of third-order tensors required by TR decomposition increases accordingly, which will increase the amount of computation to a certain extent. The machine used in this paper can deal with the prediction task of hyperlinks with up to six nodes.
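One way to see why the TR form keeps the cost manageable is that the trace and the mode products are both linear, so the score can be evaluated as $\mathrm{Tr}\big(\prod_k \sum_{i} x_k[i]\, Z_k(i)\big)$, where $x_1 = r$ and $x_{k+1} = v_k$, without ever building the full core tensor. The NumPy sketch below compares this with the direct contraction of a materialised core. This identity is standard linear algebra and is offered as an assumption about how the computation could be organised, not as a description of the authors' code; the names `score_dense` and `score_tr`, the embedding size, and the TR rank are all hypothetical.

```python
import numpy as np


def score_dense(core, vectors):
    """Equation (3) evaluated directly: contract every mode of the core with its vector."""
    out = core
    for x in vectors:
        out = np.tensordot(x, out, axes=([0], [0]))   # always contract the leading mode
    return float(out)


def score_tr(tr_cores, vectors):
    """Same score from the TR cores: one small matrix per mode, then a trace."""
    product = np.eye(tr_cores[0].shape[0])
    for z, x in zip(tr_cores, vectors):
        product = product @ np.einsum("aib,i->ab", z, x)   # sum_i x[i] * Z_k(i)
    return float(np.trace(product))


rng = np.random.default_rng(0)
d, rank = 4, 3                                            # assumed embedding size and TR rank
tr_cores = [rng.normal(size=(rank, d, rank)) for _ in range(4)]   # relation + 3 entities
r, v1, v2, v3 = (rng.normal(size=d) for _ in range(4))

# Materialise the fourth-order core only to verify the shortcut on a tiny example.
dense_core = np.einsum("aib,bjc,ckd,dla->ijkl", *tr_cores)

print(score_dense(dense_core, [r, v1, v2, v3]), score_tr(tr_cores, [r, v1, v2, v3]))
```

In this sketch the dense core has $d^{4} = 256$ entries, whereas the TR route touches only four tensors of $3 \times 4 \times 3$ entries each and multiplies $3 \times 3$ matrices.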

Proof of Sequence Independence
As illustrated above, the Tucker decomposition processes the inputs sequentially, while the order of entities contained in one hyperlink does not influence the determination, which requires our scoring function to have an invariance property. We prove that the order of the entity and relation embeddings in the tensor product makes no difference to the result. We first write the scoring function in tensor-wise form, $\phi(r, v_1, \ldots, v_m) = \mathcal{Z} \times_1 r \times_2 v_1 \cdots \times_{m+1} v_m$ with $\mathcal{Z} = \mathrm{Tr}(Z_1, Z_2, \ldots, Z_n)$. In the TR decomposition mentioned above, the matrix trace operation and the matching dimensions of the input and output ensure invariance under circular shifting. When it comes to the hypergraph, the dimensions of entities and relations are set to a fixed value, which gives invariance not only under circular shifting but also under changes of order between individual entities. It means that changing the order of the product does not change the result. So, we just need to prove that the order of the tensor product in the Tucker decomposition has no effect on the result. The element-wise form of the tensor product is as follows:

$$x_{i_1 i_2 \cdots i_n} = \sum_{j_1=1}^{J_1} \sum_{j_2=1}^{J_2} \cdots \sum_{j_n=1}^{J_n} \omega_{j_1 j_2 \cdots j_n}\, u^{(1)}_{i_1 j_1} u^{(2)}_{i_2 j_2} \cdots u^{(n)}_{i_n j_n}. \qquad (8)$$

On the right-hand side of the equation, if we regard the indices $j_1, \ldots, j_n$ as a set of independent integer variables ranging from 1 to $J_1, \ldots, J_n$, respectively, then each $u^{(n)}_{i_n j_n}$ can be regarded as a function of these independent variables, whose value is the element at the corresponding position of the entity embedding vector indexed by the independent variable. We use $f_1(j_1), f_2(j_2), \ldots, f_n(j_n)$ (in Equation (9)) to represent these functions. The expression $\omega_{j_1 j_2 \cdots j_n}$ can be regarded as a multivariate function of the form $g(j_1, j_2, \ldots, j_n)$, whose value is the parameter at the corresponding position of the core tensor.
Then, we find that if we let the independent variables take all real values from 1 to $J_n$ instead of only integers, we can transform Equation (8) into a multiple definite integral:

$$\int\!\cdots\!\int_D g(j_1, j_2, \ldots, j_n)\, f_1(j_1) f_2(j_2) \cdots f_n(j_n)\, dj_1\, dj_2 \cdots dj_n. \qquad (9)$$

The integration domain $D$ of this multiple integral is an n-dimensional region that has the same size as the core tensor. Changing the order of the independent variables in $g(j_1, j_2, \ldots, j_n)$ does not change the corresponding parameter; thus, the order of $j_1, \ldots, j_n$ has no influence on the function $g(j_1, j_2, \ldots, j_n) f_1(j_1) f_2(j_2) \cdots f_n(j_n)$.
Since the functions $f_1(j_1), \ldots, f_n(j_n)$ are all unary functions, each depends on a single integration variable, and the integral can be rewritten in terms of the multiple definite integral $\int\!\cdots\!\int_D g(j_1, j_2, \ldots, j_n)\, dj_1\, dj_2 \cdots dj_n$. For this multiple definite integral, the limits of integration in each dimension are finite constants, and the order of $j_1, \ldots, j_n$ makes no difference to the function, so changing the order of integration does not change the value of the definite integral. Therefore, the whole integral has the invariance property. Because Equation (8) is a special case of Equation (9), the scoring function is proven to have the invariance property.
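The finite-sum counterpart of this argument can be checked numerically: the nested sum in Equation (8) gives the same value whichever mode is summed first, as long as each embedding stays attached to its own mode of the core tensor. The NumPy sketch below evaluates the contraction in every possible mode sequence for a small random core; `contract_in_order` and the tensor sizes are hypothetical and used only for this check.

```python
import itertools

import numpy as np


def contract_in_order(core, vectors, order):
    """Contract mode k of `core` with vectors[k], visiting the modes in the given sequence."""
    out = core
    remaining = list(range(core.ndim))        # original mode labels still present in `out`
    for mode in order:
        axis = remaining.index(mode)          # current position of that mode
        out = np.tensordot(out, vectors[mode], axes=([axis], [0]))
        remaining.pop(axis)
    return float(out)


rng = np.random.default_rng(0)
core = rng.normal(size=(3, 4, 5))
vectors = [rng.normal(size=n) for n in core.shape]

scores = [
    contract_in_order(core, vectors, order)
    for order in itertools.permutations(range(core.ndim))
]
print(scores)
```

All six evaluation sequences agree up to floating-point rounding, which is the discrete analogue of exchanging the order of integration over a domain with constant limits.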

BiLSTM-Based Direction Prediction Module
In the directed hyperlink prediction problem, the embedding of each entity further determines the existence of a hyperlink and its direction. However, different from existence prediction, the direction of a hyperlink emphasizes the order of entities. For example, in the relational knowledge "WDC, Washington D.C $\xrightarrow{\text{Capital Of}}$ USA, The United States", the direction goes from WDC and Washington D.C (the head entities) to USA and The United States (the tail entities). Once an entity is placed in the wrong component, the relation might not even exist. In addition, the interaction between the two components, e.g., conservation of materials, indicates that the model cannot determine the components individually. Therefore, we apply BiLSTM in our module to encode all entities sequentially so that information is passed both forward and backward.
As shown in Figure 4, the BiLSTM consists of several LSTM hidden layers. These hidden layers are divided into two groups that run in opposite directions. The entity embeddings in the hyperlink are processed one by one in the hidden layer at the corresponding position. Meanwhile, the state of the previous hidden layer is passed to the next hidden layer together with the embedding of the entity fed into that layer. After all hidden layers have been calculated, an embedding containing all the sequential information is generated. The same process occurs in the backward hidden layer group, which means we obtain two embeddings of the hyperlink. We concatenate them into one vector and then send it to a Softmax layer to obtain the direction score. The process can be expressed as follows:

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(w_t, \overrightarrow{h_{t-1}}), \qquad \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(w_t, \overleftarrow{h_{t+1}}), \qquad h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t},$$

where $h_t$ denotes the concatenated embedding of the sequential representation, $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are calculated by the two groups of hidden layers running in opposite directions, $w_t$ denotes the embedding of the $t$-th entity, and the symbol $\oplus$ denotes the concatenation operation. During training, we keep the order of the two parts, randomly shuffle the entities within each part, and give all generated instances a correct label to force the BiLSTM to exploit features of the direction. The strategy enlarges the data scale without introducing external manual effort, which also helps to tackle the low-data regime problem.
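The shuffling strategy can be sketched in a few lines of plain Python. The function below produces sequences in which the head entities always precede the tail entities (labelled 1) and, as an assumption made for this illustration, reversed sequences with the tail first (labelled 0) to serve as wrong-direction examples. The function name, the labels, and the `copies` parameter are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_instances(
    head: Sequence[str], tail: Sequence[str], copies: int, seed: int = 0
) -> List[Tuple[List[str], int]]:
    """Shuffle entities inside each part while keeping the order between the two parts."""
    rng = random.Random(seed)
    instances = []
    for _ in range(copies):
        h, t = list(head), list(tail)
        rng.shuffle(h)                      # order inside the head is irrelevant
        rng.shuffle(t)                      # order inside the tail is irrelevant
        instances.append((h + t, 1))        # head before tail: correct direction
        instances.append((t + h, 0))        # tail before head: assumed negative example
    return instances


head = ["NYC", "New York City", "The Big Apple"]
tail = ["USA", "The United States"]
data = shuffled_instances(head, tail, copies=3)
for sequence, label in data:
    print(label, sequence)
```

Each call multiplies one annotated hyperlink into several training sequences, which is how the strategy enlarges the data without extra annotation.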

Training
TF-DHP is a pipeline model, which means that we predict the hyperlink's existence in the first stage and judge the direction of the hyperlink in the second stage. If we use the data of undirected hypergraphs to train the first stage of the model separately, we can obtain a model that can perform link prediction of undirected hypergraphs. If the whole model is trained on the data of the directed hypergraph, the trained model can have the ability to predict directed hyperlinks.
TF-DHP is trained in two stages, mirroring the framework. The training goal of the first stage is to give existing hyperlinks a higher score while decreasing the score of entity sets that cannot form a hyperlink. With the initial embeddings of entities and their labels as input, we use the Tucker decomposition-based scoring function to obtain the two kinds of scores, and a binary cross-entropy loss function is designed to maximize their gap.
After the first stage of the model is trained, we acquire the updated core tensor and embeddings and use these embeddings to initialize the second stage of the model. Two kinds of scores are calculated in the BiLSTM. One is the score of the correct direction, and the other is the score of the wrong direction. The loss function is as follows:

$$L = f_{mean}\Big(\log\big(1 + e^{\,f_{mean}(\sigma(\phi_{d_n})) - \sigma(\phi_{d_p})}\big)\Big), \qquad (15)$$

where $f_{mean}$ denotes an average function, $\sigma$ denotes the sigmoid function, $\phi_{d_n}$ denotes the score of each negative hyperlink, and $\phi_{d_p}$ denotes the score of each positive hyperlink. Finally, the BiLSTM-based model updates the model parameters and the embeddings of entities and relations based on the loss gradients.
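Equation (15) can be evaluated with the standard library alone. The sketch below is one reading of the formula, in which the inner mean runs over the negative scores and the outer mean over the positive scores; that reading and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def direction_loss(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Ranking loss of Equation (15): mean over positives of log(1 + exp(mean_neg - pos))."""
    neg_mean = sum(sigmoid(s) for s in neg_scores) / len(neg_scores)
    terms = [math.log1p(math.exp(neg_mean - sigmoid(s))) for s in pos_scores]
    return sum(terms) / len(terms)


well_ranked = direction_loss(pos_scores=[3.0, 2.5], neg_scores=[-2.0, -3.0])
badly_ranked = direction_loss(pos_scores=[-2.0, -3.0], neg_scores=[3.0, 2.5])
print(well_ranked, badly_ranked)
```

The loss is smaller when positive directions score above negative ones, and it equals $\log 2$ when a single positive and a single negative receive the same score.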

Experiment
This section reports the experiments.

Experimental Setup
We detail the adopted datasets, evaluation metrics, parameters, and baselines.

Datasets
We use two public relational datasets in our experiments for undirected hypergraph link prediction and one canonicalized open KB dataset for directed hypergraph link prediction. We briefly describe these datasets below.
• WikiPeople [11]: WikiPeople is a public n-ary relational dataset concerning entities of type human extracted from Wikidata. WikiPeople is an incomplete hypergraph with many hyperlinks missing [11]. In WikiPeople, each set of entities has one kind of relationship. We use this dataset to train the undirected hyperlink prediction model.
• JF17K [8]: JF17K is a public n-ary relational dataset that has high-quality facts. It is filtered from Freebase while preserving its multi-fold relational structures. As in WikiPeople, each set of entities has one kind of relationship, and we use this dataset to train the undirected hyperlink prediction model.
• ReVerb15K [9,28]: ReVerb45K is an open KB canonicalization dataset [28], constructed by intersecting information from the ReVerb Open KB [29], the Freebase entity linking information from [30], and the Clueweb09 corpus [31]. In the triples of the original dataset, different subjects or objects may have the same meaning. Based on the Freebase entity linking information, we cluster the synonyms of the subjects or objects into one set and use each cluster to represent the new subject or object. In this way, a canonicalized directed hypergraph dataset is obtained. Since it contains about 15 K entities, we call it ReVerb15K. The clustered subject entities form the head of a hyperlink, and the clustered object entities form the corresponding tail; the direction is from head to tail.
The sizes of the datasets are shown in Table 3.
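The construction of the directed dataset described above can be sketched as follows. This is only an illustration under stated assumptions: `entity_link` is a hypothetical dictionary standing in for the Freebase entity linking information, every subject and object is assumed to have a link, and relation phrases are left untouched because the text does not say whether they are canonicalized.

```python
from collections import defaultdict


def canonicalize(triples, entity_link):
    """Turn (subject, relation, object) triples into directed hyperlinks.

    triples:     iterable of surface-form triples.
    entity_link: dict mapping a surface form to a linked entity id
                 (hypothetical stand-in for the entity linking data).
    Returns a set of (head, relation, tail) tuples, where head and tail are
    frozensets of synonymous surface forms and the direction is head -> tail.
    """
    # Group all surface forms that link to the same entity id.
    clusters = defaultdict(set)
    for mention, entity_id in entity_link.items():
        clusters[entity_id].add(mention)

    hyperlinks = set()
    for subj, rel, obj in triples:
        head = frozenset(clusters[entity_link[subj]])
        tail = frozenset(clusters[entity_link[obj]])
        hyperlinks.add((head, rel, tail))
    return hyperlinks


# Toy example following the "Located In" illustration from the introduction.
links = {
    "NYC": "e1", "New York City": "e1", "The Big Apple": "e1",
    "USA": "e2", "The United States": "e2",
}
toy_triples = [
    ("NYC", "Located In", "USA"),
    ("New York City", "Located In", "The United States"),
]
toy_hyperlinks = canonicalize(toy_triples, links)
```

In this toy case the two triples collapse into a single directed hyperlink whose head holds the three names of the city and whose tail holds the two names of the country.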

Metrics and Parameters
We test the effectiveness of the model in two parts. One is the Tucker-decomposition-based model for predicting the undirected hyperlinks; the other is the whole framework for predicting the directed hyperlinks. The hyperlinks in each dataset are divided into three parts: 20% for training, 10% for validation, and 70% for testing. We evaluate the link prediction performance via two standard metrics: MRR (mean reciprocal rank) and Hits@k. MRR is the mean of the reciprocal ranks over all testing facts, while Hits@k measures the proportion of testing facts whose correct answer is ranked within the top k. Higher MRR and Hits@k indicate better performance.
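As a small worked illustration of the two metrics, the sketch below computes MRR and Hits@k from a list of ranks. The function names and the example ranks are illustrative only and are not taken from the experiments; each rank is assumed to be the 1-based position of the correct answer among the scored candidates.

```python
def mrr(ranks):
    """Mean reciprocal rank over 1-based ranks of the correct answers."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test facts whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Illustrative ranks for four test facts (not experimental data).
example_ranks = [1, 2, 5, 10]
example_mrr = mrr(example_ranks)              # (1 + 1/2 + 1/5 + 1/10) / 4
example_hits1 = hits_at_k(example_ranks, 1)   # only the first fact is at rank 1
example_hits10 = hits_at_k(example_ranks, 10)
```

Both metrics lie in (0, 1], and MRR rewards top positions more heavily than Hits@k, which treats all ranks within the cutoff equally.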

Baselines
We compare TF-DHP with the following n-ary hyperlink prediction baselines:
• RAE [8]: RAE is a translational distance model that considers the likelihood of co-occurrence between entities in n-order relations, models this relatedness with an MLP, and reflects it in the scoring function.
• NaLP [11]: NaLP is a neural network model that achieves state-of-the-art n-ary hypergraph link prediction performance.
• HGNN [12]: This is a general hypergraph neural network framework for data representation learning based on the hypergraph convolution operation, which can incorporate multi-modal data and complicated data correlations. We use maxmin + as a scoring layer and a direction scoring layer [9] for directed hyperlink prediction with HGNN.
• HyperGCN [13]: This is a method for training a GCN on hypergraphs using tools from the spectral theory of hypergraphs. Since it was not directly proposed for hyperlink prediction, we use the same scoring layers as those used with HGNN.
• NHP-U-mean and NHP-U-maxmin [9]: These two methods are both based on the GCN layer. NHP-U-mean uses mean as the scoring layer, while NHP-U-maxmin uses maxmin + as the scoring layer to predict hyperlinks. These two methods are proposed for undirected hyperlink prediction.
• NHP-D-mean and NHP-D-maxmin [9]: These two methods add a direction scoring layer to NHP-U-mean and NHP-U-maxmin to predict directed hyperlinks.

Experiment on Undirected Hypergraphs
Tables 4 and 5 show the undirected hyperlink prediction results on the two datasets. The highest scores are set in bold. As shown in the tables, our proposed TF-DHP consistently achieves the best results on all metrics. On both datasets, the graph-neural-network-based NHP variants with the mean or maxmin scoring functions do not achieve comparable performance. For example, on WikiPeople, the MRR of the first four methods is only about a third of that of TF-DHP, and their Hits@10 is about a half.
The large improvement of TF-DHP supports the view that scoring functions such as mean or maxmin largely ignore the influence of the representation of each entity in the hyperlink on the predicted results, which also reflects the advantage of the Tucker-decomposition-based model in taking every entity embedding into the computation. As for the translational distance model RAE, although it achieves slightly better results than the four methods, its results are still unsatisfying. On WikiPeople, TF-DHP improves MRR by 0.21 and Hits@1 by 0.15, which is a considerable improvement. The main reason for the unsatisfying performance of RAE is the restriction on relations imposed by the translational distance model. Such a restriction does not exist in the Tucker-decomposition-based model: owing to its full expressiveness, Tucker decomposition can accurately represent any ground truth over a set of entities and relations [15].
The performance of NaLP is much better than that of the aforementioned methods owing to its enormous number of model parameters. It uses a neural network to greatly reduce the restriction on relations that exists in the translational distance model. However, a large number of parameters makes it easy to over-fit, especially when the training datasets are not big enough. According to the network structure and scoring function of NaLP, its model complexity is $O(n_e d_e + n n_r d_r)$, where $n_e$ and $d_e$ are the number of entities and the dimension of the entity embeddings, respectively, $n$ is the number of entities in one relation, and $n_r$ and $d_r$ are the number of relations and the dimension of the relation embeddings, respectively. In contrast, the model complexity of the first stage of TF-DHP is only $O(n_e d_e + n_r d_r + n d_{max}^3)$, where $d_{max}$ is the maximum dimension of the third-order tensors in TR decomposition. Since the number of relations in hypergraphs is much larger than the dimension of the decomposed tensors, the model complexity of NaLP is larger than that of TF-DHP. As shown in Figure 3, NaLP requires more training epochs than TF-DHP to achieve its optimal result. Moreover, because the excessive number of parameters in NaLP leads to over-fitting, its results decrease once the number of epochs exceeds 100. In contrast, owing to its relatively few parameters, the results of TF-DHP remain relatively stable after reaching the optimal result during training.
Table 6 shows the results of several directed hyperlink prediction models. The highest scores are set in bold. To the best of our knowledge, there are few models dealing with the hyperlink prediction problem in directed hypergraphs. As shown in Table 6, TF-DHP obtains a considerable improvement over the other methods. For example, compared with the best baseline, NHP-D-maxmin, TF-DHP improves MRR by 0.056 and Hits@10 by 0.026. We believe that there are two main reasons for the better prediction performance of TF-DHP on directed hypergraphs.
First, when testing the directed hypergraph prediction model, we take the weighted average of the scores computed in the two stages of the model as the final score, which means we regard an entity set as positive only if a directed hyperlink exists among the entities and its direction is also correct. Therefore, the accuracy of the first stage of the model inevitably affects the performance of the whole model. Second, NHP-D-maxmin and the other methods in Table 6 use the average of the entities' embedding vectors to represent the embedding of the hyperlink and take the product of the embedding vectors of the head and tail parts of the hyperlink as the scoring function. As mentioned above, these methods ignore the influence of each entity embedding on the direction of the hyperlink and the relationship between an entity and its adjacent entities. The improvement in the experimental results indicates that considering the representation of each entity separately, together with the information of the adjacent entities (from forward to backward), can improve the accuracy of directed hypergraph prediction.
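The weighted combination of the two stage scores at test time can be sketched as below. The text does not give the weight, so `weight=0.5` is purely an assumed default, and the function and argument names are hypothetical.

```python
def final_score(existence_score, direction_score, weight=0.5):
    """Weighted average of the two stage scores.

    existence_score: first-stage (Tucker-decomposition-based) score that a
                     hyperlink exists among the entities.
    direction_score: second-stage (BiLSTM) score that the direction is correct.
    weight:          share given to the first stage; 0.5 is an assumption,
                     the actual value is not stated in the text.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * existence_score + (1.0 - weight) * direction_score


# A candidate that exists but points the wrong way scores lower than one
# that is plausible in both stages (scores here are illustrative only).
right_direction = final_score(0.9, 0.8)
wrong_direction = final_score(0.9, 0.1)
```

Because both stage scores enter the average, a weak first-stage score lowers the final score even when the direction is predicted well, which matches the observation that first-stage accuracy affects the whole model.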

Parameter Analysis
Embedding size is a significant factor in hyperlink prediction models, determining the performance of the model to a large extent. Hence, we will analyze the results obtained by the model in different embedding sizes to investigate its impact.
First, according to Figure 5a, TF-DHP outperforms the other methods at every embedding size. The MRR of TF-DHP increases sharply as the embedding size first grows and levels off once the embedding size reaches 15. The MRR of NaLP is almost identical to that of TF-DHP at small embedding sizes; however, owing to its large number of parameters, it cannot remain stable like TF-DHP as the embedding size increases. Beyond a certain embedding size, NaLP's MRR decreases. For the other methods, the change in embedding size has less influence on the experimental results because of their smaller number of parameters.
Figure 5b shows the impact of embedding size on directed hyperlink prediction. As in undirected hyperlink prediction, TF-DHP always outperforms the other methods. With the BiLSTM added, the optimal embedding size of the model increases to 25, after which the MRR levels off. As for the other methods, the addition of the direction scoring function also increases the optimal number of parameters, and their curves follow a trend similar to that of TF-DHP.
This demonstrates the stability of TF-DHP with respect to the choice of embedding dimension. In addition, the moderate number of parameters in TF-DHP makes it more stable, whereas the performance of other models may decrease as the dimension increases because of over-fitting.
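To make the parameter argument concrete, the following sketch evaluates the two complexity expressions given earlier for NaLP and for the first stage of TF-DHP. The expressions are orders of growth, so constant factors are ignored, and the numeric values below are illustrative assumptions rather than statistics of WikiPeople or JF17K.

```python
def nalp_complexity(n_e, d_e, n_r, d_r, n):
    """Order of the NaLP model size: n_e*d_e + n*n_r*d_r."""
    return n_e * d_e + n * n_r * d_r


def tfdhp_stage1_complexity(n_e, d_e, n_r, d_r, n, d_max):
    """Order of the first-stage TF-DHP model size: n_e*d_e + n_r*d_r + n*d_max^3."""
    return n_e * d_e + n_r * d_r + n * d_max ** 3


# Illustrative setting (assumed values, not taken from the datasets).
setting = dict(n_e=10_000, d_e=25, n_r=300, d_r=25, n=6)
nalp_size = nalp_complexity(**setting)
tfdhp_size = tfdhp_stage1_complexity(d_max=5, **setting)
```

Under these assumed values the entity embeddings dominate both totals; the gap comes from the relation term, which NaLP multiplies by the arity n, while the first stage of TF-DHP adds only the small n*d_max^3 term.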

Approximate Training Time Comparison
On the two undirected datasets, WikiPeople and JF17K, TF-DHP takes around 45 min of training time, while NaLP and RAE take around 3 h and 1 h, respectively. On the directed dataset ReVerb15K, TF-DHP takes around 1 h of training time, while NHP-D-maxmin and NHP-D-mean take around 15 min each owing to their oversimplified scoring functions. All experiments were run on a machine with a GeForce GTX 1080 super GPU.

Ablation Study
Since the experiments on the directed hypergraph dataset have demonstrated the effectiveness of the BiLSTM model, we design an ablation study to examine the influence of TR decomposition within Tucker decomposition. We build a variant of TF-DHP on WikiPeople that does not apply TR decomposition to the Tucker decomposition, and we call it n-Tucker. As shown in Figure 6, without TR decomposition, the computational complexity of the model greatly increases, which results in over-fitting. Similar to NaLP, though with better results, n-Tucker reaches its optimal MRR and then gradually declines because of over-fitting. This experiment supports both the advantage of the Tucker-decomposition-based model and the necessity of the TR decomposition.

Conclusions and Future Work
In this paper, we introduce TF-DHP, a novel model for hyperlink prediction in both undirected and directed hypergraphs. We use a tensor-decomposition-based method to handle the undirected part and add a BiLSTM model to predict the direction of the hyperlink. TF-DHP is a pipelined model that is flexible enough to handle both directed and undirected hypergraphs. The experimental results verify the advantages of TF-DHP in both settings across multiple datasets.
In the future, we plan to further look into heterogeneous hypergraphs where there are multiple types of high-order relations, such as inclusion relations and produce relations, and to see how directed hypergraphs can be used on reaction prediction in chemical or biological domains.