Exploring the Performance of Continuous-Time Dynamic Link Prediction Algorithms

Dynamic Link Prediction (DLP) addresses the prediction of future links in evolving networks. However, accurately portraying the performance of DLP algorithms poses challenges that might impede progress in the field. Importantly, common evaluation pipelines usually calculate ranking or binary classification metrics, where the scores of observed interactions (positives) are compared with those of randomly generated ones (negatives). However, a single metric is not sufficient to fully capture the differences between DLP algorithms, and is prone to overly optimistic performance evaluation. Instead, an in-depth evaluation should reflect performance variations across different nodes, edges, and time segments. In this work, we contribute tools to perform such a comprehensive evaluation. (1) We propose Birth-Death diagrams, a simple but powerful visualization technique that illustrates the effect of time-based train-test splitting on the difficulty of DLP on a given dataset. (2) We describe an exhaustive taxonomy of negative sampling methods that can be used at evaluation time. (3) We carry out an empirical study of the effect of the different negative sampling strategies. Our comparison between heuristics and state-of-the-art memory-based methods on various real-world datasets confirms a strong effect of using different negative sampling strategies on the test Area Under the Curve (AUC). Moreover, we conduct a visual exploration of the predictions, with additional insights into which types of errors are prominent over time.


Introduction
Many real-world phenomena such as computer networks [44], epidemics [29], neural networks [2], email exchanges [20], and face-to-face interactions [6,20] can be modeled as a set of objects interacting through time. This type of data is commonly represented as a dynamic graph [13], where nodes represent the objects and edges represent pairs of objects that interact through time. While initial attempts at capturing the temporal evolution of networks typically aggregated the interactions into a sequence of static graphs, recent efforts incorporate time continuously, avoiding the loss of fine-grained temporal information during data preprocessing [13,31]. The resulting type of data is commonly referred to as Continuous-Time Dynamic Graphs (CTDGs) [33]. Modeling and forecasting CTDGs has recently become a very active field of research, as suggested by recent surveys [17,26]. A crucial task of interest is Dynamic Link Prediction (DLP), where the goal is to predict future links from a history of observed ones. This task has gained considerable attention, as seen from recent benchmarks [14], and finds notable applications in recommender systems, influence detection, routing in networks, and disease prediction [36,21], to name a few.
Creating a standardized evaluation for Dynamic Link Prediction (DLP) algorithms poses significant challenges [33]. Firstly, the benchmark datasets available vary widely in nature, leading to different domain-specific DLP tasks.
Predicting which pair of students will have a face-to-face interaction in the HighSchool dataset at a given time is, for instance, very different from predicting which item a given user will interact with in the Wikipedia dataset. Secondly, the evaluation pipelines differ among methods. This often results in near-perfect performance metrics and contributes to a bias where each paper tends to favor its own proposed approach. Lastly, typical metrics for DLP will compare the score of the actual interactions occurring in the dynamic graph with the scores of interactions that didn't happen, obtained through random Negative Sampling (NS) [32]. As underlined in [33], the procedure used to generate these negative events can have a dramatic impact on DLP performance measures, to the point that some sophisticated methods can often be outperformed by parameter-free heuristics.
As a result, there is a growing awareness that dynamic link prediction performance measures not only depend on the model quality and the challenging nature of the data but crucially also on the strategy for sampling negative events. Notably, a suggestion proposed by Poursafaei et al. [33] was to generate more challenging negative samples by examining which edges were previously seen or not at test time. These conclusions align with the guidelines proposed by Junuthula et al. [16], who suggest splitting the DLP task into two tasks: predicting previously observed links and predicting previously unobserved links, each of these tasks coming with its own specific metrics. This same work attributes these challenges to the fact that, while conventional Machine Learning tasks target independent and identically distributed (iid) data, events, nodes, and node pairs in a (dynamic) graph do not satisfy this property. As a consequence, the prediction performance is likely to exhibit substantial variations depending on the node, edge, or time interval considered.
Despite this growing awareness, central aspects of the DLP task remain ambiguous to this day. Notably, there is a lack of tools for understanding the domain-dependent effect of splitting a history of interactions into a train history and a test history based on a cutoff time. Yet, such an understanding is crucial for designing relevant NS strategies for evaluation. Furthermore, the time-evolution of prediction performance tends to be disregarded, despite its significant relevance in real-world applications.
Contributions In this work we investigate the open challenges discussed above through visualization and empirical evaluation.
1. We introduce the Birth-Death diagram. As illustrated in Fig. 1, this simple plot facilitates the visualization and comparison of the lifetimes of nodes and edges (potentially extending to higher-order structures) in a CTDG. Crucially, this visualization tool enables a clear representation of the partitioning of these objects as influenced by the time-based splitting of the history of events. We discuss two key measures derived from these plots, the Node and Edge Surprise indices, quantifying the difficulty of DLP. We analyze and compare different real-world datasets in terms of these measures.
2. We demonstrate the utility of Birth-Death diagrams in the design of more useful Negative Sampling (NS) strategies for evaluation. By means of these diagrams, we construct a comprehensive taxonomy that categorizes the types of nodes/edges suitable for use as negative instances against which to contrast the scores of positive events. Subsequently, we leverage this taxonomy to develop more targeted NS strategies specifically intended for evaluation. We assess six key NS strategies derived from this approach and analyze the resulting variations in performance.
3. Finally, we incorporate time in the evaluation and present a simple visualization method to analyze the time-evolution of Dynamic Link Prediction performance of several recent methods and some heuristics.
Our experiments confirm that the performance of methods depends highly on the strategy used for NS. Moreover, the strategy which leads the model to commit more prediction errors (i.e. where the model scores the negative event higher than the positive) varies through time. This observation opens up opportunities for comparing methods empirically by juxtaposing their profiles of performance over time.
Outline This paper is organized as follows. In Sec. 2 we start by discussing related work on Dynamic Graphs, Dynamic Link Prediction, and strategies for evaluating this task. In Sec. 3 we formally introduce the Birth-Death diagrams, along with corresponding statistics (the Node and Edge Surprise indices), which are central to assessing the difficulty of the DLP task on a given dataset. In Sec. 4 we subsequently discuss the evaluation of Dynamic Link Prediction algorithms through NS, and propose a taxonomy of the types of negative samples that can be derived. Finally, in Secs. 5 and 6 we conduct numerical experiments to assess the effect of the different NS strategies, and the evolution of the predictive performance over time.

Fig. 1 caption (dataset from [8]): The y and x coordinates of each node/edge represent its first (Birth) and last (Death) interaction time, respectively. Given a cutoff time t_split, while the history of interactions gets divided into a train and a test set, the nodes and edges get partitioned into three categories: Historical, Overlap, and Inductive. The Surprise Index is the ratio Inductive / (Inductive + Overlap).

Related Work
Temporal Networks. Temporal Networks have been utilized to model extensive systems that comprise entities interacting over time; see Masuda and Lambiotte [31] and Holme and Saramäki [12] for a general introduction. As highlighted by Rozenshtein and Gionis [35], Temporal Networks have been studied under different terminologies, including Dynamic/Temporal Graphs/Networks, depending on the task and the way time is modeled. In particular, some works consider time as discrete [10] and others as continuous [22]. In the present paper, we consider Continuous-Time Dynamic Graphs, as defined in [33,34], and the representation used is the "sequence of interaction model of Temporal Networks" [35]. The emphasis is on predicting the events on different time intervals, and not on studying emergent properties. Although our focus is on continuous-time graphs, the methodology proposed in this paper applies to both discrete and continuous-time representations.
Visualizing Temporal Networks. Due to the additional complexity introduced by the time dimension, visualizing temporal graphs faces unique challenges, as discussed in recent surveys [1]. As detailed by Linhares et al. [25], the time aspect exacerbates visual clutter issues, which are already common in static network visualization. They propose Node Activity Maps, a method to visualize node activity over time, but do not consider the evaluation of Dynamic Link Prediction as a use case. Temporal Edge Traffic (TET) and Temporal Edge Activity (TEA) plots proposed by Poursafaei et al. [33] enable understanding the effect of time-based splitting on the edges. However, as these visualizations focus on visualizing the events directly, edges that are exclusively observed in the training set are represented in the same region as those observed in both the train and test sets, rendering a visual comparison of these two sets difficult. In contrast, our proposed Birth-Death diagram focuses on directly illustrating the division of nodes and edges into three distinct regions.
Methods for DLP. (Dynamic) Link Prediction is a longstanding problem, and various classes of methods have been proposed to address it. Early efforts focused on using traditional tools from statistics and network science, often borrowing from the existing literature on modeling static networks. These include univariate time-series models [9,15], similarity-based methods [23], probabilistic generative models [7,11,42,18], and matrix and tensor factorization [5]. Nevertheless, with the success of deep learning and representation learning on static graphs, recent approaches have shifted towards using neural networks, as surveyed in Kazemi et al. [17] and Longa et al. [26]. Notably, memory-based dynamic graph neural networks (DGNNs) such as TGN [34] and DyRep [37] use an encoder-decoder architecture. The encoder maps each node to a time-varying representation in a low-dimensional space, while the decoder allows calculating the probability of interactions from the latent representations of the nodes. More generally, the idea of learning a vector representation (embedding), either at the node level or the edge level, has been explored in several other methods [3,28,40,39,41,45].
It is important to note that there is currently a lack of fair and objective comparison between these embedding-based techniques and shallow methods such as the ones previously detailed. Our objective here is not to propose a novel method for DLP, but rather to introduce a new visualization approach capable of effectively illustrating the DLP task and supporting the performance evaluation of existing methods. Nonetheless, Birth-Death diagrams are particularly pertinent for assessing and diagnosing memory-based DGNNs. Indeed, these models work by maintaining a memory state summarizing the history associated with specific nodes or edges over time. The accuracy and usefulness of this memory state are greatly influenced by the times at which these nodes/edges start (Birth Time) and stop (Death Time) interacting. We emphasize that our study does not consider Negative Sampling for training, which has its own challenges [4], but rather for evaluation in Link Prediction.
Challenges in Evaluating DLP algorithms. Although many methods for static and dynamic Link Prediction have been proposed in the past, the formal definition and evaluation of this task have been subject to many debates and refinements over the years. Many evaluation methods have been proposed, including set-based metrics [23], Receiver Operating Characteristic (ROC) curves and the associated Area Under the ROC Curve (AUC-ROC) [27,24], and Average Precision [43]. While the above studies used the time information mainly for splitting the data into a training and a test set, Tylenda et al. [38] presented ways to incorporate the time aspect into the method and evaluation, demonstrating its positive impact on performance. Subsequently, Junuthula et al. [16] suggested separating the DLP problem into two tasks: the prediction of either recurring edges or newly observed edges. They propose a metric combining AUC-ROC and the Area Under the Precision-Recall Curve to incorporate these two aspects.
In these studies, the impact of Negative Sampling was often overlooked in the evaluation. More recently, Poursafaei et al. [33] proposed more challenging negative samples for deep learning-based DLP methods. They introduced three strategies: Random, Historical, and Inductive. In this paper, we extend this literature by proposing a visualization-based method for separating the possible negative samples into categories, with an emphasis on distinctively sampling from Overlap and Historical edges/nodes. Moreover, we introduce a principled way of scrutinizing the changes in performance over time, depending on the negative sampling strategy used for evaluation.

Understanding the Effect of Splitting a Dynamic Graph based on Time
In this section, we provide some background on Continuous-Time Dynamic Graphs (CTDGs). Subsequently, we introduce the notions of birth and death time, and the associated Birth-Death diagrams, a visualization tool that allows one to understand the effect of splitting a dynamic graph based on time. Based on this tool, we introduce the Node and Edge Surprise indices as metrics for quantifying the difficulty of predicting future links on a given dynamic graph dataset.

Background: Continuous-Time Dynamic Graphs
For a set of nodes U and maximal time T, a Continuous-Time Dynamic Graph (CTDG) is defined through a stream H = {(u, v, t)} ⊂ U × U × [0, T] of events (u, v, t) that each represent an interaction of the source node u with the destination node v at timestamp t. We use the term edge to refer to a pair of nodes (u, v) at an unspecified time. In directed graphs, such an edge (u, v) in an event (u, v, t) is an ordered pair of nodes; in other words, this means that node u sends an interaction to v at time t. Conversely, undirected graphs treat edges as unordered pairs of nodes {u, v}; so an interaction (u, v, t) means that u and v interacted at time t. In practice, an undirected edge can be uniquely identified by (min(u, v), max(u, v)). Further, note that CTDGs allow events to occur at any continuous-valued timestamp 0 ≤ t ≤ T and allow multiple events to happen at the same time.
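As a minimal sketch (with hypothetical node labels and timestamps), such an event stream can be stored directly as a list of (u, v, t) triples:

```python
from typing import NamedTuple

class Event(NamedTuple):
    """One interaction (u, v, t) in a CTDG event stream H."""
    u: str      # source node
    v: str      # destination node
    t: float    # continuous-valued timestamp in [0, T]

# A toy history H; repeated edges and simultaneous events are both allowed.
H = [
    Event("a", "b", 0.5),
    Event("a", "c", 1.0),
    Event("b", "c", 1.0),   # same timestamp as the previous event
    Event("a", "b", 2.5),   # the edge (a, b) recurs
]

def undirected_key(u, v):
    """Unique identifier (min(u, v), max(u, v)) of an undirected edge."""
    return (min(u, v), max(u, v))

# Distinct undirected edges observed in H.
edges = {undirected_key(e.u, e.v) for e in H}
```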
To index the collection of all events H in the CTDG, we also introduce some helpful shorthand notations. We use H^(u) to represent the set of all events in H that involve the node u, and H^(u,v) to represent the set of events that involve the edge (u, v). We also define H_t as the subset of all events in H that occur up to a certain time t, i.e. H_t = {(u′, v′, t′) ∈ H : t′ ≤ t}.
Overall, our goal is to better inform the evaluation of Dynamic Link Prediction over CTDGs. We hold off on formally introducing this task until Sec. 4 and first consider a core decision in any machine learning evaluation: how the data is split into training and testing data. A typical assumption in dynamic graphs is that we will only need to make predictions about future events. Hence, the train-test split is commonly determined by a cutoff time t_split that partitions the set of events H into the train set of past, known events H_train = H_{t_split} and the test set of 'future', unknown events H_test = H \ H_train. In what remains of this section, we characterize nodes and edges by whether they are active exclusively in the train set, exclusively in the test set, or in both.
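The cutoff-based split can be sketched in a few lines, assuming events are plain (u, v, t) triples with made-up labels:

```python
def split_history(H, t_split):
    """Partition events into H_train (events up to t_split) and H_test (the rest)."""
    H_train = [(u, v, t) for (u, v, t) in H if t <= t_split]
    H_test  = [(u, v, t) for (u, v, t) in H if t >  t_split]
    return H_train, H_test

# Toy history of four (u, v, t) events.
H = [("a", "b", 0.5), ("a", "c", 1.0), ("b", "c", 1.0), ("a", "b", 2.5)]
H_train, H_test = split_history(H, t_split=1.0)
```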

The Birth and Death of Nodes and Edges
Dynamic graphs evolve over time. A key motivation for our contributions is that many real CTDG datasets only have nodes and edges that interact within a specific timeframe. For instance, in a social network, a new user may join (represented as a node), or a pair of users (i.e. an edge) may cease interacting entirely at a certain time. To formalize these concepts, we introduce the following definitions:

Definition (Birth Time). For any node x = u or edge x = (u, v), the birth time b_x^H is defined as the earliest time at which this node or edge is involved in an event in the history H: b_x^H = min{t : (u′, v′, t) ∈ H involves x}.

Definition (Death Time). For any node x = u or edge x = (u, v), the death time d_x^H is defined as the latest time at which this node or edge is involved in an event in the history H: d_x^H = max{t : (u′, v′, t) ∈ H involves x}.

Note that, by definition, b_x^H ≤ d_x^H. These definitions allow us to capture the lifespan of nodes and edges in dynamic graphs, which is crucial for understanding their behavior and evolution over time.
We argue that the birth and death times of nodes and edges are highly relevant when splitting a CTDG's events into a train set H_train and test set H_test. Such splits are done to assess a model's ability to generalize to unseen data as encountered in real-world applications, but CTDGs typically see nodes and edges reoccur often. In fact, exploiting recurring patterns is an implicit goal of any machine learning task. Previous work [16,33] has hypothesized that it is far easier for any parametrized model to predict if and when an edge occurs in the test set H_test if it has already learned from occurrences of the same edge in the train set H_train. The formal definition of the birth and death times helps to elucidate this assumption. For instance, we can state that the occurrence of an edge in the test set will seem more likely to a model if its birth time b_x^H was before t_split. Likewise, nodes that were already active in the train set, i.e. that were 'born' at time b_x^H < t_split, will be better understood and less surprising than nodes with a birth time b_x^H ≥ t_split. Moreover, the extent to which different methods are capable of accurately predicting previously unseen edges in the test set may vary between these different situations. Understanding such differences may be important for choosing the most appropriate method in a particular application. Therefore, a prudent and useful analysis of (predictions over) a CTDG benefits from partitioning nodes and edges into three categories, which we define here. In all definitions, we denote by t_split the time at which the train-test split is made.

Definition (Historical). A node or edge x is Historical if d_x^H < t_split, i.e. it appears exclusively in the train set.

Definition (Overlap). A node or edge x is Overlap if b_x^H < t_split ≤ d_x^H, i.e. it appears in both the train and test sets.

Definition (Inductive). A node or edge x is Inductive if b_x^H ≥ t_split, i.e. it appears exclusively in the test set.
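These quantities translate directly into code. The following sketch (a toy history with made-up labels) computes birth and death times and the induced three-way partition, using the convention that a birth time b_x^H ≥ t_split makes an object Inductive:

```python
def lifetimes(H):
    """Birth (first) and death (last) interaction time of every node and edge in H."""
    birth, death = {}, {}
    for u, v, t in H:
        edge = (min(u, v), max(u, v))
        for x in (u, v, edge):
            birth[x] = min(birth.get(x, t), t)
            death[x] = max(death.get(x, t), t)
    return birth, death

def category(x, birth, death, t_split):
    """Three-way partition induced by t_split."""
    if birth[x] >= t_split:
        return "Inductive"    # appears only in the test period
    if death[x] < t_split:
        return "Historical"   # appears only in the train period
    return "Overlap"          # active in both periods

# Toy history; split at t_split = 1.5.
H = [("a", "b", 0.5), ("a", "c", 1.0), ("c", "d", 2.0), ("e", "d", 3.0)]
birth, death = lifetimes(H)
```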

The Birth-Death Diagram
For a given cutoff time t_split, the birth and death times of nodes and edges clearly distinguish whether they are Historical, Overlap, or Inductive. By extension, the distribution of their birth and death times characterizes the difficulty of the test set for any cutoff time.
We therefore introduce the Birth-Death diagram: a scatter plot visualization that represents each node or edge by its birth time (on the y-axis) and death time (on the x-axis). Fig. 1 illustrates the Birth-Death diagram on a dataset of face-to-face interactions between high school students. Moreover, in Fig. 2, we plot the Birth-Death diagrams for different CTDG datasets from the benchmark of Poursafaei et al. [33].

Remark 1. For all datasets in our illustration, the cutoff time is determined as the (1 − α)-quantile of the event times, where α is a train-test split ratio set to α = 0.15. In simpler terms, this means that the cutoff time t_split is set to the point in time beyond which 15% of the events occur. Any events that occur before this point in time are included in the training set, while any events that occur after it are included in the test set.
From Fig. 2 we can draw some interesting observations, which we discuss here.
Seasonality of birth and death times. The seasonality of lifespan patterns can be observed in both the HighSchool and MOOC datasets, as the points corresponding to edges and nodes tend to cluster into squares representing days. The UCI dataset, which describes online interactions between students from April to October 2004, also shows seasonality: there is a white stripe during the holiday break (slightly before t_split). This seasonality is a crucial property of the Link Prediction task at hand. In the case of the HighSchool dataset, it means that we observe a few complete days of interactions between the students, and that we try to predict when and who will interact in the following days. For these datasets, carefully representing time using techniques such as time encoding [41] can be crucial to obtain good performance.
Short-lived nodes and edges. The high density of points on the diagonal of these diagrams, in particular in the Wikipedia dataset, indicates that most nodes and edges have very short lifespans. As a result, information learned by models about these instances has a higher chance of becoming obsolete after some time. There, it is crucial that methods take time into account and attend more to recently active edges/nodes. Conversely, however, some nodes and edges have very long lifetimes in the Wikipedia and Flights datasets (their scatter points are in the lower right), so memorization may work well on these.
"Easy" datasets.There are some datasets, namely UNtrade, Enron, CanParl, and HighSchool, where most of the nodes are observed at least once in the train set, with only a few nodes starting interactions in the test set.Memorization heuristics such as Preferential Attachment and EdgeBank will constitute be strong baselines for these datasets.
Finally, it is crucial to note that for user-item graphs such as Wikipedia, MOOC, or LastFM, the Birth-Death diagram will yield different profiles for the user and item nodes. We showcase and discuss these differences in Appendix A.

The Surprise Index
The Birth-Death diagrams suggest that the proportion of Overlap and Inductive nodes and edges is highly dependent on the cutoff time t_split. We formally assess this proportion through the Surprise Index, i.e. the proportion of Inductive nodes/edges in the test set.
Definition. The Node/Edge Surprise Index is defined as the proportion of nodes and edges x in the test set (i.e. with death time d_x^H ≥ t_split) that only appear in the test set (i.e. with birth time b_x^H > t_split). In other words, it is the ratio between the number of Inductive (i.e. having a birth time after t_split) and the number of Inductive or Overlap (i.e. having a death time after t_split) nodes/edges. Mathematically, considering x to range over either the nodes or the edges:

Surprise = |{x : b_x^H > t_split}| / |{x : d_x^H ≥ t_split}|.

Indeed, by definition the Inductive nodes/edges are those whose birth time is after t_split, while objects which are either Inductive or Overlap are those whose death time is after t_split.
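The Surprise Index can be computed directly from birth and death times; a minimal sketch, assuming these are stored in dictionaries keyed by node or edge (toy object names below):

```python
def surprise_index(birth, death, t_split):
    """Fraction of test-set objects that are Inductive:
    |{x : birth[x] > t_split}| / |{x : death[x] >= t_split}|."""
    in_test   = [x for x in death if death[x] >= t_split]
    inductive = [x for x in in_test if birth[x] > t_split]
    return len(inductive) / len(in_test) if in_test else 0.0

# Toy birth/death times for three hypothetical objects (nodes or edges).
birth = {"x1": 0.0, "x2": 0.0, "x3": 4.0}
death = {"x1": 1.0, "x2": 3.0, "x3": 5.0}
s = surprise_index(birth, death, t_split=2.0)  # x2, x3 in test; only x3 is Inductive
```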
Typical ML pipelines learn from the training data by identifying recurring patterns (concepts present in the data) and learning to recognize them. The test data may reproduce these patterns to some extent, along with other signals previously unobserved by the model, which is reflected in the fact that the predictions will not be perfect.
In the case of DLP, we have concrete ways of measuring the amount of information in the test set that is new to the model. The Surprise Index is a natural such measure.
In Figure 3, we present the Node Surprise Index against the Edge Surprise Index for various datasets, with train-test splitting ratios ranging from 0.1 to 0.5. By examining these indices, we can observe some interesting properties.
The Surprise Index is not necessarily monotonic in the size of the test set. It may be intuitive to assume that the Surprise Index increases monotonically with a larger proportion of events included in the test set (i.e., earlier cutoff times t_split). However, this is not always the case. For instance, in the CanParl dataset, increasing the test ratio from 0.3 to 0.4 actually decreases the Edge Surprise Index, while substantially increasing the Node Surprise Index. A similar non-monotonicity can be observed for the Enron dataset, where increasing the test ratio from 0.3 to 0.4 similarly decreases the Edge Surprise Index. This paradox arises from the fact that the Surprise Index is a non-decreasing function of the ratio between the Inductive nodes/edges (those that occur only in the test set) and the Overlap ones (those that occur in both the training and test sets). Thus, if adding more events to the test set increases the number of Overlap objects faster than it increases the number of Inductive objects, the Surprise Index will actually decrease.

The Edge Surprise Index is typically higher than the Node Surprise Index. All datasets evaluated here exhibit curves in the lower right of Figure 3, indicating that the Edge Surprise Index is generally higher than the Node Surprise Index. While this may seem obvious, we stress here that it does not have to be the case. What is clear is that if a given node was never observed during training but starts to interact during testing, then the corresponding edges it forms during testing were necessarily never seen during training. As a consequence, there are always at least as many Inductive edges as there are Inductive nodes. However, the Surprise Index increases with the ratio between the number of Inductive and Overlap nodes/edges, and this ratio does not necessarily have to be larger for edges than for nodes. For instance, suppose that four nodes A, B, C, and D interact with each other during both the training and test periods, resulting in a total of 6 Overlap edges. Now, suppose that a node E starts interacting with A during testing. In this case, the numbers of Inductive nodes and edges are both 1. However, the Node Surprise Index is 1/5, while the Edge Surprise Index is 1/7, which is smaller. In contrast, if E had started interacting with all four nodes, then we would have an Edge Surprise of 4/10, which would be larger than the Node Surprise. Another example would be if two nodes E and F started interacting, but only with each other, during testing. In this case, the Node Surprise Index would be 2/6, while the Edge Surprise Index would be 2/8, which is smaller. As such, it is in itself an interesting pattern that the Edge Surprise Index is generally higher than the Node Surprise Index in all the considered datasets. Reporting both these indices in practice may give a good first overview of the difficulty of the DLP task at hand.

Domain Dependency. The difference in growth rates between the Node and Edge Surprise indices varies widely across datasets. For some datasets, increasing the number of events in the test set will increase the proportion of nodes present in the test set that were not observed in the training set. For instance, in the CanParl dataset, increasing the test ratio from 0.1 to 0.3 increases the Edge Surprise Index from 0.15 to around 0.78, while the Node Surprise Index only undergoes a 0.1 increase. This means that while in both cases most of the nodes will already have been observed in the training set, the number of previously unobserved edges will increase significantly. This illustrates the fact that seemingly small changes in the evaluation setting can have a dramatic impact on the difficulty and type of task at hand. The exact values of the Node and Edge Surprise indices on the datasets from Poursafaei et al. [33] are provided in Table 2.
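The worked example with nodes A through E can be checked numerically. The following sketch recomputes both indices from explicit train and test interaction lists (timestamps are placeholders):

```python
from itertools import combinations

# Train period: the four nodes A, B, C, D all interact pairwise (6 edges).
# Test period: the same 6 pairs interact again, and a new node E contacts A.
train = [(u, v, 0.0) for u, v in combinations("ABCD", 2)]
test  = [(u, v, 1.0) for u, v in combinations("ABCD", 2)] + [("E", "A", 1.0)]

def node_edge_surprise(train, test):
    """Node and Edge Surprise indices computed from explicit event lists."""
    train_nodes = {x for u, v, _ in train for x in (u, v)}
    test_nodes  = {x for u, v, _ in test  for x in (u, v)}
    train_edges = {(min(u, v), max(u, v)) for u, v, _ in train}
    test_edges  = {(min(u, v), max(u, v)) for u, v, _ in test}
    node_s = len(test_nodes - train_nodes) / len(test_nodes)
    edge_s = len(test_edges - train_edges) / len(test_edges)
    return node_s, edge_s

node_s, edge_s = node_edge_surprise(train, test)  # 1/5 and 1/7
```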

Towards More Targeted Negative Sampling Strategies for Dynamic Link Prediction
The Birth-Death diagrams introduced in Section 3 illustrate the partitioning of nodes and edges into distinct categories: Historical, Overlap, and Inductive. Comparing a test event against a negative event involving an edge or nodes from any of these categories presents varying levels of difficulty for the task of discriminating the true event from the negative one. In this section, we operationalize these insights by formally defining Dynamic Link Prediction (DLP) and its connection to Negative Sampling (NS), before laying out a taxonomy of NS strategies targeting different aspects of DLP performance.

Background on Dynamic Link Prediction
Having thoroughly analyzed the nodes and edges in a CTDG, we now formalize the task of Dynamic Link Prediction (DLP).

Definition. The DLP problem is the task of distinguishing positive (true) interactions (u, v, t) from interactions (u′, v′, t) that do not occur at the same time t.
For example, the task at hand can be to predict which two people u and v, at the present time t, are most likely to interact in a social network.
Algorithms for DLP. In practice, DLP algorithms are required to output a score s(u, v, t|H_t) that expresses the likelihood of the event (u, v, t) given the past history of events H_t up to time t (as any future events would be unavailable at that time). When discussing DLP algorithms, it will then be helpful to also define U_t as shorthand notation for the set of nodes interacting up to time t, and E_t for the set of edges. We note that parametric DLP algorithms (e.g. neural networks) will typically output a score s(u, v, t|H_t) that is a function of the past history H_t and of the parameters of the model. Thus, to evaluate the score of a given test event (u, v, t), DLP algorithms are given access to the history up to time t, but are not allowed to update their parameters based on these test events.
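As an illustration of such a score function s(u, v, t|H_t), here is a minimal sketch of a memorization heuristic in the spirit of the EdgeBank baseline mentioned earlier (actual EdgeBank variants differ in their memory window; this only shows the basic idea): score 1 if the queried edge appears in the history strictly before t, else 0.

```python
def edgebank_score(u, v, t, H):
    """Memorization heuristic s(u, v, t | H_t): 1.0 if the undirected edge
    (u, v) was observed strictly before time t in H, else 0.0."""
    key = (min(u, v), max(u, v))
    seen = {(min(a, b), max(a, b)) for a, b, s in H if s < t}
    return 1.0 if key in seen else 0.0

# Toy history with made-up labels.
H = [("a", "b", 0.5), ("a", "c", 1.0)]
```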
Negative Sampling for Evaluation. As already discussed, the goal of DLP algorithms is to accurately score the events (u, v, t), conditioned on the past H_t at time t. Ideally, a perfect model would score any positive event (u, v, t), i.e. an event that actually occurs in the history H, higher than any negative event (u′, v′, t) with (u′, v′) ≠ (u, v), i.e. an event that could have occurred (but did not) at time t. The default strategy would thus fetch all possible edges and calculate their scores jointly with the positive at time t. However, computing the scores for all possible edges scales quadratically with the number of nodes. Even for reasonably sized networks with a few thousand nodes, this renders the exhaustive comparison intractable. Consequently, as is the case in static link prediction, it is common to strongly subsample the set of possible negatives. We now formally define this crucial step of the evaluation.

Definition. A Negative Sampling strategy is a mapping that takes as input a positive event (u, v, t) ∈ H and returns a set of K associated negative events {(u^(k), v^(k), t)}_{k=1,...,K} occurring at the same timestamp t.
Problems with naive Negative Sampling strategies. A straightforward NS strategy is to swap the source u and/or destination node v of the positive event (u, v, t) with other nodes u′ ∈ U and/or v′ ∈ U uniformly at random. Though common, this strategy is naive: the vast majority of possible negatives at time t tends to be unrealistic, for various intuitive reasons. For instance, many edges never occur at all in the graph, and many nodes only interact long before or after time t (i.e. they are probably inactive at this time). Such an unrealistic NS strategy may give an unbiased estimate of the exhaustive performance (obtained through comparison of the positive with all possible edges). However, including trivial negatives in the comparison will steer the accuracy to a value close to 1, rendering it rather uninformative. To obtain accuracy estimates that align better with the actual task at hand, it is therefore important to only consider more challenging negatives. In what follows, we investigate how to do so.

A Taxonomy of Negative Samples
Given that unrealistic NS leads to uninformative evaluation of DLP, how might we generate more useful negative samples?
Clearly, this depends on the task at hand. For example, in the Flights dataset, the goal is to predict which destination a plane at a given origin airport will depart to at a given time. In this case, most edges will have been observed already, and it is more interesting to evaluate whether our model can distinguish the actual origin and destination of the flight from an origin-destination pair that has been previously observed. Therefore, a realistic negative sample could be generated by replacing the actual edge with a previously observed edge.
Similarly, in the case of social networks such as e-mail datasets, the goal may be to predict the receiver of a message emitted by a given sender at a certain time. In this case, one may want to specifically sample a negative edge (u', v', t) such that the destination node v' interacts in both the train and test sets. The intuition behind this choice is that models will tend to naturally assign a higher score to events whose nodes have interacted in the train set. Moreover, the fact that these nodes are also present in the test set indicates that they may still be active at the time of the positive interaction, making the negative interaction a reasonable candidate.
Here we aim to contribute NS strategies that can be applied to any dataset or task. Instead of focusing on specific applications, we present a general taxonomy of potential edges that can be used as negative samples. DLP performance can then be appropriately scrutinized for different types of negatives in any application. Considering that machine learning evaluation typically distinguishes performance on the train set from performance on the test set, we make heavy use of our definition of temporal categories in Sec. 3.2.
Assume any NS strategy starts from a positive event (u, v, t) and then 'corrupts' it into (realistic) negatives {(u^(k), v^(k), t)}_{k=1,...,K}. As discussed previously, the timestamp t is left unchanged, as this is the time at which the prediction is assumed to be made. We can then either corrupt one of the nodes in the event (which we call negative node sampling) or both (which we call negative edge sampling). For both, different strategies can be discerned, as described in the following two definitions.

Definition (Negative Node Sampling). Negative Node Sampling (Node-NS) takes a positive event (u, v, t) and replaces the source node u by u' ∈ U or the destination node v by v' ∈ U. In both cases, the other node is left unchanged.
For a given cutoff time t_split, we distinguish six types of negative node samples, depending on whether the source or the destination node is replaced and on the temporal category of the replacement: Historical Source (HS), Overlap Source (OS), Inductive Source (IS), Historical Destination (HD), Overlap Destination (OD), and Inductive Destination (ID). In undirected graphs, source and destination sampling are equivalent.

Definition (Negative Edge Sampling). Negative Edge Sampling (Edge-NS) takes a positive event (u, v, t) and replaces the edge (u, v) by another edge (u', v') ∈ U × U.
For a given cutoff time t_split, we distinguish three types of negative edge samples: Historical Edge (HE), Overlap Edge (OE), and Inductive Edge (IE), sampled respectively from the sets of edges present exclusively during training, during both training and testing, or exclusively during testing. These strategies lend themselves to a straightforward visual interpretation: on the Birth-Death diagrams (see for instance Fig. 2), HE, OE and IE correspond to sampling the negative edge from the set of blue, orange or green points, respectively.
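Given birth and death times (first and last occurrence of each edge), the three sets that HE, OE and IE draw from can be computed in one pass over the event list. The following is a minimal sketch under the assumption that events are time-ordered; the function and variable names are ours.

```python
def edge_temporal_categories(events, t_split):
    """Partition edges into Historical (train only), Overlap (train and
    test) and Inductive (test only) sets, from which HE-, OE- and IE-NS
    draw their samples. `events` is a time-ordered list of (u, v, t)."""
    birth, death = {}, {}
    for u, v, t in events:
        edge = (u, v)
        birth.setdefault(edge, t)  # first occurrence = birth time
        death[edge] = t            # last occurrence seen so far = death time
    historical = {e for e in birth if death[e] < t_split}
    inductive = {e for e in birth if birth[e] >= t_split}
    overlap = {e for e in birth if birth[e] < t_split <= death[e]}
    return historical, overlap, inductive
```

Sampling a negative for a given positive then amounts to drawing uniformly from the desired set, which corresponds to picking a point of the matching color on the Birth-Death diagram.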
In the rest of this work, we will conduct experiments using the strategies HE, OE, and IE. These strategies enable us to compare the score of a positive event (an interaction that did happen) with the score of an event that occurred in the dataset but at a different time. Furthermore, we will closely investigate the HD, OD, and ID strategies. These are particularly relevant for evaluating many DLP tasks (e.g. recommendation), where we compare the score of an interaction from a given source (such as a user) to the true destination (an item or another user) with the score of a randomly sampled negative destination. The HD, OD, and ID strategies help us understand the impact of choosing a negative destination from different time intervals, a choice that can significantly affect the results of an evaluation.

Experimental Setup
In the remainder of the paper, we conduct extensive experiments in order to validate our evaluation tools and answer two main questions.

Datasets
We selected 5 datasets of various sizes from a recent benchmark [33]. Wikipedia is a dataset of edits of Wikipedia pages recorded over one month. Mooc is a dataset of the online behavior of students interacting with content (items) on a MOOC platform. LastFM is a dataset of interactions between users and the songs they listen to. UCI is a communication network of university students exchanging messages over a social network. Enron is a dataset of e-mails between employees of a company over a period of three years.

Methods
In our experiments, we consider 4 different DLP algorithms. The first two methods are simple parameter-free heuristics, which work by memorizing either the nodes or the edges observed in past events. We describe them here and detail the types of errors they are susceptible to commit:

• Preferential Attachment (PA) assigns a score of 1 to an event (u, v, t) if and only if both nodes have been observed in the past H_t:

    score(u, v, t) = 1[u ∈ H_t] · 1[v ∈ H_t]    (6)

This method issues False Negatives when the true event (u, v, t) is such that either u or v was never observed prior to t, and False Positives when the negative event (u', v', t) is such that both u' and v' were involved in a past event.
• EdgeBank [33] assigns a score of 1 to an event (u, v, t) if and only if the edge (u, v) has been observed in the past H_t:

    score(u, v, t) = 1[(u, v) ∈ H_t]

This method yields False Negatives whenever the true event (u, v, t) is such that the edge (u, v) was never involved in any event up to time t. It produces False Positives whenever the negative event (u', v', t) is such that (u', v') was involved in an event before t.
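Both heuristics reduce to membership tests on a growing memory of observed nodes or edges. The following is a minimal from-scratch sketch of how they could be implemented (our own rendering of the baselines, not the reference implementation):

```python
class PreferentialAttachment:
    """Score 1 iff both endpoints were already observed in the
    history H_t, as described above."""
    def __init__(self):
        self.seen_nodes = set()

    def score(self, u, v):
        return 1.0 if u in self.seen_nodes and v in self.seen_nodes else 0.0

    def update(self, u, v):  # ingest an observed event (u, v)
        self.seen_nodes.update((u, v))

class EdgeBank:
    """Score 1 iff the edge (u, v) itself was already observed."""
    def __init__(self):
        self.seen_edges = set()

    def score(self, u, v):
        return 1.0 if (u, v) in self.seen_edges else 0.0

    def update(self, u, v):
        self.seen_edges.add((u, v))
```

Note that in this directed sketch EdgeBank treats (u, v) and (v, u) as distinct edges; for undirected graphs one would store a canonical ordering of the pair.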
While simple, these baselines are helpful reference points to assess the performance of more sophisticated methods, as they relate directly to the Birth-Death diagrams presented in Sec. 3.3. Indeed, one can think of Preferential Attachment and EdgeBank as follows. If we draw a horizontal line at the y coordinate corresponding to the current time t, PA will assign a score of 1 to any event (u, v, t) such that both u and v are below the line, i.e. have a birth time prior to t: b^u_H < t and b^v_H < t. Similarly, EdgeBank will output a score of 1 for any event such that the edge (u, v) is represented by a point below the horizontal line with y coordinate equal to t (i.e. b^{(u,v)}_H < t). On top of these methods, we consider two memory-based dynamic graph representation learning methods, which we briefly introduce here.
• TGN-attn [14] is a DLP algorithm composed of two main modules. A memory module is responsible for maintaining a node-level memory state, encoding the past events at the node level. Using an attention module, the memory states of different nodes are then combined and subsequently used to calculate the probability of the event (u, v).

• DyRep [37] has a similar architecture, but specifically uses an attention mechanism on the destination node in order to update the memory given an incoming event.

Remark 3. In terms of implementation, we use a common approximation to the memory-based methods discussed above. Instead of updating the memory state (set of observed nodes or edges for PA/EdgeBank, node-level memory states for TGN and DyRep) at every newly observed event, the interactions are consumed in mini-batches of 200 events. For each batch, the models first compute a prediction score for the events in the batch and the associated negative samples, and then update their memory by ingesting the events in that batch.
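The predict-then-update loop of Remark 3 can be sketched as follows. The `score`/`update` interface is a hypothetical one of our own (matching, for instance, the heuristic baselines above), not the actual API of the TGN or DyRep implementations.

```python
def evaluate_in_batches(model, events, sample_negatives, batch_size=200):
    """For each mini-batch: (1) score the positive events and their
    negative samples with the current memory state, and only then
    (2) ingest the batch to update the memory, as in Remark 3."""
    pos_scores, neg_scores = [], []
    for i in range(0, len(events), batch_size):
        batch = events[i:i + batch_size]
        for u, v, t in batch:                        # step 1: predict
            pos_scores.append(model.score(u, v))
            for un, vn, _ in sample_negatives((u, v, t)):
                neg_scores.append(model.score(un, vn))
        for u, v, t in batch:                        # step 2: update memory
            model.update(u, v)
    return pos_scores, neg_scores
```

The key point is the ordering: events in a batch are never used to score themselves, so the approximation only delays memory updates by at most one batch.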

Metrics
Binary Classification. In a first experiment, we view DLP as a binary classification task. For each positive event, we draw a negative event at random, following a specific NS strategy. We thus obtain a list of labeled events, where the label indicates whether the event actually occurred or whether it is a negative sample. Given a threshold on the prediction scores, a confusion matrix such as the one in Table 3 can then be constructed, measuring the number of positive/negative events that get scored higher (positive prediction) or lower (negative prediction) than the threshold. By varying the threshold, we can then draw the corresponding Receiver Operating Characteristic (ROC) curve and compute the associated Area Under the Curve (AUC). Note that this is typically done per batch of events, as the prediction scores may not be comparable across batches corresponding to different time spans. The reported AUC is the average of the AUCs obtained on the different batches.

Ranking. In order to assess which negative edge type is more likely to deceive the model at a given time, in a second experiment we consider ranking as a measure of performance. More precisely, for each positive event we generate a certain number K of negative events, obtained through various NS strategies. We then calculate the rank of the positive as well as of the associated negatives. Thus, for each event (u, v, t) in the test data, we have a list of ranks corresponding to the positive event and to the different NS strategies, say NS1, NS2, etc.
In order to get a regularly sampled time series, we then partition the time interval [0, T] into B = 50 bins I_1, ..., I_B of equal size. For each interval, we calculate the Mean Average Rank (MAR) of all the positives within that interval, and do the same for each NS strategy. As a result, we get, for each event type (positive, NS strategy 1, NS strategy 2, ...), a time series of the ranks of the associated events in the different intervals.
For instance, suppose that the first interval I_1 contains 4 positive events, to each of which we adjoin 2 negative events (coming from 2 different NS strategies NS1 and NS2). Using the DLP algorithm, we obtain scores for each of the positives and negatives. Suppose that, as a result, the ranks of the 4 positive events are 1, 3, 2, 1, the ranks of the negatives obtained using NS1 are 2, 1, 3, 2, and the ranks for NS2 are 3, 2, 1, 3. Then, for this interval, the MAR for the positive, NS1 and NS2 events are 7/4, 2, and 9/4 respectively.
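The per-bin MAR computation can be written in a few lines; the function name and signature below are our own sketch.

```python
def mar_time_series(ranks, times, horizon, n_bins=50):
    """Mean Average Rank per time bin: partition [0, horizon] into
    n_bins equal intervals and average the ranks of the events whose
    timestamp falls in each interval (None for empty bins)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for rank, t in zip(ranks, times):
        b = min(int(n_bins * t / horizon), n_bins - 1)  # clamp t == horizon
        sums[b] += rank
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# The worked example above: 4 positives with ranks 1, 3, 2, 1 all falling
# in the first of two bins give a MAR of 7/4 = 1.75 for that bin.
mar = mar_time_series([1, 3, 2, 1], [0.1, 0.2, 0.3, 0.4],
                      horizon=1.0, n_bins=2)
```

The same function is applied once per event type (positive, NS1, NS2, ...) to produce the rank time series plotted in the results section.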

Results
In this section we discuss our experimental results, with the aim of answering two research questions: assessing the impact of the different NS strategies at the aggregate level (subsection 6.1) and over time (subsection 6.2).

How do performances of DLP algorithms vary over different NS strategies?
In Figure 4a we report the AUC scores against three Edge-NS strategies: Historical Edge, Overlap Edge, and Inductive Edge. Similarly, in Fig. 4b we report AUCs for three Destination-NS strategies: Historical Destination, Overlap Destination, and Inductive Destination, as defined in Section 4.2.
In general, the AUC scores are higher when swapping only the destination node, as in Fig. 4b. This makes sense considering that swapping the destination with a random node may lead to edges that never happened at all, and thus to negative events that are very unlikely. The Overlap Edge and Overlap Destination strategies seem to lead to the lowest scores in general. Indeed, these strategies yield events that strike a trade-off: the models have accumulated enough memory about the associated nodes/edges to assign them a high score, yet the events are sufficiently novel that the models do not yet know how to discriminate them from the positive event. In contrast, for Historical edges and destinations, the involved nodes and edges become relatively obsolete after a certain time, an effect which seems to be picked up by the models.
Starting with the baselines, we note that EdgeBank and Preferential Attachment show very similar performances against Historical and Overlap Edges, with EdgeBank performing worse than random (AUC < 0.5). This can be explained by recalling that, at test time, all Historical and Overlap Edges will have a score of 1, whereas the true event may have a score of 1 or 0 depending on whether the associated edge or nodes were previously observed. In that context, the false positive rate is greater than 0 only if the decision threshold is at least 1 (if it is below 1, all the negatives are predicted negative). When the decision threshold reaches the value 1, the number of false positives jumps to the number of negatives (all the negatives are predicted positive). In particular, the false positive rate will increase faster than the true positive rate, resulting in an AUC score lower than 0.5.
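This worse-than-random effect can be reproduced with a tiny from-scratch AUC computation, using the probabilistic form of the AUC (probability that a random positive outscores a random negative, ties counted as half a win). The numbers below are an illustrative toy example, not taken from the experiments.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs where the
    positive wins, counting ties as half a win."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# EdgeBank against Historical/Overlap negatives: every negative scores 1,
# while a positive scores 1 only if its edge was seen before.
auc = roc_auc([1, 0, 1, 0], [1, 1, 1, 1])  # 0.25 < 0.5
```

Each positive scoring 1 only ties the negatives (contributing 0.5), and each positive scoring 0 always loses, so the AUC falls strictly below 0.5 as soon as any positive is unseen.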
These figures indicate that NS can be defined such that heuristic baselines outperform neural-network-based methods. For instance, in the IE and ID settings, EdgeBank is better than TGN and DyRep on the LastFM dataset. Moreover, on the Enron dataset, PA outperforms the three other methods in the OD setting, and slightly outperforms TGN in the ID and IE settings. This is counter-intuitive when remembering that PA doesn't retain any information about the edges active in the past, but only remembers which nodes were active.

Figure 4: Test AUC results obtained by comparing the scores of the positive events with the scores of the negative events, sampled using specific strategies. For DyRep and TGN, we retrained the models with 5 different seeds and report the mean and the standard deviation of the resulting AUCs.
In general, TGN seems to yield higher scores than DyRep. This makes sense considering that DyRep has been shown to be a special case of TGN. However, there are some exceptions, notably in the Overlap Destination setting, where DyRep is better on LastFM, UCI and Enron.
6.2 Which competing Negative Sampling strategy misleads the model more over time?
In the previous experiment, we observed that using a different NS strategy can dramatically alter the prediction results. However, it is crucial to note that performance also evolves over time in a temporal graph. In this section, we demonstrate how distinguishing different NS techniques can provide more nuanced insight into the performance of each method over time.
Figure 5 shows the performance over time against Edge-NS. For all methods and datasets, we note that the rank of the positive events increases over time while the rank of the other edges decreases. Indeed, as the memory gets filled over time, more events will appear to be likely. In particular, the scores of randomly sampled Historical and Overlap edges will generally go up compared to the score of the positive event.
A general trend is that the rank of Inductive edge/destination NS (green and red lines respectively) increases over the training period, before dropping on the test set. This type of negative sample will eventually be the one that confuses the model at test time, as its rank gets closer and closer to the rank of the positive event after t_split. In general, Historical and Overlap items follow the opposite trend: they become more likely up to t_split, and their rank then starts increasing again either at t_split or shortly before (notably for Historical edges/destinations).
These plots make it clear that the type of error made by the models changes over time. While on the train set Inductive edges are likely to be scored low (less likely), since they have not been observed yet, they become much more likely on the test set. Even during the test period, for instance looking at EdgeBank on the UCI dataset in Fig. 5, shortly after t_split the model tends to rank the positive events lower than the Overlap and Historical edges. After some time, however, the Inductive edge negative samples tend to be ranked similarly to the Historical and Overlap ones.
These visualizations give profound insight into the differences in performance between TGN and DyRep. For instance, on the LastFM dataset, TGN seems to be more consistent in ranking the Inductive Destination high compared to the other destination nodes. However, its overall performance on the test set is not clearly better than DyRep's. On the other hand, in Fig. 6, the performances on the UCI dataset indicate that, while TGN is clearly able to push Historical destinations further away than the Inductive destinations, DyRep tends to assign them similar rankings across the test set.
In both Figures 5 and 6, the examples on the UCI data allow us to visualize the effect of seasonality on performance. Indeed, for both methods, the rank of the positive event peaks just before the split time. This time corresponds roughly to the period of lower activity also observed in Figure 2g.
To conclude, the results shown in Figs. 4, 5 and 6 demonstrate that an overly optimistic AUC may hide a more intricate reality, especially when taking simple baselines as reference points. While more targeted NS strategies help identify settings where heuristics outperform heavier proposed models, plotting the performance over time is important to understand how a model's performance will react to domain-specific changes in the data over time.

Guidelines for Practitioners
Before concluding this study, we propose a series of guidelines to practitioners of DLP, with the hope of further improving the evaluation of this task.
1. One key take-away is that evaluating DLP algorithms requires exploring the performance along several dimensions of the data: at the node, edge and time-interval level.
2. To do so, a good practice for DLP algorithms is to enable saving the list of scores of each positive and negative event. These scores can be the starting point of in-depth evaluations such as the one conducted in this paper. As suggested in [30], this type of standardized output format is critical in order to easily apply diverse evaluation methods while minimizing evaluation error.
3. Birth-Death diagrams are important tools to understand the effect of time-based train-test splitting on the nodes and edges involved in the graph, and may help in hypothesizing the expected performance of a given method on a given dataset.
4. Simple baselines such as EdgeBank and Preferential Attachment are indispensable in order to understand whether a model learns anything non-trivial from the data.
5. Finally, visualizing how the prediction performance evolves over time can be crucial in understanding the strengths and weaknesses of DLP algorithms.

Conclusion
Recent academic efforts have been dedicated to standardizing Dynamic Link Prediction (DLP) as a machine learning task, with the goal of equipping the task with its own evaluation pipelines, baseline methods, and benchmark datasets. However, as a consequence of the high dimensionality of the data and its non-independent, non-identically distributed nature, deriving a single consistent model validation procedure has proven challenging.
In this paper, we have explored several key aspects of these challenges. On the one hand, we have investigated the effect of time-based train-test splitting on the set of nodes and edges through the novel Birth-Death diagrams, and discussed examples of these visualizations on datasets from diverse domains. Moreover, we have shown how to rely on the proposed Birth-Death diagrams to derive more challenging negative samples, based on the hypothesis that the error depends on the NS strategy. To illustrate the effect of these negative samples, we conducted an empirical assessment of their impact on performance. Finally, we compared the relative performance of methods across datasets over time by plotting the prediction ranks as a time series, revealing interesting insights into the failure modes of the different methods.
This work raised several open questions that can be explored in future work.First, our evaluation tools could be used to conduct a more exhaustive comparison of existing heuristics such as the ones introduced in [23], first with simple learning based approaches such as the ones discussed in Sec. 2, and then with more recent representation learning methods, with an emphasis on fair comparison.In terms of visualization, the proposed Birth-Death diagrams could be leveraged to visualize higher order structures such as cliques or triangles, or any repeatedly occurring subgraph that can be uniquely identified.

A Birth-death diagrams on User-Item graphs
The dynamic graphs studied in this paper can be classified into three distinct types:
• Unipartite undirected: for instance, in face-to-face interaction dynamic graphs such as the HighSchool data, there is only one type of node (students), and an interaction between nodes u and v has no direction. Contact, SocialEvo, UNvote and USLegis are examples of such graphs.
• Unipartite directed: for example, in the Enron e-mail dataset, there is only one type of node (users involved in mail exchanges), but the interactions (e-mails) have an orientation: a given user u sends an e-mail to user v. The datasets UCI, UNtrade, Flights and Enron are examples of such graphs.
• Bipartite: in this case there is a clear separation of the nodes into two node types. Typically, in User-Item graphs such as Wikipedia, LastFM, Mooc and Reddit, users are always involved in interactions as senders, while items always receive an interaction (a click, a like, a subscription, etc.).
As mentioned in the main body of the paper, Birth-Death diagrams allow visualizing when nodes and edges interact for the first and last time in the history of events. However, for bipartite datasets, there is a clear separation between the nodes that appear in events as source nodes and those that appear as destination nodes. For these datasets, plotting the birth and death times of users and items separately yields extra information. In Fig. 7, it can be observed, for instance, that in all datasets most of the items are Overlap nodes, in the sense that they are observed in both the train and test sets. In the Mooc dataset, all the items (courses that can be followed by students) are observed at least once during the test period, so there are no Historical destination nodes in that case.
For the LastFM dataset, this is important since, for the item nodes in the lower right, prediction methods have the chance to accumulate a lot of information over time. This contrasts with user nodes, which more commonly appear and disappear within the train period.

B Model Architectures and Hyperparameters used in the experiments
In the main paper we compared the performances of TGN-Attn and DyRep with heuristic baselines. We used the implementation provided in the examples of the open-source TGB library https://github.com/shenyangHuang/TGB/tree/main/examples/linkproppred/tgbl-wiki.
The architecture of the TGN model is the following:

1. The message function is the identity function (same as in the original paper [34]).
2. The aggregation function is the last message aggregator (we keep for each node only the last message received from the batch).
3. The memory updater is a GRU.
4. The embedding module is a Temporal Graph Attention layer. Its purpose is to integrate both the network's connectivity information and temporal data, ensuring that embeddings remain current by combining the memory state with the most recent network information.
5. The edge-level decoder (i.e. the Link Predictor) is a 2-layer MLP with 100 hidden units and ReLU activation.
The DyRep model implemented in the library is very similar to the TGN architecture, with the following main differences:

• The memory updater is a simple RNN.
• The embedding module is the identity: the memory is used directly for prediction.Note that this makes the model vulnerable to the memory staleness problem.
• The message function is calculated using a graph attention module on the destination node.
Hyperparameters. As the goal of our study is not to maximize performance but to compare it across Negative Sampling strategies, we used the default hyperparameter values for both TGN-Attn and DyRep. Thus, the memory, time-encoding and embedding dimensions are all set to 100. We use the Adam optimizer [19] with a learning rate of 1e-4 and a weight decay of 1e-4. To prevent overfitting, we use early stopping on the validation AUC: we stop the training if there has been no improvement in AUC of more than 1e-3 in the last 20 epochs.
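The early-stopping rule just described can be sketched as follows; this is a hypothetical helper of our own, not a function from the TGB library.

```python
def should_stop(val_aucs, patience=20, min_delta=1e-3):
    """Stop training when the best validation AUC of the last
    `patience` epochs improves on the best earlier AUC by no more
    than `min_delta`."""
    if len(val_aucs) <= patience:
        return False  # not enough history yet
    best_before = max(val_aucs[:-patience])
    return max(val_aucs[-patience:]) <= best_before + min_delta
```

Called after every epoch with the list of validation AUCs so far, it implements exactly the "no improvement above 1e-3 in 20 epochs" criterion.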

Figure 1 :
Figure 1: A Birth-Death diagram on a recording of face-to-face interactions between HighSchool students over 9 days [8]. The y and x coordinates for each node/edge represent their first (Birth) and last (Death) interaction times, respectively. Given a cutoff time t_split, while the history of interactions gets divided into a train and a test set, the nodes and edges get partitioned into three categories: Historical, Overlap and Inductive. The Surprise Index is the ratio

Figure 2 :
Figure 2: Birth-Death diagrams for Nodes and Edges in datasets from the Dynamic Graph Benchmark from Poursafaei et al.[33].The datasets are split into train and test sets containing 85% and 15% of the events, respectively.

Figure 3 :
Figure 3: Changing the test-split ratio linearly from 0.1 to 0.5 changes the Node and Edge Surprise Index differently depending on the dataset. The typical test-split ratio of 0.15 is marked with a "*" on the lines.
Test AUC results employing various Destination Negative Sampling strategies: Historical Destination (HD), Overlap Destination (OD), and Inductive Destination (ID). These strategies swap the destination node with one present exclusively during training, during both training and testing, or exclusively during testing, respectively.

Table 1 :
Notation used throughout the paper.
Definition (Historical). A Historical (H) node or edge x only occurs in the train set H_train and never in the test set H_test, i.e. d^x_H < t_split.
Definition (Inductive). An Inductive (I) node or edge x only occurs in the test set H_test and never in the train set H_train, i.e. b^x_H ≥ t_split.
Definition (Overlap). An Overlap (O) node or edge x occurs in both the train set H_train and the test set H_test, i.e. b^x_H < t_split ∧ d^x_H ≥ t_split.

Table 2 :
Dataset statistics, including the Node and Edge Surprise Index, for a test ratio of 15%. For most datasets, nearly all nodes are observed during training, hence a relatively low Node Surprise Index. The Edge Surprise Index is higher, however, since many edges that interact in the test set were never observed in the train set. This highlights the importance of carefully selecting cutoff times to ensure that the evaluation setting accurately reflects the difficulty and type of task at hand.

Table 3 :
Confusion Matrix

Test AUC results employing various Edge Negative Sampling strategies: Historical Edge (HE), Overlap Edge (OE), and Inductive Edge (IE). These strategies involve sampling negative edges (u', v') from the sets of edges exclusively present during training, present during both training and testing, and exclusively present during testing, respectively.