1. Introduction
Combinatorial optimization (CO) on temporal graphs is a rapidly evolving area of research focusing on extending classical combinatorial problems to dynamic settings [1,2]. Traditional approaches often handle these problems in static contexts, where the attributes of the graph remain unchanged [3]. In contrast, real-world applications, such as transportation, social networks, biological and telecommunication networks, require consideration of evolving features over time, adding substantial complexity to the optimization process [4,5].
Among these problems, the Minimum Timeline Cover (MinTCover) has recently gained the attention of researchers, since it represents a particular case of network summarization tasks that involves identifying latent activity intervals for all entities, producing an activity timeline that encompasses the entire network [6,7,8,9,10,11]. The MinTCover problem has been introduced with the goal of identifying crucial time intervals that elucidate significant network events, thus overcoming the main limitations of other temporal network summarization solutions, which, while effective, can be complex and difficult to interpret [12,13,14,15,16,17]. To simplify, research has moved towards the use of activity time intervals to represent interactions between entities [6].
As an example, consider how we can identify emerging skills required by companies through the analysis of job posting graphs. Job postings can be represented as graphs where nodes correspond to skills, and edges connect skills that co-occur within the same posting. By aggregating job postings over time for a specific position, we can construct a temporal graph of skills. For instance, the launch of ChatGPT in November 2022 rapidly transformed the workflows of data scientists in machine learning projects, prompting companies to seek professionals with expertise in Generative AI. 
Figure 1 illustrates this process: on the left, a co-occurrence graph depicts skills as vertices, with edges representing their co-appearance in job postings over time. On the right, a temporal network model visualizes these interactions, showing how timelines of (entity, time-interval) pairs can provide rich insights into significant events. These timelines, highlighted in blue, capture pivotal transitions, such as the initial focus on traditional machine learning skills and the subsequent demand for Generative AI expertise, underscoring the central role of skills in this context. This approach to mapping event-driven timelines is fundamental to addressing the MinTCover problem.
Despite the interpretability and effectiveness of this formulation, studies on its computational and approximation complexity show that MinTCover is NP-hard [6,7,8,9].
For this problem, some approximate and heuristic-based approaches have been proposed [6,10,11,18], whose performance varies with the characteristics of the network considered (e.g., density, number of timestamps, number of nodes). However, no deep learning (DL)-based approaches have been proposed in the literature, despite the proven effectiveness of these solutions in dealing with vertex cover problems. This is partly due to the complexity of modeling the information in a temporal knowledge graph in a representation learning fashion, since algorithms need to take into account both structural and temporal patterns.
Hence, in this paper, we present a DL-based algorithm for solving the MinTCover problem with two main goals: (i) to design a solution that computes better results than approximate or heuristic-based solutions, thus providing evidence of the effectiveness of DL-based approaches for CO over temporal graphs; and (ii) to explore the potential of DL algorithms for the representation and embedding of temporal knowledge graphs, providing insights into how DL techniques can capture and analyze temporal dynamics and integrate evolutionary and time-sensitive information. We therefore propose an approach based on the combination of Graph Neural Networks (GNNs), Transformers, and Pointer Networks (PNs), which builds an initial solution to the problem from a representation of both the structural and temporal dynamics of nodes in temporal graphs. However, since DL by itself cannot guarantee that the solution is valid, the proposed approach also leverages a novel algorithm that adjusts the computed solution and provides theoretical guarantees on the results.
The rest of the paper is organized as follows: Section 2 formulates the MinTCover problem, defining key concepts and the theoretical background necessary for understanding our approach. Section 3 reviews related works, including heuristic and approximate methods as well as recent DL approaches for CO on temporal graphs. In Section 4, we introduce the DLMinTC+ methodology, detailing the DL components of our model and the iterative adjustment algorithm. In Appendix A, we provide the mathematical details of the DL algorithms adopted. Section 5 presents an extensive experimental evaluation, comparing DLMinTC+ to baseline approaches across synthetic and real-world datasets, with a focus on coverage precision and computational efficiency. Finally, Section 6 concludes the paper, summarizing our findings and discussing potential directions for future research.
  2. Problem Formulation
Let $G = (V, E)$ be a temporal graph, where $V$ represents the set of nodes and $E$ denotes the temporal edges. Each edge is characterized by a triple $(u, v, t)$, with $u, v \in V$ and $t$ indicating the timestamp at which the interaction between $u$ and $v$ occurs. We focus on undirected and unweighted temporal graphs.
For any vertex $u \in V$, we define $E(u)$ as the set of temporal edges incident to $u$. Similarly, $N(u, t)$ represents the set of neighbors of $u$ at time $t$, while $T(u)$ captures the set of timestamps where $u$ participates in an edge.
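The degree of a vertex $u$ at timestamp $t$ is defined as $\deg(u, t) = |N(u, t)|$, which quantifies the number of temporal edges involving $u$ at time $t$. We define the density of a temporal graph as the ratio of existing edges to the maximal possible number of edges, expressed as $\frac{2|E|}{|T|\,|V|\,(|V|-1)}$, where $T$ is the set of timestamps. A graph is regarded as sparse when its density is low, that is, when the number of edges is small relative to this maximum.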
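Given two values $s_u$ and $e_u$, where $s_u \le e_u$, we define the activity interval of vertex $u$ as $I_u = [s_u, e_u]$. The set of such intervals for all vertices, denoted as $\mathcal{T} = \{I_u\}_{u \in V}$, forms an activity timeline of $G$. The span of an interval $I_u$ is given by $\sigma(I_u) = e_u - s_u$. We say that a timeline $\mathcal{T}$ covers a temporal graph $G$ if for every edge $(u, v, t) \in E$, the timestamp $t$ belongs to either $I_u$ or $I_v$.
A naive timeline, defined as $I_u = [\min T(u), \max T(u)]$ for every $u \in V$, provides full coverage but may include unnecessarily extended intervals. The objective is thus to determine a timeline minimizing the sum-span, expressed as $S(\mathcal{T}) = \sum_{u \in V} \sigma(I_u)$.
Problem 1 (MinTCover). Given a temporal graph $G = (V, E)$, we identify a timeline $\mathcal{T} = \{I_u\}_{u \in V}$ that covers $G$ while minimizing the sum-span $S(\mathcal{T})$.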
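The definitions above can be made concrete with a small sketch. The helper names (`naive_timeline`, `covers`, `sum_span`) and the toy edge list are illustrative assumptions, not part of the original formulation; intervals are treated as closed, and the span is the end time minus the start time.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]             # (u, v, t): undirected interaction at time t
Timeline = Dict[str, Tuple[int, int]]   # node -> (start, end), closed interval


def naive_timeline(edges: List[Edge]) -> Timeline:
    """Give each node the interval from its first to its last interaction."""
    timeline: Timeline = {}
    for u, v, t in edges:
        for node in (u, v):
            s, e = timeline.get(node, (t, t))
            timeline[node] = (min(s, t), max(e, t))
    return timeline


def covers(timeline: Timeline, edges: List[Edge]) -> bool:
    """True if every edge (u, v, t) has t inside the interval of u or of v."""
    def active(node: str, t: int) -> bool:
        if node not in timeline:
            return False
        s, e = timeline[node]
        return s <= t <= e

    return all(active(u, t) or active(v, t) for u, v, t in edges)


def sum_span(timeline: Timeline) -> int:
    """Sum of interval lengths (end minus start) over all nodes."""
    return sum(e - s for s, e in timeline.values())


# Hypothetical toy instance: the naive timeline covers everything but is long,
# while a hand-picked timeline covers the same edges with a much smaller sum-span.
edges = [("a", "b", 1), ("a", "c", 2), ("b", "c", 5), ("a", "b", 6)]
naive = naive_timeline(edges)
better = {"a": (1, 2), "b": (5, 6), "c": (2, 2)}
print(sum_span(naive), sum_span(better))
```

On this instance both timelines are valid covers, which illustrates why the naive timeline is only an upper bound on the optimal sum-span.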
With the above definition, we can state our goal as follows. Given a temporal graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of temporal edges, we find a parametric function $f$ that maps the input temporal graph $G$ to the set of activity intervals $\mathcal{T}$, denoted as $f(G; \theta) = \mathcal{T}$, where $\theta$ represents the parameters of the function.
Suppose each vertex $u$ can be represented, at each timestamp $t$ where it is active, as a feature vector $x_u^t$ through a parametric function $g$. Let us suppose these representations capture both the temporal and structural properties of the graph relevant to determining the activity intervals, in particular the fact that whether an interval $I_v$ for node $v$ is part of the activity timeline influences whether the interval $I_u$ for node $u$ should be part of it when $u$ and $v$ are adjacent at some timestamp. Thus, for each node, we have a sequence of representations, one per timestamp. Let us call these sequences $X_u$, and $g_{\theta_1}$ the parametric function that maps the input graph $G$ to its representation $X = g_{\theta_1}(G)$, with $\theta_1$ the parameters of the function. For each of these sequences, we want to model the probability that a certain node $u$ is active at time $t$, resulting in an interval $I_u$ that belongs to an activity timeline $\mathcal{T}$. Thus, we need to define another parametric function $h_{\theta_2}$, which predicts the activity interval $\hat{I}_u$ of each node $u$, using the learned parameters $\theta_2$.
We reformulate the problem as follows.
Problem 2 (Minimum Timeline Cover with Least Squares). Given a temporal graph $G = (V, E)$, where $V = \{v_1, \dots, v_n\}$ and $E$ is a set of temporal edges $(u, v, t)$, consider a training pair $(X, \mathcal{T})$, where $X = \{x_u^t\}$ is a set of node embeddings for each node $u$ at each timestamp $t$ generated by the function $g_{\theta_1}$, and $\mathcal{T} = \{I_u\}_{u \in V}$ is the set of observed activity intervals for each node $u$. We define a function

$$f(G) = h_{\theta_2}\big(g_{\theta_1}(G)\big) \tag{1}$$

for which we seek to find the parameters $\theta_1$ and $\theta_2$ that minimize the sum of squared differences between the observed activity intervals $I_u = [s_u, e_u]$ and the predicted ones $\hat{I}_u = [\hat{s}_u, \hat{e}_u]$, across all nodes and timestamps:

$$\min_{\theta_1, \theta_2} \sum_{u \in V} \Big[ (\hat{s}_u - s_u)^2 + (\hat{e}_u - e_u)^2 \Big]$$

subject to the coverage constraint

$$\forall (u, v, t) \in E: \quad t \in \hat{I}_u \ \text{ or } \ t \in \hat{I}_v.$$

By formulating the MinTCover problem as a learning task, we aim to learn a mapping from temporal graphs to activity timelines that first builds an effective temporal node representation and then covers all interactions with a minimal total active duration. This approach allows us to leverage DL techniques to approximate optimal solutions.
  3. Related Works
To address the challenges of the MinTCover problem on temporal graphs, this study explores two complementary areas of research that align with our objectives: heuristic-based methods and DL for CO. These focus areas are motivated by the dual nature of the problem. First, heuristic approaches have been widely studied for CO tasks, offering practical solutions with computational efficiency. However, these methods often struggle with capturing the intricate temporal and structural dynamics of complex, large-scale temporal graphs, leading to suboptimal solutions in terms of accuracy and coverage size. Second, DL has demonstrated significant potential in addressing these limitations in static cases by leveraging its ability to learn from data, model dependencies, and generalize across diverse instances; however, the application of DL approaches to dynamic cases is still little explored, and studies mainly focus on using DL to build suitable temporal representation in various ways. Therefore, our objectives are to (i) evaluate and extend heuristic methods for their strengths in computational efficiency and scalability, and (ii) develop and benchmark a DL-based approach that can achieve superior precision and adaptability in solving the MinTCover problem.
  3.1. Approximate Algorithm and Heuristics
In [6], Rozenshtein et al. introduce an iterative method called Inner to solve MinTCover by first addressing the Coalesce subproblem. This involves defining key time points for each vertex, called inner points, around which activity intervals are constructed to cover all interactions. The authors provide a 2-approximate solution to Coalesce in linear time by formulating an ILP model, relaxing its integer constraints, and iteratively refining the inner points until convergence. They do not establish an approximation factor for MinTCover, but the method computes a feasible solution efficiently.
Dondi et al. [10,18] propose 2Phases, an approximation algorithm. They construct a union graph by aggregating the timestamps of temporal edges. The first phase applies a 2-approximate Minimum Vertex Cover (MVC) algorithm [19] to select activity intervals. The second phase, based on a SetCover approximation [20], uses randomized rounding to determine a Minimum Non-Consecutive Timeline Cover, later transformed into a MinTCover solution.
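To illustrate the kind of building block used in the first phase, the sketch below builds a union graph and applies the textbook maximal-matching 2-approximation for vertex cover. This is an assumption made for illustration: the algorithm cited in [19] and the exact construction used by 2Phases may differ, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def union_graph(temporal_edges: Iterable[Tuple[str, str, int]]) -> Set[Tuple[str, str]]:
    """Drop timestamps and merge parallel interactions into one undirected edge."""
    return {tuple(sorted((u, v))) for u, v, _ in temporal_edges}


def matching_vertex_cover(edges: Set[Tuple[str, str]]) -> Set[str]:
    """Greedy maximal matching: both endpoints of each matched edge join the cover.

    Any vertex cover must contain at least one endpoint of every matched edge,
    so the returned cover is at most twice the size of an optimal one.
    """
    cover: Set[str] = set()
    for u, v in sorted(edges):  # sorted only to make the result deterministic
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


# Hypothetical toy instance: a path a-b-c-d with one repeated interaction.
temporal_edges = [("a", "b", 1), ("b", "a", 3), ("b", "c", 2), ("c", "d", 4)]
union = union_graph(temporal_edges)
cover = matching_vertex_cover(union)
print(sorted(cover))
```

On this path the optimal cover is {b, c}, so the greedy result of four vertices sits exactly at the factor-2 bound.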
Lazzarinetti et al. [11] introduce FastMinTC+, a heuristic method extending an MVC approach. It consists of an initialization phase that selects high-degree vertices for coverage and an iterative refinement phase where low-loss nodes are removed, and new nodes are selected using the Best from Multiple Selection (BMS) heuristic. The process iterates until a minimal timeline with a reduced sum-span is obtained. Their benchmark comparisons show that FastMinTC+ is computationally efficient, consistently outperforming Inner in execution time and proving superior for dense graphs, while 2Phases, despite its theoretical guarantees, struggles with scalability in large instances.
  3.2. Deep Learning Approaches for Combinatorial Optimization
Beyond heuristic and approximate methods, DL approaches have been explored for solving NP-hard problems, yielding promising results [21,22,23]. Most research focuses on CO problems over static graphs [24], where the challenge is capturing the combinatorial structure of the network through spatial features. However, incorporating the temporal dimension significantly increases complexity, making existing DL-based methods ineffective in providing reasonable solutions. This limitation arises because representation learning techniques designed for static graphs struggle to handle temporal dynamics properly [25].
For this reason, in recent years, temporal graph embedding has gained significant attention. Various approaches have been developed, and they can be categorized into several distinct research paradigms.
- Event-Based Temporal Graph Embedding: These methods treat node interactions as timestamped events to capture precise temporal dependencies, as in DyRep [26] and TGAT [27]. They provide high temporal resolution but are computationally demanding, especially for high-frequency events.
- Snapshot-Based Temporal Learning: Temporal graphs are divided into snapshots, with conventional GNNs applied to each one. Models like EvolveGCN [28] and DCRNN [29] work well for gradual changes but require significant memory and may miss intermediate dynamics.
- Self-Supervised Learning and Pre-Training: These methods use automatically generated pseudo-labels for training, reducing the need for labeled data. Examples include TREND [30] and T-GCL [31]. They are adaptable to low-data scenarios but can be sensitive to negative sampling strategies.
- Hybrid Models: Combining components like temporal attention and convolutions, hybrid models capture complex interactions. TMac [32] and the Temporal Graph Collaborative Transformer [33] are flexible but can suffer from high complexity and training time.
- Inductive Learning and Scalability: Focusing on scalability, models like GraphSAGE [34] generate embeddings for new nodes and scale well to large graphs. However, they may lose some global context and require extensive hyperparameter tuning.
- Application-Specific Models: Tailored for tasks like link prediction and anomaly detection, these models, such as JODIE [35] and DySAT [36], perform well for specific applications but have limited generalizability and require domain-specific adjustments.
These approaches demonstrate the diversity and complexity of temporal graph embedding methods, each addressing specific challenges in dynamic network representation.
  4. Deep Learning-Based Minimum Timeline Cover
To solve Problem 2 in a data-driven fashion, we propose integrating DL techniques to derive actionable insights from temporal graph data, aiming to optimize decisions dynamically over time. To compute the function as in Equation (1), we can use a pointer mechanism as described in [37]. The pointer mechanism enables the model to point directly to elements within the input sequence, rather than generating new output tokens. This is achieved by learning a probability distribution over the input positions, effectively guiding the network to select certain elements based on their relevance to the task. As a result, PNs can dynamically adapt their outputs based on the input structure, making them highly effective for problems where the order and relationships between elements are essential, such as in graph-based CO tasks. Thus, the key goal is to transform the temporal graph’s data into a format that a PN can effectively utilize to make informed decisions. This involves embedding the graph data into a numerical form that captures both the structural and temporal dynamics, and then using these embeddings to drive a PN decision-making process. For background on the DL models adopted, please refer to Appendix A.
Among the possible solutions for temporal graph embeddings, we argue that the best choice for solving MinTCover in a data-driven fashion is represented by models that can scale to instances of different sizes and complexities. Moreover, considering that there exists a trade-off between the amount of processed information and scalability, we believe that, to find an activity timeline, it is better to focus on the temporal distribution of nodes within the entire span than to build a timestamped node representation that considers the entire network topology at a given point in time.
According to these assumptions, we designed a DL approach complemented by an iterative procedure to solve the MinTCover problem in a data-driven fashion. The procedure is broken down into the following stages, highlighted in Figure 2:
- Node Representation: Each node is first represented by its degree in the temporal graph at each timestamp, injecting information on the topology of the network (bottom-left part of Figure 2).
- Graph Embedding: The temporal graph is then embedded by encoding node features at each timestamp. This captures the topology of each graph over time, mainly considering local features (bottom-right part of Figure 2).
- Temporal Aggregation: Nodes’ sequences through time are defined and processed to transform the representation of each node in each timestamp considering the entire sequence (top-left part of Figure 2).
- Pointer Mechanism: A pointer mechanism uses these embeddings across time, producing an output that encapsulates the temporal evolution of node features (top-right part of Figure 2).
- Iterative Adjustment: The output of the pointer mechanism includes the interval to be assigned to each node; however, these intervals may not form a valid cover. Thus, an iterative algorithm adjusts the intervals to guarantee that the output is a cover (top part of Figure 2).
This integrated approach leverages the strengths of GNNs, Transformers and PNs. By transforming the graph data with GNNs and Transformer into a state representation that reflects both their structural properties and temporal changes, the PN can select the starting and ending points of each node sequence which correspond to the interval. Finally, an iterative procedure can optimize this interval, providing a valid coverage. In the following, we will refer to this approach as DLMinTC+.
  4.1. Generation of Temporal Node Features
To build the activity timeline in a data-driven fashion, we first need to represent each node of the temporal graph as a sequence of embeddings. To derive this sequence $X_v$ from the temporal graph, we use a two-phase process involving GraphSAGE to obtain node embeddings for each timestamp and a Transformer to aggregate these embeddings over time. Suppose we have a temporal graph represented by a sequence of graphs $G_1, G_2, \dots, G_{|T|}$, where each graph $G_t = (V, E_t)$ has the same set of nodes $V$ but a varying set of edges $E_t$ that changes over time. For each node $v \in V$ and each timestamp $t$, we compute the local degree of node $v$ at timestamp $t$ as follows:

$$d_v^t = \big|\{u \in V : (u, v, t) \in E_t\}\big|.$$
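A minimal sketch of this step is shown below. It assumes the temporal graph is given as a list of `(u, v, t)` triples and that the local degree counts distinct neighbors in the snapshot at time `t`; the function name `local_degrees` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def local_degrees(
    nodes: List[str], temporal_edges: Iterable[Tuple[str, str, int]]
) -> Dict[int, Dict[str, int]]:
    """Return {t: {v: degree of v in the snapshot at time t}}."""
    neighbors = defaultdict(lambda: defaultdict(set))
    for u, v, t in temporal_edges:
        # Undirected graph: record the neighbor on both endpoints.
        neighbors[t][u].add(v)
        neighbors[t][v].add(u)
    # Nodes without interactions at time t get degree 0.
    return {t: {v: len(neighbors[t][v]) for v in nodes} for t in sorted(neighbors)}


# Hypothetical toy instance with two timestamps.
nodes = ["a", "b", "c"]
temporal_edges = [("a", "b", 1), ("a", "c", 1), ("b", "c", 2)]
degrees = local_degrees(nodes, temporal_edges)
print(degrees)
```

Keeping a zero entry for inactive nodes gives every node a feature at every timestamp, so the per-node sequences built later all have the same length.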
  4.2. Application of GraphSAGE
For each timestamp $t$, we apply a GraphSAGE model to obtain node embeddings as described in Appendix A. GraphSAGE aggregates information from neighboring nodes to compute the embedding of node $v$ at timestamp $t$:

$$h_v^{(0),t} = d_v^t, \qquad h_v^{(k),t} = \mathrm{ReLU}\Big(W^{(k)} \cdot \mathrm{AGGREGATE}\big(\{h_u^{(k-1),t} : u \in N(v, t) \cup \{v\}\}\big)\Big),$$

where $d^t$ is the array of node degrees at each timestamp $t$, $h_v^{(k),t}$ is the embedding of node $v$ at layer $k$ and timestamp $t$, $N(v, t)$ is the set of neighboring nodes of $v$, and $W^{(k)}$ are the trainable weights of layer $k$. In this case, we use the ReLU function as the activation function and the mean operator as the AGGREGATE operator, including the node itself in the computation.
The final embedding of node $v$ at timestamp $t$ is given by $h_v^t = h_v^{(K),t}$, where $K$ is the number of GraphSAGE layers.
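The layer described above can be sketched in a few lines of NumPy for a single snapshot. This is an illustration under stated assumptions rather than the model used in the paper: the weights are fixed placeholders instead of trained parameters, the adjacency matrix is dense, and the function names are hypothetical.

```python
import numpy as np


def sage_mean_layer(h: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One layer: ReLU of W applied to the mean of h over each node and its neighbors."""
    a_hat = adj + np.eye(adj.shape[0])                      # include the node itself
    mean_agg = (a_hat @ h) / a_hat.sum(axis=1, keepdims=True)
    return np.maximum(mean_agg @ weight, 0.0)               # ReLU activation


def snapshot_embeddings(adj: np.ndarray, weights: list) -> np.ndarray:
    """Stack K layers, starting from the degree of each node as its only input feature."""
    h = adj.sum(axis=1, keepdims=True).astype(float)        # shape (n, 1)
    for w in weights:
        h = sage_mean_layer(h, adj, w)
    return h


# Hypothetical snapshot: a star with center 0 and leaves 1 and 2.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]])
weights = [np.ones((1, 2)), np.eye(2)]                      # placeholder, untrained weights
emb = snapshot_embeddings(adj, weights)
print(emb)
```

Running the same function on every snapshot yields one embedding per node and timestamp, which is the input needed to build the per-node sequences.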
Once we have obtained the node embeddings for each timestamp t, to actually use a PN to reason over the sequences of these embeddings over time, we need to build this sequence for each node.
  4.3. Temporal Encoding with Transformers
For each node $v$, we construct a sequence of temporal embeddings:

$$H_v = \big[h_v^1, h_v^2, \dots, h_v^{|T|}\big],$$

where $h_v^t \in \mathbb{R}^d$ and $d$ is the dimension of the embedding. This sequence, however, is composed of representations of nodes built considering only static information. No temporal processing has been performed until now. To allow the PN to make informed decisions, this sequence must be reprocessed to embed the temporal evolution of the network in each node representation at each timestamp. Thus, we apply a Transformer with positional encoding to the sequence $H_v$ to obtain the final embedding of node $v$, as described in Appendix A.3.
Since the embeddings $h_v^t$ do not inherently capture the temporal structure of the sequence, we first add temporal positional encodings to each timestamp to encode the order information. The positional encoding $PE(t)$ for a given timestamp $t$ is defined as in Equation (A11). We then add the positional encoding to each input embedding to obtain the temporally encoded sequence:

$$\tilde{h}_v^t = h_v^t + PE(t).$$

The resulting sequence is denoted as

$$\tilde{H}_v = \big[\tilde{h}_v^1, \tilde{h}_v^2, \dots, \tilde{h}_v^{|T|}\big].$$
We then apply a Transformer layer to the temporally encoded sequence $\tilde{H}_v$ to capture temporal dependencies between different timestamps. Each Transformer layer consists of a Multi-Head Self-Attention mechanism followed by a feedforward neural network (FFNN), for which details are provided in Appendix A.1. The self-attention mechanism computes three matrices (Query, Key, and Value) using learned linear transformations:

$$Q = \tilde{H}_v W^Q, \qquad K = \tilde{H}_v W^K, \qquad V = \tilde{H}_v W^V,$$

where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$ are the learned weight matrices, and $Q, K, V \in \mathbb{R}^{|T| \times d_k}$.
These are used to compute the scaled dot-product attention as in Equation (A9). Finally, instead of a single set of $(Q, K, V)$ matrices, the Transformer utilizes multiple attention heads to capture diverse temporal patterns as in Equation (A10).
The output of the multi-head attention layer is passed through an FFNN applied independently to each timestamp, and each sub-layer (self-attention and FFNN) is followed by a residual connection and a layer normalization step. The output of the multi-head attention layer is

$$A_v = \mathrm{LayerNorm}\big(\tilde{H}_v + \mathrm{MultiHead}(Q, K, V)\big),$$

and the final output of the Transformer layer is

$$X_v = \mathrm{LayerNorm}\big(A_v + \mathrm{FFNN}(A_v)\big).$$

The resulting sequence $X_v$ captures the temporal relationships between the timestamps while preserving the dimensionality of the input embeddings. Thus, each element $x_v^t$ is now conditioned on the entire sequence, enabling temporal dependencies to be modeled explicitly.
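The temporal encoding step can be sketched in NumPy as follows. Several simplifications are assumptions made for brevity: a single attention head stands in for the multi-head mechanism, the positional encoding is taken to be the standard sinusoidal one (the form of Equation (A11) is not reproduced here), layer normalization has no learned scale or shift, and all weights are random placeholders.

```python
import numpy as np


def sinusoidal_encoding(length: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding of shape (length, dim); dim must be even."""
    pos = np.arange(length)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(pos * freq)
    enc[:, 1::2] = np.cos(pos * freq)
    return enc


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each row to zero mean and unit variance (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def encoder_layer(h, w_q, w_k, w_v, w_1, w_2):
    """Single-head self-attention and FFNN, each followed by residual + layer norm."""
    x = h + sinusoidal_encoding(*h.shape)            # add order information
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    a = layer_norm(x + attn)                         # attention sub-layer output
    ffnn = np.maximum(a @ w_1, 0.0) @ w_2            # applied per timestamp
    return layer_norm(a + ffnn)


# Hypothetical sequence for one node: 5 timestamps, embedding dimension 4.
rng = np.random.default_rng(0)
h_v = rng.normal(size=(5, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
w_1, w_2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
x_v = encoder_layer(h_v, w_q, w_k, w_v, w_1, w_2)
print(x_v.shape)
```

The output keeps the input shape, so each row can be read as the representation of the node at one timestamp, now dependent on the whole sequence through the attention weights.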
  4.4. Application of the Pointer Network
Once we have constructed the sequence of node embeddings for each node $v$, represented as $X_v$, the next step is to apply the PN to determine the optimal activity intervals for each node based on these embeddings.
The primary goal of the PN in the context of the MinTCover problem is to select, for each node in the temporal graph, the most effective activity interval that covers all relevant interactions within the shortest possible duration. The PN achieves this by processing sequences of embeddings that encapsulate both the structural and temporal dynamics of the graph, thus identifying the optimal start and end points of activity for each node.
The PN processes the sequence of embeddings $X_v$ by applying a softmax pointer layer that computes a probability distribution over the timestamps in the sequence. This distribution is indicative of the likelihood that a particular timestamp is the starting or ending point of the optimal interval for node $v$.
Suppose the output from the Transformer for node $v$ is a sequence of vectors $X_v = [x_v^1, x_v^2, \dots, x_v^{|T|}]$. The PN then applies an attention function $a$ to compute the probabilities:

$$p_v(t) = \frac{\exp\big(a(x_v^t; \theta_a)\big)}{\sum_{t'} \exp\big(a(x_v^{t'}; \theta_a)\big)},$$

where $\theta_a$ are the learnable parameters of the network, and $x_v^t$ is the Transformer output at time $t$.
The final output of the PN is a set of intervals $\hat{I}_v = [\hat{s}_v, \hat{e}_v]$ for each node $v$, where $\hat{s}_v$ and $\hat{e}_v$ are selected based on the highest probabilities from the softmax outputs. These intervals have to be chosen to minimize the overall activity span while ensuring all necessary interactions are covered.
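The selection step can be sketched as follows. The scoring function here is a plain linear map with separate hypothetical weight vectors for start and end, which is an assumption standing in for the attention function of the PN; only the softmax-then-argmax selection mirrors the description above.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax over a 1-D score vector, shifted for numerical stability."""
    e = np.exp(x - x.max())
    return e / e.sum()


def pointer_interval(x_v, w_start, w_end, timestamps):
    """Score every timestamp twice and point to the most likely start and end."""
    p_start = softmax(x_v @ w_start)      # distribution over candidate start times
    p_end = softmax(x_v @ w_end)          # distribution over candidate end times
    start = timestamps[int(np.argmax(p_start))]
    end = timestamps[int(np.argmax(p_end))]
    return (start, end), p_start, p_end


# Hypothetical input: 4 timestamps with one-hot embeddings, so that the
# placeholder weights directly decide which position receives the highest score.
timestamps = [10, 20, 30, 40]
x_v = np.eye(4)
w_start = np.array([0.0, 3.0, 0.0, 0.0])
w_end = np.array([0.0, 0.0, 0.0, 3.0])
interval, p_start, p_end = pointer_interval(x_v, w_start, w_end, timestamps)
print(interval)
```

Nothing in this selection forces the chosen start to precede the chosen end, nor the chosen intervals to cover every edge, so the output still has to be constrained during training and checked afterwards.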
  4.4.1. Custom Loss Function
To train the network so that it also reflects the logic and the constraints of the MinTCover problem, we use a custom loss that balances covering the edges, minimizing the sum-span, and guaranteeing the integrity of the solution. The loss function, as described in Appendix A.1, quantifies the difference between the model’s predictions and the actual target values, serving as a guide for the model to improve during training. By minimizing the loss, the model can learn the patterns in the data more accurately.
The proposed loss function aims to optimize the temporal coverage problem by incorporating three main components: (1) coverage loss, (2) span alignment loss, and (3) a penalty term. Let $G_i = (V_i, E_i)$ denote the $i$-th graph in a batch of graphs, where $V_i$ is the set of nodes and $E_i$ is the set of temporal edges. Here, a batch refers to a subset of data samples processed simultaneously during training in gradient descent algorithms to efficiently estimate gradients and update model parameters. Each edge $(u, v, t) \in E_i$ represents an interaction between nodes $u$ and $v$ at timestamp $t$.
  Coverage Loss
The coverage loss aims to ensure that each temporal edge is adequately covered by the predicted activity intervals of its associated nodes. For an edge $(u, v, t) \in E_i$, let $\hat{s}_u$ and $\hat{e}_u$ denote the start and end times of the predicted interval for node $u$, respectively, and similarly $\hat{s}_v, \hat{e}_v$ for node $v$. The goal is to penalize the model when neither $u$ nor $v$ is active at timestamp $t$.
Mathematically, we define the following indicator functions:

$$c_u(t) = \mathbb{1}\big[\hat{s}_u \le t \le \hat{e}_u\big], \qquad c_v(t) = \mathbb{1}\big[\hat{s}_v \le t \le \hat{e}_v\big],$$

which indicate whether timestamp $t$ is covered by the interval of nodes $u$ and $v$, respectively.
The total coverage loss for the graph $G_i$ is given by summing the coverage loss for each edge over all edges:

$$\mathcal{L}_{\mathrm{cov}}(G_i) = \sum_{(u, v, t) \in E_i} \big(1 - c_u(t)\big)\big(1 - c_v(t)\big).$$

This loss attempts to ensure that most edges are covered by at least one of the associated nodes.
The direct use of indicator functions makes the coverage loss non-differentiable at the boundary points where $t = \hat{s}_u$ or $t = \hat{e}_u$. However, in the current implementation, the indicator function is implicitly relaxed using a ReLU. This piecewise-linear relaxation enables the computation of gradients for the coverage loss component with respect to the start and end times. Thus, the resulting coverage loss is differentiable almost everywhere, except at the points where the relaxation switches between its linear pieces (e.g., at $t = \hat{s}_u$).
  Span Loss
The span loss aims to align the predicted start and end times with the observed activity intervals for each node. Let $s_u$ and $e_u$ denote the observed start and end times of node $u$. The goal is to minimize the difference between the predicted and observed intervals.
The span loss is defined as the mean squared error (MSE) between the predicted and observed intervals for each node ($u \in V_i$):

$$\mathcal{L}_{\mathrm{span}}(G_i) = \frac{1}{|V_i|} \sum_{u \in V_i} \Big[ (\hat{s}_u - s_u)^2 + (\hat{e}_u - e_u)^2 \Big].$$

This loss ensures that the model learns to predict intervals that closely match the observed ones, reducing the discrepancy between the predicted activity spans and the actual activity periods.
Each term in the summation of the span loss is a quadratic function of the start and end times $\hat{s}_u$ and $\hat{e}_u$. Quadratic functions are differentiable everywhere with continuous first and second derivatives. Therefore, the span loss is fully differentiable and provides smooth gradients for optimizing the alignment between the predicted and true intervals.
  Penalty Term
The penalty term is introduced to enforce the constraint that the start time $\hat{s}_u$ should always be less than or equal to the end time $\hat{e}_u$ for each node $u$. This ensures that the predicted intervals are valid.
The penalty term is defined as

$$\mathcal{L}_{\mathrm{pen}}(G_i) = \sum_{u \in V_i} \mathrm{ReLU}\big(\hat{s}_u - \hat{e}_u\big)^2.$$

The squared penalty is zero for valid intervals and grows quadratically with the amount by which $\hat{s}_u$ exceeds $\hat{e}_u$, thereby strongly discouraging invalid intervals.
The ReLU function itself is piecewise linear and non-differentiable at $0$. However, the squared ReLU term $\mathrm{ReLU}(\hat{s}_u - \hat{e}_u)^2$ is differentiable everywhere, including at the point $\hat{s}_u = \hat{e}_u$: its gradient is continuous there, and only the second derivative changes abruptly. The penalty term therefore contributes useful gradients whenever an interval is invalid and keeps the optimization process well behaved at the boundary.
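A short derivation makes the behaviour of the squared ReLU at the boundary precise. The shorthand $x = \hat{s}_u - \hat{e}_u$ is introduced here only for brevity.

```latex
\operatorname{ReLU}(x)^2 =
\begin{cases}
  x^2 & \text{if } x > 0,\\
  0   & \text{if } x \le 0,
\end{cases}
\qquad
\frac{d}{dx}\operatorname{ReLU}(x)^2 =
\begin{cases}
  2x & \text{if } x > 0,\\
  0  & \text{if } x \le 0
\end{cases}
= 2\operatorname{ReLU}(x).
```

Both one-sided derivatives at $x = 0$ equal $0$, so the first derivative exists and is continuous everywhere, while the second derivative jumps from $0$ to $2$ at $x = 0$.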
  Total Loss for a Single Graph
The overall loss for a single graph $G_i$ can be computed as a weighted combination of the coverage loss, span loss, and the penalty term. To stabilize the gradients and avoid issues with small loss values, we apply a logarithmic transformation with a small constant $\epsilon$ to each component of the loss. The logarithmic transformation is differentiable for all positive values of its argument and provides smooth gradients. The inclusion of $\epsilon$ ensures that the argument inside the logarithm is strictly positive, so the transformation remains well defined even when a component is zero.
The final loss is given by

$$\mathcal{L}(G_i) = \alpha \log\big(\mathcal{L}_{\mathrm{cov}}(G_i) + \epsilon\big) + \beta \log\big(\mathcal{L}_{\mathrm{span}}(G_i) + \epsilon\big) + \gamma \log\big(\mathcal{L}_{\mathrm{pen}}(G_i) + \epsilon\big).$$

The final loss function is a weighted sum of the adjusted logarithm of the coverage loss, span loss, and penalty term, each of which is differentiable at least almost everywhere. Since a weighted sum of such functions keeps this property, the entire loss function is differentiable almost everywhere, enabling the use of gradient-based optimization techniques.
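The three components and their logarithmic combination can be sketched in NumPy as below. The ReLU-based coverage relaxation used here (distance of the timestamp from the nearer of the two intervals) is an assumption, since the exact relaxation is not spelled out in the text; the weight names and default values are likewise placeholders, and nodes are indexed by integers for convenience.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def total_loss(pred, obs, edges, weights=(1.0, 1.0, 1.0), eps=1e-8):
    """pred/obs: float arrays of shape (n, 2) with [start, end] per node;
    edges: list of (u, v, t) with integer node indices."""
    s, e = pred[:, 0], pred[:, 1]
    u, v, t = (np.array(col) for col in zip(*edges))
    # Distance of t from each interval: zero when the node is active at t.
    gap_u = relu(s[u] - t) + relu(t - e[u])
    gap_v = relu(s[v] - t) + relu(t - e[v])
    coverage = np.sum(np.minimum(gap_u, gap_v))   # zero iff every edge is covered
    span = np.mean((pred - obs) ** 2)             # MSE against observed intervals
    penalty = np.sum(relu(s - e) ** 2)            # positive only when start > end
    a, b, c = weights
    return a * np.log(coverage + eps) + b * np.log(span + eps) + c * np.log(penalty + eps)


# Hypothetical instance with three nodes (0, 1, 2) and four temporal edges.
edges = [(0, 1, 1), (0, 2, 2), (1, 2, 5), (0, 1, 6)]
obs = np.array([[1.0, 2.0], [5.0, 6.0], [2.0, 2.0]])
good = obs.copy()                                  # valid cover matching the targets
bad = np.array([[3.0, 2.0], [5.0, 6.0], [2.0, 2.0]])  # node 0 has start > end
loss_good = total_loss(good, obs, edges)
loss_bad = total_loss(bad, obs, edges)
print(loss_good, loss_bad)
```

When all three components vanish, the loss bottoms out at the sum of the weights times `log(eps)`, which shows how the small constant sets the floor of the objective.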
  Hyperparameter Sensitivity
The choice of weights $\alpha$, $\beta$, and $\gamma$ is critical for balancing the contributions of each component. For instance, setting a higher value of $\alpha$ would prioritize edge coverage, potentially leading to larger intervals, while increasing $\beta$ would focus more on accurately aligning intervals with the observed ones. Similarly, $\gamma$ controls the strictness of the $\hat{s}_u \le \hat{e}_u$ constraint, ensuring valid interval predictions.
  4.5. Iterative Algorithm
Even when the model is properly trained, the activity intervals $\hat{I}_u$ produced by the DL approach may be neither minimal nor valid activity intervals for MinTCover. To guarantee that the output is a valid set of activity intervals, we suggest adjusting these intervals minimally to ensure that all temporal edges are covered. We thus introduce an algorithm designed to achieve this by minimally extending node intervals where necessary.
  4.5.1. Algorithm Description
The algorithm takes the initial intervals from the DL model and iteratively adjusts them to cover all temporal edges. For each uncovered edge, it decides which node’s interval to adjust based on the minimal increase in the total sum-span $S(\mathcal{T})$.
Given a temporal graph $G = (V, E)$, an activity interval $I_u = [s_u, e_u]$ for each node $u \in V$, the total sum-span $S(\mathcal{T})$ of all activity intervals, and a set of uncovered edges $U \subseteq E$, the algorithm performs the following steps:
1. Initialization: Copy the initial intervals $I_u$ and initialize the set of uncovered edges $U$ as empty.
2. Edge Checking: For each edge $(u, v, t) \in E$, check if $t \in I_u$ or $t \in I_v$. If not, add the edge to $U$.
3. Interval Adjustment: For each edge $(u, v, t) \in U$, calculate the minimal increase required to adjust $I_u$ and $I_v$ to include $t$.
4. Update Intervals: Adjust the interval of the node with the minimal increase to include $t$.
5. Iteration: Repeat steps 3 and 4 until all edges in $U$ are covered.
6. Output: Return the adjusted intervals $I'_u$ that now cover all edges in $E$.
The algorithm is detailed in Algorithm 1.
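The steps listed above can be sketched in Python as follows. This is an illustrative reading of the description, not a transcription of Algorithm 1: the tie-breaking rule (prefer the first endpoint), the re-check that skips edges already covered by an earlier extension, and the assumptions that every node has an initial interval with start at most end are choices made here, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]
Timeline = Dict[str, Tuple[int, int]]


def extension_cost(interval: Tuple[int, int], t: int) -> int:
    """Increase in span needed for the interval to include t (0 if already inside)."""
    s, e = interval
    return max(s - t, 0) + max(t - e, 0)


def extend(interval: Tuple[int, int], t: int) -> Tuple[int, int]:
    """Smallest interval containing both the original interval and t."""
    s, e = interval
    return (min(s, t), max(e, t))


def complete_timeline(initial: Timeline, edges: List[Edge]) -> Timeline:
    """Extend intervals just enough that every temporal edge is covered."""
    timeline = dict(initial)                         # step 1: copy the intervals

    def covered(u: str, v: str, t: int) -> bool:
        return extension_cost(timeline[u], t) == 0 or extension_cost(timeline[v], t) == 0

    uncovered = [edge for edge in edges if not covered(*edge)]   # step 2
    for u, v, t in uncovered:                        # steps 3-5
        if covered(u, v, t):
            continue                                 # fixed by an earlier extension
        cost_u = extension_cost(timeline[u], t)
        cost_v = extension_cost(timeline[v], t)
        node = u if cost_u <= cost_v else v          # cheaper extension wins
        timeline[node] = extend(timeline[node], t)
    return timeline                                  # step 6


# Hypothetical initial intervals that miss the edge ("a", "b", 6).
edges = [("a", "b", 1), ("a", "c", 2), ("b", "c", 5), ("a", "b", 6)]
initial = {"a": (1, 2), "b": (5, 5), "c": (3, 3)}
adjusted = complete_timeline(initial, edges)
print(adjusted)
```

In this example, extending the interval of `b` by one unit is cheaper than extending that of `a` by four, so only `b` changes and the sum-span grows by one.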
  4.5.2. Complexity Analysis
The efficiency of the algorithm is crucial, particularly when dealing with large-scale temporal networks. To analyze the algorithm’s complexity, let us define the following parameters: , , and , where n is the number of nodes, m is the number of edges, and t is the number of timestamps in the network. We can break down the time complexity as follows.
The algorithm begins by copying intervals and initializing a set U, which involves operations on each node in the graph. Consequently, this step has a time complexity of . Then, an edge-checking loop is performed. This loop iterates over all edges in the graph to perform a constant-time check on each. Since the loop covers every edge exactly once, its time complexity is . An interval adjustment loop follows (Lines 7–26). In this step, the algorithm iterates over the edges stored in the set U. In the worst case, U could contain all the edges in the graph, i.e., . Each iteration involves a series of constant-time operations, making the overall time complexity of this loop .
The total time complexity of the algorithm is dominated by the edge-checking and interval adjustment loops, both of which scale linearly with the number of edges. Thus, the final time complexity is $O(n + m)$, which is linear in the number of edges whenever $m \geq n$.
          
Algorithm 1: Minimal timeline completion (pseudocode shown as a figure in the original).
  4.5.3. Minimal Timeline Completion Validity
Theorem 1 shows that the output computed by Algorithm 1 is a valid solution (i.e., it covers all temporal edges) obtained with a locally minimal increase (i.e., shrinking any of the extended intervals yields a solution that is no longer valid).
Theorem 1.  Given any initial set of activity intervals and a temporal graph with temporal edge set E, the Minimal Timeline Completion algorithm produces adjusted intervals that meet the following criteria:
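- They cover all temporal edges in E, i.e., for every edge in E with endpoints u and v and timestamp t, t lies in the adjusted interval of u or in the adjusted interval of v. 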
- They are locally minimal with respect to sum-span increase, that is, no final interval can be reduced without leaving at least one temporal edge uncovered. Moreover, each incremental expansion chosen by the algorithm was the minimal possible increase at the time it was made. 
 The proof of Theorem 1 can be carried out by showing first of all that the algorithm produces intervals that cover all edges, and then showing that the increased sum-span produced is minimal, i.e., any further reduction in the increased intervals would result in uncovered edges.
Proof.  Part 1: The algorithm produces intervals that cover all edges. The algorithm starts from an initial set of intervals and identifies the set U of all uncovered edges, i.e., the edges in E whose timestamp lies in neither endpoint’s interval.
For each edge in U, with endpoints u and v and timestamp t, the algorithm determines the minimal increase needed to extend the interval of either u or v so that t falls within it. One of these intervals is then extended just enough to cover t. As a result, this edge is no longer uncovered.
Since the algorithm processes each uncovered edge in U in this manner, every edge initially in U is covered by at least one updated interval after these adjustments. Thus, when no uncovered edges remain, every edge in E is covered by the interval of at least one of its endpoints. Hence, the final set of intervals covers all temporal edges.
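Part 2: Local minimality of the total sum-span increase. Consider an uncovered edge with endpoints u and v and timestamp t. To cover it, the algorithm calculates the increase required to include t in the interval of u and the increase required to include t in the interval of v.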
The algorithm picks the endpoint with the minimal increase and adjusts its interval accordingly. This choice ensures that for each uncovered edge, the increase in sum-span is as small as possible at the moment of adjustment.
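As a worked form of this step, assume that the activity interval of a node x is written $[s_x, e_x]$ (this notation is introduced here for illustration only). The increase needed for x to cover timestamp t can then be expressed as:

```latex
\Delta_x(t) = \max\{\,0,\; s_x - t,\; t - e_x\,\}, \qquad x \in \{u, v\},
```

and the endpoint that gets extended is the one with the smaller $\Delta_x(t)$. For example, with intervals $[2, 4]$ for u and $[5, 6]$ for v and an edge at $t = 7$, we get $\Delta_u(7) = 3$ and $\Delta_v(7) = 1$, so the interval of v is extended to $[5, 7]$.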
Now consider the final solution obtained after all uncovered edges are addressed. Suppose we attempt to reduce the sum-span of this solution by shrinking one of the extended intervals. Any attempt to lower its end point or raise its start point may exclude a timestamp of an edge that was previously uncovered and is now covered only by that interval. Because the algorithm always adjusted intervals minimally, there is no “slack” that can be removed without losing coverage of at least one temporal edge.
This implies the solution is locally minimal: there is no unilateral reduction in the updated intervals that can decrease the total sum-span without uncovering an edge. Furthermore, at each individual step of covering an edge, the chosen increment was minimal. Hence, for each uncovered edge at the time it was covered, no smaller incremental adjustment could have achieved coverage.
Part 3: No further reductions are possible without losing coverage. Since each interval extension was chosen to be the minimal required to cover a newly uncovered edge, removing or shrinking any part of these extensions would exclude at least one such edge. Thus, any attempt to reduce the total sum-span after the algorithm completes would necessarily leave some edge uncovered. Therefore, the intervals in  cannot be reduced without losing coverage, confirming the local minimality of the solution.    □
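The two properties in Theorem 1 can also be checked numerically on a given instance. The sketch below is illustrative only. It assumes integer timestamps, closed `(start, end)` intervals, and a span of `end - start` per interval. The helper names `is_valid`, `sum_span`, and `shrinkable_nodes` are hypothetical. With integer timestamps, trimming one time unit from either end is a sufficient test, because any larger valid reduction implies that the one-unit reduction is valid too.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]   # (start, end), both inclusive (assumed convention)
Edge = Tuple[str, str, int]  # (u, v, t)


def covers(interval: Interval, t: int) -> bool:
    start, end = interval
    return start <= t <= end


def is_valid(intervals: Dict[str, Interval], edges: List[Edge]) -> bool:
    """True if every edge is covered by the interval of at least one endpoint."""
    return all(covers(intervals[u], t) or covers(intervals[v], t) for u, v, t in edges)


def sum_span(intervals: Dict[str, Interval]) -> int:
    """Total span, taking the span of one interval as end - start (assumption)."""
    return sum(end - start for start, end in intervals.values())


def shrinkable_nodes(intervals: Dict[str, Interval], edges: List[Edge]) -> List[str]:
    """Nodes whose interval can lose one time unit at either end and stay valid."""
    found = []
    for node, (start, end) in intervals.items():
        if start == end:
            continue  # single-point interval: nothing left to trim
        for candidate in ((start + 1, end), (start, end - 1)):
            trial = dict(intervals)
            trial[node] = candidate
            if is_valid(trial, edges):
                found.append(node)
                break
    return found


edges = [("a", "c", 2), ("a", "c", 4), ("a", "b", 7), ("b", "c", 3)]
tight = {"a": (2, 4), "b": (3, 7), "c": (1, 1)}
loose = {"a": (2, 4), "b": (3, 9), "c": (1, 1)}
print(is_valid(tight, edges), sum_span(tight), shrinkable_nodes(tight, edges))
# True 6 []
print(shrinkable_nodes(loose, edges))
# ['b']
```

An empty list means that no interval can be reduced without uncovering an edge, which is the local minimality condition stated in the theorem.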
   5. Experimental Evaluation
In this section, we provide the results of our experiments on the proposed DLMinTC+ methodology. Given the benchmark results provided in [11], which show that on average FastMinTC+ is the best heuristic among those proposed in the literature, we evaluate our model against FastMinTC+ on the same datasets, i.e., Dataset1 composed of synthetic sparse instances of different sizes, Dataset2 composed of synthetic dense instances of different sizes, and Dataset3 composed of real-world instances from the DIMACS dataset [38].
We recall that, for a more granular analysis, instances of the synthetic datasets are categorized as small (up to 50 vertices and 20 timestamps), medium (up to 500 vertices and 500 timestamps), and hard (up to 10,000 vertices and 5000 timestamps). The final goal is to understand the potential of DL-based approaches for temporal CO problems, relative to heuristic approaches, on both sparse and dense instances.
  5.1. Experimental Settings
Experiments were carried out on a MacBook Pro (2017) running macOS, with 16 GB of RAM and 4 cores of a 3.1 GHz i7 CPU.
For the FastMinTC+ algorithm, we set the parameter  for the BMS heuristic (k represents the number of iterations performed by the BMS heuristic), and we execute the exchange step for 2000 iterations and the overall algorithm 5 times, each with a different shuffling of the edges as suggested by the authors.
As far as 
DLMinTC+’s hyperparameters are concerned, the network has been trained for 50 epochs, with a batch size 
. To optimize the custom loss function throughout the entire network, we used Adam [
39] with an initial learning rate 
 (please refer to 
Appendix A.1 for further details on gradient descent and Adam as optimization strategies for loss computation). To optimize the performance of the model, we performed hyperparameter tuning using a grid search strategy on a validation set composed of 1000 random graphs. 
Table 1 shows the parameters tuned and selected from the tested values. Training took approximately 9 h to complete the 50 epochs, with an average consumption of 
 GB of memory during training. To minimize memory overhead in the GraphSAGE and Transformer modules, we use sparse matrix representations, which inherently benefit from sparse connectivity in large temporal graphs.
To train the neural model, we crafted a synthetic dataset composed of randomly generated graphs of different sizes, densities and spans. While synthetic datasets offer flexibility and control, they may not capture all structural and temporal properties of real-world networks, and they may omit complexities present in real-world graphs, such as noise or irregular patterns, thus introducing bias. To mitigate these drawbacks, we generated synthetic graphs of varying sizes and densities to mimic real-world scenarios. Moreover, we tested the model on both synthetic and real-world graphs, to verify that the results obtained on synthetic data are comparable to those obtained on real-world graphs. The final dataset is composed of 5000 graphs for training and 1000 graphs for validation, each with a random number of nodes between 10 and 10,000, a random number of edges between 0.1 and 10 times the number of nodes, and a random number of timestamps between 1 and 30,000. The upper bound of 30,000 is the maximum sequence length allowed, which means that, by itself, the network cannot process temporal graphs with more than 30,000 timestamps. However, since many temporal graphs have more timestamps, and since the output of the network is in any case not guaranteed to be a valid cover, we split any temporal graph with more than 30,000 timestamps into sub-temporal graphs with a maximum span of 30,000, process each sub-graph independently, and merge the produced intervals, which are then passed to the final iterative algorithm. This graph partitioning technique not only extends the approach to graphs of any size, but also reduces the memory requirements of the DL-based approach.
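The partitioning described above can be sketched as follows. This is only one possible reading of the text. The window boundaries (consecutive, fixed-length windows starting at the earliest timestamp) and the merge rule (earliest start and latest end per node) are assumptions, since the text does not specify them, and the function names `split_by_span` and `merge_intervals` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (u, v, t)
Interval = Tuple[int, int]   # (start, end), both inclusive (assumed convention)


def split_by_span(edges: List[Edge], max_span: int = 30_000) -> List[List[Edge]]:
    """Group edges into consecutive windows of at most max_span timestamps each."""
    if not edges:
        return []
    t0 = min(t for _, _, t in edges)
    windows: Dict[int, List[Edge]] = defaultdict(list)
    for u, v, t in edges:
        windows[(t - t0) // max_span].append((u, v, t))
    return [windows[k] for k in sorted(windows)]


def merge_intervals(per_window: List[Dict[str, Interval]]) -> Dict[str, Interval]:
    """Merge per-window intervals into one interval per node.

    Assumed rule: keep the earliest start and the latest end seen for the node.
    """
    merged: Dict[str, Interval] = {}
    for intervals in per_window:
        for node, (start, end) in intervals.items():
            if node in merged:
                old_start, old_end = merged[node]
                merged[node] = (min(old_start, start), max(old_end, end))
            else:
                merged[node] = (start, end)
    return merged


edges = [("a", "b", 5), ("a", "c", 12), ("b", "c", 27)]
print(split_by_span(edges, max_span=10))
# [[('a', 'b', 5), ('a', 'c', 12)], [('b', 'c', 27)]]
print(merge_intervals([{"a": (5, 12)}, {"a": (20, 22), "b": (27, 27)}]))
# {'a': (5, 22), 'b': (27, 27)}
```

Under this merge rule a node that is active in distant windows ends up with one long interval, so the merged result is best seen as a starting point for the final iterative algorithm rather than as a finished cover.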
Since computing exact solutions for such graphs is infeasible and no suitable dataset with graphs and corresponding ground truth exists in the literature, we constructed the observed values for training using state-of-the-art heuristics. Specifically, we employed both FastMinTC+ and Inner, the best-performing heuristics across different network configurations (FastMinTC+ generally excels on higher-density instances, while Inner is often superior for lower-density instances). For each graph, we generated solutions using both heuristics and selected the one with the smaller sum-span. Although it is not possible to train the network on optimal solutions, this approach enables the model to learn to approximate the best available solutions regardless of the network’s characteristics, thus taking the best of both state-of-the-art heuristics. Provided that the training is executed correctly, the resulting model should, barring errors, produce coverage results as close as possible to those achieved by the best-performing state-of-the-art heuristics.
To ensure that, by the end of the training phase, the network had effectively learned to construct meaningful coverage, we tested its ability to replicate the coverage generated by the heuristics. We conducted this evaluation with the understanding that the output produced by the network might be neither minimal nor entirely valid, aiming instead to refine the result further using the proposed iterative algorithm. Thus, we measured the performance using the classical metrics for classification problems: accuracy, which measures how often the model correctly predicts the outcome; precision, which measures the fraction of positive predictions that are correct; and recall, which measures the fraction of actual positive samples in the dataset that the model correctly identifies. In this case, however, since the output of the network is an interval for each node, we convert each interval to a sequence of 0s and 1s, one per timestamp, indicating whether the node is active at that timestamp or not, and we treat the evaluation as a binary classification problem. For this evaluation, we used a synthetically generated test dataset of 500 graphs similar to the training and validation datasets. Counting each node for each timestamp in this dataset results in a test case of more than 
 nodes. The results are reported in 
Table 2. Considering the size of the support, the training results, with an overall accuracy of 72.13%, while not extremely high, represent a strong performance given the scale of the test cases. These results also indicate both the stability and the sufficient generalization of the model, highlighting its robustness across a substantial dataset.
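The interval-to-binary conversion used for this evaluation can be illustrated with a short sketch. It assumes closed `(start, end)` intervals over an explicit list of timestamps and treats the heuristic solution as the reference labels. The function names `to_mask` and `interval_metrics` are hypothetical and do not come from the original implementation.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive (assumed convention)


def to_mask(interval: Interval, timestamps: List[int]) -> List[int]:
    """1 where the node is active at the timestamp, 0 otherwise."""
    start, end = interval
    return [1 if start <= t <= end else 0 for t in timestamps]


def interval_metrics(
    predicted: Dict[str, Interval],
    reference: Dict[str, Interval],
    timestamps: List[int],
) -> Dict[str, float]:
    """Accuracy, precision and recall over all (node, timestamp) pairs."""
    tp = fp = fn = tn = 0
    for node in reference:
        pred_mask = to_mask(predicted[node], timestamps)
        ref_mask = to_mask(reference[node], timestamps)
        for p, r in zip(pred_mask, ref_mask):
            if p and r:
                tp += 1
            elif p:
                fp += 1
            elif r:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


timestamps = [1, 2, 3, 4, 5]
reference = {"a": (2, 4), "b": (1, 2)}  # e.g., intervals from the best heuristic
predicted = {"a": (3, 5), "b": (1, 2)}  # e.g., intervals produced by the network
print(interval_metrics(predicted, reference, timestamps))
# {'accuracy': 0.8, 'precision': 0.8, 'recall': 0.8}
```

Each (node, timestamp) pair counts as one sample, which is why the support grows with both the number of nodes and the number of timestamps.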
  5.2. Experimental Results
The comprehensive results for Dataset1, consisting of sparse graphs, and for Dataset2, consisting of dense graphs, in terms of sum-span and execution time are reported in Figure 3. DLMinTC+ outperforms FastMinTC+ in terms of sum-span on both sparse and dense instances, at the cost of a longer execution time. On Dataset1, it achieves an average reduction of 9.3% in coverage size compared to FastMinTC+, demonstrating the model’s better ability to learn temporal patterns that reduce the overall coverage size; the execution time increases by 29% because of the added computational complexity of the DL model. Similarly, on Dataset2, DLMinTC+ continues to outperform FastMinTC+ in terms of coverage size, with an average reduction of 7.5%. This highlights the model’s effectiveness in capturing complex structures and interactions in dense temporal graphs as well. However, the execution time shows a more noticeable increase, averaging 38%, due to the greater complexity of the graphs, which requires more iterations and calculations to produce the activity intervals.
As we can see in Table 3, the increase in execution time comes with a corresponding increase in memory consumption on both Dataset1 and Dataset2. This is due to the complexity of the trained neural model, which, even though it does not require memory to compute backpropagation at inference time, needs memory to load the entire graph.
Experiments conducted on publicly available graphs corroborate the findings observed on the synthetic test instances. Detailed results for these real-world instances are presented in Table 4, Table 5 and Table 6. For each graph considered, Table 4 provides the results in terms of sum-span, Table 5 provides the results in terms of execution time, and Table 6 provides the results in terms of memory consumption.
In this scenario, DLMinTC+ shows variable behavior depending on the graph structure. Figure 4 shows the details of the comparison with respect to sum-span percentage reduction and execution time percentage increment. The model achieves an average reduction in sum-span of 13.4% compared to FastMinTC+, with some cases reaching a reduction of around 50% and others limited to around 2–3%. This clearly depends on the size and complexity of the network analyzed and on the behavior of FastMinTC+. Indeed, a closer look at Figure 4 shows that the highest improvement is obtained on graphs with smaller densities, which in general correspond to graphs with a larger number of timestamps. This means that the proposed approach scales better to larger instances with complex temporal structures. The variability in sum-span reduction is also reflected in the execution time, with an increase of around 50% on average. This is because the baseline heuristic processes simpler graphs more efficiently, whereas the DL-based approach incurs additional computational overhead even for less complex structures.
Thus, overall, DLMinTC+ demonstrates superior performance in terms of sum-span compared to FastMinTC+ on all graphs. Specifically, in the analysis of 25 real-world instances, DLMinTC+ achieves a better sum-span in all cases. Notably, a comparison with the results presented in [11] shows that DLMinTC+ outperforms not only FastMinTC+ but also the Inner and 2Phases algorithms on all instances. Note that the improvements achieved can be translated into tangible benefits in various real-life scenarios. As an example, consider the case of telecommunication networks. In temporal graphs representing communication events, a 10% reduction in the sum-span means a significant reduction in the time and resources needed to monitor and maintain network activity logs. For large networks with millions of interactions, this can reduce storage requirements by gigabytes, ensuring faster query responses and lower costs. In the case of transportation scheduling, instead, a 10% improvement in the coverage size of dynamic transportation networks (e.g., bus or train schedules) translates to fewer overlapping intervals when monitoring vehicle activity. This could result in optimized resource allocation (e.g., fewer active vehicles needed during certain time intervals) and improved passenger wait times.
The results also confirm that FastMinTC+ always achieves a lower execution time and a lower memory consumption than DLMinTC+. Nevertheless, the execution times of the two algorithms are of the same order of magnitude and the memory consumption of DLMinTC+ remains limited, which makes DLMinTC+ a suitable solution in real-world scenarios.
  6. Conclusions
This work introduces DLMinTC+, a DL-based algorithm designed to address the MinTCover problem on temporal graphs. Our approach centers on two primary goals: first, to develop a model capable of outperforming traditional heuristic and approximate methods, demonstrating the effectiveness of DL in solving CO problems on dynamic networks; second, to advance the field of temporal knowledge graph analysis by showcasing how DL techniques can capture both temporal and structural nuances, providing robust embeddings that enable meaningful analysis of complex temporal interactions.
DLMinTC+ solves the MinTCover problem in a data-driven fashion, combining GraphSAGE for node embedding, Transformer-based temporal encoding to capture sequential dynamics, and a pointer mechanism to determine optimal start and end points for activity intervals. This architecture is designed to model both structural and temporal features of temporal graphs effectively. To train such a network, we define a custom loss function that balances covering the edges, minimizing the sum-span, and guaranteeing the integrity of the solution. Even so, since the output of DL models is non-deterministic, we also provide a novel iterative algorithm that builds a valid final timeline, with theoretical guarantees on the validity and minimality of the solution.
We test our methodology on synthetic and real instances. The results of our experiments underscore the stability and generalization ability of the model, particularly on dense, highly connected datasets, where it achieves significant coverage efficiency. While DLMinTC+ incurs slightly higher computational time, this trade-off is offset by the model’s enhanced coverage precision, making it a viable approach for applications prioritizing solution accuracy.
Looking ahead, several directions for future research can further advance this work. While DLMinTC+ achieves notable improvements in coverage size, the computational cost for large graphs remains a challenge. Future research could explore distributed or parallel processing strategies, more efficient graph partitioning techniques, and hardware-accelerated solutions to enhance computational efficiency. Moreover, although the proposed method achieves strong performance, further work could focus on improving the interpretability of the model’s outputs. Techniques such as attention visualization or feature importance analysis can help users better understand the decision-making process and ensure trust in critical applications.
This study paves the way for further exploration of DL methodologies in CO on temporal graphs, suggesting exciting possibilities for scalable and adaptive solutions in dynamic network scenarios. The avenues for future research outlined above aim to enhance the scalability, generalizability, and practical utility of DL approaches for temporal CO, supporting broader adoption in dynamic network analysis.