1. Introduction
Information systems in finance, healthcare, manufacturing, and public administration continuously record detailed execution logs of business processes. Process mining has emerged as a discipline that extracts insights from these event logs to discover process models, analyse conformance, and support operational decision making. Predictive process monitoring extends this paradigm by learning models that anticipate the future behaviour of running cases. Typical objectives include predicting which activity will occur next, how much time remains until completion, or whether a case will lead to an undesired outcome, such as breaching a service-level agreement or triggering an exception [1, 2, 3].
Early predictive process monitoring methods relied on hand-crafted features extracted from traces and traditional machine learning models. Subsequent work introduced recurrent neural networks, particularly LSTMs and GRUs, which substantially improved next-activity and remaining-time prediction by encoding prefixes of event sequences [
4,
5,
6]. More recent approaches adopt Transformer architectures and sophisticated positional encodings to handle long and complex traces [
7]. In parallel, researchers have started to exploit the graph structure of processes. This includes graph neural networks defined on Petri net place graphs, directly follows graphs, and object-centric event graphs [
8]. These models show that graph structure carries complementary information beyond textual activity labels and timestamps.
Despite this progress, three important limitations remain. First, most predictive monitoring models are trained and evaluated on a single event log. They tend to overfit to the idiosyncrasies of a single process and do not generalise well to different processes, even when they belong to the same organisation or domain [
9]. Second, existing graph-based models are typically small and task-specific. They are designed to solve either next-activity prediction or remaining-time prediction on one log, and they do not exploit the possibility of pretraining across multiple logs. Third, even when graph structure is used, most methods rely on simple representations such as directly follows graphs and do not fully capture the joint interactions between events, cases, and resources [
10,
11].
At the same time, the broader graph learning community has started to explore graph foundation models, which are large graph neural backbones pretrained on heterogeneous graphs with self-supervised objectives and later adapted to diverse tasks [
12,
13]. Such models have demonstrated strong transfer performance in recommendation, molecular property prediction, and knowledge graphs, and surveys now classify them as a new generation of representation learning for structured data [
14,
15]. However, graph foundation models have not yet been systematically explored in the context of process mining and predictive monitoring. This gap suggests a natural research question: can one develop a domain-specific graph foundation model that pretrains on multiple event logs and generalises across predictive process monitoring tasks?
This paper addresses this question by introducing ProcessGFM, a domain-specific graph pretraining prototype tailored to predictive process monitoring on event graphs. ProcessGFM adopts a hierarchical architecture that jointly encodes event-level, case-level, and resource-level structure. It is pretrained with self-supervised objectives that are specific to event logs and then adapted through a multi-task prediction head to next-activity prediction, remaining-time regression, and risk classification. An adversarial domain alignment module encourages the model to learn process-agnostic representations that transfer across logs. Extensive experiments on three BPI Challenge logs and a synthetic hospital log demonstrate that ProcessGFM achieves consistent improvements over strong baselines and maintains a high level of performance in cross-log scenarios.
The main contributions of this work are as follows.
We introduce a domain-specific graph pretraining prototype for predictive process monitoring. The model performs multi-log self-supervised pretraining and supports several downstream prediction tasks within a unified backbone, but its scale is intentionally moderate (four public logs) compared with universal graph foundation models [
16].
We develop a hierarchical graph representation that jointly models events, cases, and resources, complemented by temporal encodings that respect both control-flow structure and time-dependent behaviour.
We design a multi-task adaptation mechanism together with an adversarial domain alignment module. These components jointly promote strong predictive performance and enhance robustness when transferring across heterogeneous event logs.
We provide a comprehensive empirical study on real-life event logs, including detailed ablation experiments and interpretability analysis in which high-attention subgraphs are projected back to process models to expose behavioural bottlenecks.
2. Related Work
Related work falls into three strands: predictive process monitoring, graph-based process mining, and graph foundation models.
Predictive process monitoring combines process mining with predictive analytics. Early studies extracted features from running traces and trained classical classifiers and regressors to predict the next activity or the remaining time [
17,
18]. Later, recurrent models such as LSTM networks and gated recurrent units were used to encode prefixes, leading to higher accuracy in both next-activity and remaining-time tasks [
4,
19]. More recent studies explore convolutional and Transformer-based architectures and sophisticated positional encodings along the trace [
7,
20,
21,
22]. These methods, however, almost always treat each trace as a linear sequence, with limited structural information beyond positional indices.
Graph-based process mining attempts to incorporate the concurrent and branching nature of processes. Graph convolutional networks have been applied to place graphs derived from Petri nets to interpret recurrent models and to assist in process discovery [
23,
24]. Directly follows graphs have been used as input to graph neural networks for predictive tasks and process discovery, with recent work exploring different variants of the underlying graph representation [
25,
26,
27,
28]. Object-centric predictive monitoring represents event logs as heterogeneous graphs linking events, cases, and objects and uses graph attention networks to encode these relations [
29]. Heterogeneous graph neural networks have also been proposed for explainable predictive monitoring, where attention weights highlight critical activity and resource combinations [
30]. Related work on heterogeneous graph contrastive learning further emphasises multi-scale, attribute-aware views [31]. This is conceptually aligned with our use of multiple structural signals, although our approach is tailored to event graphs and to objectives specific to control flow and outcomes. These models show the benefit of graph structure but remain limited to single logs and specific tasks.
Graph foundation models aim to bring the success of large pretrained models to the graph domain. Surveys distinguish between universal, task-specific, and domain-specific graph foundation models and analyse their backbone architectures, pretraining tasks, and adaptation mechanisms [
16]. Pretraining strategies include masked node and edge prediction, contrastive learning across graph views, subgraph prediction, motif prediction, and structural encoding [
32,
33]. Tutorial articles discuss the challenges of structural alignment, scalability, and evaluation across heterogeneous graphs [
34]. To the best of our knowledge, no prior work has introduced a graph foundation model specialised for predictive process monitoring on event graphs. Existing graph-based process mining methods focus either on process discovery or on predictive tasks on a single log and do not exploit cross-log pretraining. Following the taxonomy of Liu et al. [16], ProcessGFM should therefore be viewed as a domain-specific graph pretraining prototype rather than a universal graph foundation model: it focuses on predictive process monitoring, is pretrained on a small number of business-process logs, and is evaluated on a restricted set of tasks.
3. Preliminaries and Problem Formulation
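An event log is a finite collection of traces that record the execution of a business process. We denote an event log by
$$L = \{\sigma_1, \sigma_2, \ldots, \sigma_N\},$$
where each trace $\sigma_i$ corresponds to a single process instance or case.
Each trace $\sigma_i$ is represented as a finite sequence of events
$$\sigma_i = \langle e_{i,1}, e_{i,2}, \ldots, e_{i,n_i} \rangle,$$
with $n_i$ as the length of case $i$. Every event $e_{i,j}$ is described by a tuple of attributes
$$e_{i,j} = (a_{i,j}, t_{i,j}, r_{i,j}, x_{i,j}),$$
where $a_{i,j} \in \mathcal{A}$ is the activity label, $t_{i,j}$ is the timestamp, $r_{i,j}$ is the resource identifier, and $x_{i,j}$ collects additional numerical or categorical attributes [35].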
Here we explicitly distinguish between “activity” and “event”. An activity denotes a process-level action type (i.e., a categorical label in the activity set $\mathcal{A}$), while an event is a concrete execution instance of an activity within a specific case, associated with a timestamp, a resource, and optional attributes. Thus, multiple events across different cases (or within the same case) may share the same activity label. In other words, activity is the type-level concept, whereas event is the instance-level record.
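For a fixed trace $\sigma_i$, we define the prefix of length $k$ as
$$\sigma_i^{(k)} = \langle e_{i,1}, \ldots, e_{i,k} \rangle, \qquad 1 \le k < n_i.$$
The next activity label associated with this prefix is
$$y^{\mathrm{act}}_{i,k} = a_{i,k+1},$$
and the remaining time of the case at prefix $\sigma_i^{(k)}$ is defined as
$$y^{\mathrm{time}}_{i,k} = t_{i,n_i} - t_{i,k}.$$
Risk labels are defined at the case level by a binary variable $y^{\mathrm{risk}}_i \in \{0, 1\}$, indicating whether case $i$ will lead to an undesired outcome. For simplicity, every prefix $\sigma_i^{(k)}$ inherits the eventual outcome of its case so that $y^{\mathrm{risk}}_{i,k} = y^{\mathrm{risk}}_i$.
In order to exploit structural information, we construct for each log $L$ a heterogeneous event graph
$$G = (V, \mathcal{E}, \tau, \phi),$$
where $V$ is the set of nodes, $\mathcal{E} \subseteq V \times V$ is the set of directed edges, $\tau$ assigns node types, and $\phi$ assigns edge types. We consider three node types,
$$V = V_E \cup V_C \cup V_R,$$
where $V_E$ denotes event nodes, $V_C$ case nodes, and $V_R$ resource nodes. For each event $e_{i,j}$, we create an event node $v^E_{i,j}$; for each case $\sigma_i$, a case node $v^C_i$; and for each resource $r$, a resource node $v^R_r$.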
Edges encode control-flow, case membership, resource assignment, and organisational relations. Control-flow edges connect successive events within a trace, case membership edges connect events to their case, and resource assignment edges connect events to their executing resources. Organisational edges connect resources that collaborate on the same case within a fixed time window $\Delta$. We write the full edge set as
$$\mathcal{E} = \mathcal{E}_{\mathrm{cf}} \cup \mathcal{E}_{\mathrm{case}} \cup \mathcal{E}_{\mathrm{res}} \cup \mathcal{E}_{\mathrm{org}},$$
where the four subsets correspond to these relations. Edge types are determined by membership in these subsets via $\phi$.
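Example 1 (Trace-to-graph construction). Consider a trace $\sigma_i$ from BPI2012 with three events whose activities are $a_{i,1}$, $a_{i,2}$, and $a_{i,3}$. We create three event nodes $v^E_{i,1}$, $v^E_{i,2}$, $v^E_{i,3}$, one case node $v^C_i$, and one resource node $v^R_r$ for each distinct resource $r$ executing any of these events. Control-flow edges connect successive events, i.e., $(v^E_{i,1}, v^E_{i,2})$ and $(v^E_{i,2}, v^E_{i,3})$. Case membership edges link each event to its case, $(v^E_{i,j}, v^C_i)$ for $j = 1, 2, 3$, and resource assignment edges connect events to their executor, $(v^E_{i,j}, v^R_{r_{i,j}})$. If two resources $r$ and $r'$ co-occur in the same case within a time window $\Delta$, we additionally add organisational edges $(v^R_r, v^R_{r'})$. Figure 1 provides a schematic illustration of this trace-to-graph construction.
For a prefix $\sigma_i^{(k)}$ of trace $\sigma_i$, we define the induced prefix subgraph
$$G_i^{(k)} = (V_i^{(k)}, \mathcal{E}_i^{(k)}),$$
where $V_i^{(k)}$ contains the case node $v^C_i$, the event nodes $v^E_{i,j}$ for $j \le k$, and the corresponding resource nodes, and $\mathcal{E}_i^{(k)}$ contains all edges between these nodes inherited from $G$.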
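The trace-to-graph construction and the induced prefix subgraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the node and edge encodings, the function names `build_event_graph` and `prefix_subgraph`, the activity names "A", "B", "C", and the edge directions (event to case, event to resource) are all hypothetical choices made for the example.

```python
def build_event_graph(trace, case_id, delta):
    """Build nodes and typed edges for one trace.

    trace: list of (activity, timestamp, resource) tuples ordered by time.
    delta: time window for organisational edges (same unit as timestamps).
    """
    case = ("case", case_id)
    nodes = {case}
    edges = []  # (source, target, edge_type)
    for j, (_activity, _ts, resource) in enumerate(trace):
        event = ("event", case_id, j)
        nodes.add(event)
        nodes.add(("resource", resource))
        edges.append((event, case, "case"))                   # case membership
        edges.append((event, ("resource", resource), "res"))  # resource assignment
        if j > 0:
            edges.append((("event", case_id, j - 1), event, "cf"))  # control flow
    # Organisational edges: distinct resources active in this case within delta.
    seen = set()
    for j, (_, t1, r1) in enumerate(trace):
        for _, t2, r2 in trace[j + 1:]:
            pair = tuple(sorted((r1, r2)))
            if r1 != r2 and abs(t2 - t1) <= delta and pair not in seen:
                seen.add(pair)
                edges.append((("resource", pair[0]), ("resource", pair[1]), "org"))
    return nodes, edges


def prefix_subgraph(edges, case_id, k):
    """Keep the case node, the first k event nodes, and their resource nodes."""
    keep = {("case", case_id)} | {("event", case_id, j) for j in range(k)}
    keep |= {dst for src, dst, typ in edges if typ == "res" and src in keep}
    sub_edges = [e for e in edges if e[0] in keep and e[1] in keep]
    return keep, sub_edges


# Toy trace with hypothetical activities and resources.
toy_trace = [("A", 0.0, "r1"), ("B", 1.0, "r2"), ("C", 5.0, "r1")]
nodes, edges = build_event_graph(toy_trace, case_id="c1", delta=2.0)
sub_nodes, sub_edges = prefix_subgraph(edges, case_id="c1", k=2)
```

For the toy trace, the full graph has one case node, three event nodes, and two resource nodes; the prefix of length two drops the third event node and every edge incident to it.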
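The predictive process monitoring tasks considered in this work can be formulated as learning conditional distributions over labels given prefix subgraphs. The next-activity prediction problem is to approximate
$$p\big(y^{\mathrm{act}}_{i,k} \mid G_i^{(k)}\big),$$
the remaining time prediction problem is to approximate
$$p\big(y^{\mathrm{time}}_{i,k} \mid G_i^{(k)}\big),$$
and the risk prediction problem is to approximate
$$p\big(y^{\mathrm{risk}}_{i,k} \mid G_i^{(k)}\big).$$
We are given a finite collection of event logs $\mathcal{D} = \{L^{(1)}, \ldots, L^{(M)}\}$ with corresponding event graphs $G^{(m)}$ for $m = 1, \ldots, M$. Our goal is to learn a single parameterised backbone mapping
$$f_\theta : G_i^{(k)} \mapsto z_i^{(k)} \in \mathbb{R}^d,$$
which assigns to every prefix subgraph a representation vector $z_i^{(k)}$. On top of this backbone, a collection of task-specific heads $g_{\mathrm{act}}$, $g_{\mathrm{time}}$, $g_{\mathrm{risk}}$ defines composite predictors $g_{\mathrm{act}} \circ f_\theta$, $g_{\mathrm{time}} \circ f_\theta$, and $g_{\mathrm{risk}} \circ f_\theta$. In addition to minimising prediction error on individual logs, we require that the backbone $f_\theta$ generalises across logs, in the sense that its performance under domain shifts between $L^{(m)}$ and $L^{(m')}$ remains stable when evaluated on prefixes from $L^{(m')}$ not seen during fine-tuning.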
4. The ProcessGFM Model
This section describes the ProcessGFM architecture, its self-supervised pretraining objectives, the multi-task adaptation mechanism, and the domain alignment module. An overview of the framework is shown in
Figure 2.
4.1. Hierarchical Graph Backbone
For each node $v \in V$ in an event graph $G$, we construct an initial feature vector by concatenating type-specific embeddings and numerical attributes and projecting them to a common dimension. Formally, we use a type-dependent embedding function $\psi_{\tau(v)}$ and set
$$h_v^{(0)} = \psi_{\tau(v)}(x_v) \in \mathbb{R}^d,$$
where $x_v$ contains the raw attributes of node $v$ and $h_v^{(0)}$ is its initial hidden representation.
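The hierarchical backbone consists of $L$ stacked message-passing layers. At layer $\ell \in \{1, \ldots, L\}$, each node $v$ aggregates messages from its neighbourhood. Let $\mathcal{N}(v)$ denote the set of in-neighbours of $v$, and let $x_{uv}$ be an edge feature vector associated with edge $(u, v)$. The unnormalised attention coefficient from node $u$ to node $v$ at layer $\ell$ is defined as
$$e_{uv}^{(\ell)} = a_{\phi(u,v)}^{\top}\, \eta\!\left( W_{\tau(u)}^{(\ell)} h_u^{(\ell-1)} \,\big\|\, W_{\tau(v)}^{(\ell)} h_v^{(\ell-1)} \,\big\|\, W_e^{(\ell)} x_{uv} \right),$$
where $W_{\tau(u)}^{(\ell)}$ and $W_{\tau(v)}^{(\ell)}$ are type-specific weight matrices, $W_e^{(\ell)}$ is a shared edge weight matrix, $\eta$ is a nonlinearity, $a_{\phi(u,v)}$ is an edge-type-specific attention vector, and $\|$ denotes concatenation.
The normalised attention coefficient is obtained through a softmax over the neighbourhood:
$$\alpha_{uv}^{(\ell)} = \frac{\exp\big(e_{uv}^{(\ell)}\big)}{\sum_{u' \in \mathcal{N}(v)} \exp\big(e_{u'v}^{(\ell)}\big)}.$$
The aggregated message at node $v$ is then
$$m_v^{(\ell)} = \sum_{u \in \mathcal{N}(v)} \alpha_{uv}^{(\ell)}\, W_{\tau(u)}^{(\ell)} h_u^{(\ell-1)}.$$
The updated hidden state of node $v$ is computed via a residual transformation and normalisation,
$$h_v^{(\ell)} = \mathrm{LayerNorm}\!\left( h_v^{(\ell-1)} + \rho\!\left( U_{\tau(v)}^{(\ell)} m_v^{(\ell)} \right) \right),$$
where $U_{\tau(v)}^{(\ell)}$ is a type-specific projection and $\rho$ is a nonlinear activation function.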
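To make the message-passing step concrete, the following pure-Python sketch runs one attention layer on a three-node toy graph. It is a simplified illustration rather than the ProcessGFM implementation: edge features are omitted, leaky ReLU is assumed for both nonlinearities, the normalisation is a parameter-free layer normalisation, and all names (`attention_layer`, `W`, `U`, `a`) and numerical values are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return [xi if xi > 0 else slope * xi for xi in x]


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def attention_layer(h, node_type, in_nbrs, W, U, a):
    """One layer: score in-neighbours, softmax, aggregate, residual update.

    h: node -> vector; in_nbrs: node -> list of (neighbour, edge_type);
    W, U: node type -> square matrix; a: edge type -> attention vector.
    """
    new_h, attention = {}, {}
    for v, nbrs in in_nbrs.items():
        if not nbrs:  # no incoming messages: keep the previous state
            new_h[v] = list(h[v])
            continue
        hv = matvec(W[node_type[v]], h[v])
        msgs, scores = [], []
        for u, edge_type in nbrs:
            hu = matvec(W[node_type[u]], h[u])
            feat = leaky_relu(hu + hv)  # list concatenation, then nonlinearity
            scores.append(sum(ai * fi for ai, fi in zip(a[edge_type], feat)))
            msgs.append(hu)
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        alpha = [e / sum(exps) for e in exps]
        m = [sum(al * msg[i] for al, msg in zip(alpha, msgs)) for i in range(len(hv))]
        update = leaky_relu(matvec(U[node_type[v]], m))
        new_h[v] = layer_norm([x + y for x, y in zip(h[v], update)])
        attention[v] = alpha
    return new_h, attention


# Toy graph: two event nodes and one case node with 2-dimensional states.
identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = {"e1": [1.0, 0.0], "e2": [0.0, 1.0], "c": [1.0, 1.0]}
types = {"e1": "event", "e2": "event", "c": "case"}
nbrs = {"e1": [], "e2": [("e1", "cf"), ("c", "case")], "c": [("e1", "case"), ("e2", "case")]}
W = {"event": identity, "case": identity}
U = {"event": identity, "case": identity}
a = {"cf": [0.5] * 4, "case": [0.1] * 4}
h1, att = attention_layer(h0, types, nbrs, W, U, a)
```

With these toy attention vectors, node `e2` assigns more weight to its control-flow predecessor `e1` than to the case node, which mirrors how edge-type-specific attention can prioritise one relation over another.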
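After $L$ layers, we obtain final node representations $h_v^{(L)}$ for all nodes. For a prefix subgraph $G_i^{(k)}$, we construct a prefix-level representation by aggregating the representations of the case node $v^C_i$ and the event nodes corresponding to the prefix $\sigma_i^{(k)}$. Let $S_i^{(k)}$ denote this set of nodes. We define attention weights on $S_i^{(k)}$ as
$$\beta_v = \frac{\exp\big(q^{\top} h_v^{(L)}\big)}{\sum_{v' \in S_i^{(k)}} \exp\big(q^{\top} h_{v'}^{(L)}\big)},$$
where $q$ is a learned query vector. The prefix representation is then
$$z_i^{(k)} = \sum_{v \in S_i^{(k)}} \beta_v\, h_v^{(L)},$$
and serves as the shared foundation representation for all downstream predictive tasks.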
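The attention-based readout described above amounts to a softmax-weighted average of node representations. The sketch below illustrates it in plain Python; the function name `attention_readout` and the toy vectors are hypothetical, and the query vector is given by hand rather than learned.

```python
import math


def attention_readout(node_vectors, query):
    """Pool node vectors with softmax weights derived from a query vector."""
    scores = [sum(q * x for q, x in zip(query, h)) for h in node_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    pooled = [sum(w * h[i] for w, h in zip(weights, node_vectors)) for i in range(dim)]
    return pooled, weights


# Case node plus two event nodes of a prefix (toy 2-dimensional states).
prefix_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_pooled, uniform_weights = attention_readout(prefix_nodes, query=[0.0, 0.0])
focused_pooled, focused_weights = attention_readout(prefix_nodes, query=[5.0, 5.0])
```

A zero query reduces the readout to mean pooling, while a query aligned with one node concentrates the weight on it, so the learned query controls how strongly the prefix representation focuses on particular events.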
4.2. Self-Supervised Pretraining
The backbone parameters $\theta$ are initialised through self-supervised pretraining on all logs in $\mathcal{D}$, using three complementary objectives.
Figure 3 summarises these pretraining tasks.
For masked activity reconstruction, we select for each graph $G$ a random subset $M \subseteq V_E$ of event nodes and replace their activity labels by a special mask token. Let $a_v$ denote the original activity label of event node $v$. The masked activity loss is the cross-entropy
$$\mathcal{L}_{\mathrm{mask}} = -\frac{1}{|M|} \sum_{v \in M} \log p_\theta\big(a_v \mid \tilde{G}\big),$$
where $\tilde{G}$ denotes the graph with masked labels and $p_\theta$ is induced by a softmax layer applied to node representations.
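Masked activity reconstruction can be illustrated with a small Python sketch that masks a random subset of activity labels and evaluates the cross-entropy only at the masked positions. The mask rate, the token string, and the function names are assumptions made for the example; the predicted probabilities are supplied by hand instead of coming from a trained backbone.

```python
import math
import random

MASK = "<MASK>"  # hypothetical mask token


def mask_activities(labels, mask_rate, rng):
    """Replace a random subset of activity labels by the mask token."""
    n_mask = max(1, round(mask_rate * len(labels)))
    masked_idx = sorted(rng.sample(range(len(labels)), n_mask))
    masked = [MASK if i in masked_idx else a for i, a in enumerate(labels)]
    return masked, masked_idx


def masked_cross_entropy(probs, labels, masked_idx):
    """Average negative log-likelihood of the original labels at masked positions.

    probs[i] maps each activity to its predicted probability for event i.
    """
    return -sum(math.log(probs[i][labels[i]]) for i in masked_idx) / len(masked_idx)


labels = ["A", "B", "C", "D"]  # toy activity labels
masked, masked_idx = mask_activities(labels, mask_rate=0.5, rng=random.Random(0))
uniform = [{a: 0.25 for a in labels} for _ in labels]
loss = masked_cross_entropy(uniform, labels, masked_idx)
```

A predictor that spreads its probability uniformly over four activities incurs a loss of log 4, which serves as a simple reference point for judging whether the backbone has learned anything about control flow.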
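For temporal order consistency and pseudo outcome prediction, we adopt analogous cross-entropy objectives defined on subsequences and case-level representations, respectively. Denoting their losses by $\mathcal{L}_{\mathrm{ord}}$ and $\mathcal{L}_{\mathrm{out}}$, the overall pretraining objective is a weighted sum
$$\mathcal{L}_{\mathrm{pre}} = \lambda_1 \mathcal{L}_{\mathrm{mask}} + \lambda_2 \mathcal{L}_{\mathrm{ord}} + \lambda_3 \mathcal{L}_{\mathrm{out}},$$
with non-negative weights $\lambda_1, \lambda_2, \lambda_3$. Minimising $\mathcal{L}_{\mathrm{pre}}$ over all logs in $\mathcal{D}$ yields a backbone that captures control-flow regularities, temporal constraints, and coarse outcome information.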
The weights
are chosen to balance these three sources of supervision. To assess how sensitive ProcessGFM is to this choice, we performed a small grid search over
while fixing
and evaluated next-activity accuracy, remaining-time MAE, and risk AUC on BPI2017 and BPI2019; detailed values are reported in
Table A8 and
Table A9 in
Appendix B. Variants that use only masked activity reconstruction
consistently underperform variants that also include temporal order or pseudo-outcome objectives, confirming that all three tasks contribute a complementary signal. The configuration
, which we use in all other experiments, achieves a favourable trade-off between the three downstream metrics on both logs, while neighbouring settings such as
or
yield very similar results. This indicates that ProcessGFM is not overly sensitive to moderate changes in the pretraining loss weights.
4.3. Multi-Task Prediction Head
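After pretraining, the backbone is adapted to predictive monitoring tasks by attaching a multi-task head to the prefix representations $z_i^{(k)}$. The next-activity predictor maps $z_i^{(k)}$ to a probability distribution over the activity set $\mathcal{A}$:
$$p\big(a \mid z_i^{(k)}\big) = \frac{\exp\big(w_a^{\top} z_i^{(k)} + b_a\big)}{\sum_{a' \in \mathcal{A}} \exp\big(w_{a'}^{\top} z_i^{(k)} + b_{a'}\big)},$$
where $w_a$ and $b_a$ are activity-specific parameters. The risk predictor outputs a probability
$$\hat{y}^{\mathrm{risk}}_{i,k} = \mathrm{sigm}\big(w_{\mathrm{risk}}^{\top} z_i^{(k)} + b_{\mathrm{risk}}\big),$$
where $w_{\mathrm{risk}} \in \mathbb{R}^d$, $b_{\mathrm{risk}} \in \mathbb{R}$, and $\mathrm{sigm}$ is the logistic function. Remaining time is predicted by an affine mapping in $z_i^{(k)}$, which we denote by $\hat{y}^{\mathrm{time}}_{i,k}$.
Given a set of labelled prefixes $\mathcal{P}$, we define task-specific losses $\mathcal{L}_{\mathrm{act}}$, $\mathcal{L}_{\mathrm{time}}$, and $\mathcal{L}_{\mathrm{risk}}$ based on cross-entropy and mean absolute error. The total multi-task adaptation loss is
$$\mathcal{L}_{\mathrm{task}} = \mu_{\mathrm{act}} \mathcal{L}_{\mathrm{act}} + \mu_{\mathrm{time}} \mathcal{L}_{\mathrm{time}} + \mu_{\mathrm{risk}} \mathcal{L}_{\mathrm{risk}},$$
with non-negative weights $\mu_{\mathrm{act}}, \mu_{\mathrm{time}}, \mu_{\mathrm{risk}}$ that control the relative importance of the tasks.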
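The multi-task adaptation loss can be sketched as a weighted sum of a softmax cross-entropy for next-activity prediction, a mean absolute error for remaining time, and a binary cross-entropy for risk. The snippet below is a minimal illustration; the dictionary keys, the function name `multitask_loss`, and the weights are hypothetical, and the head outputs are given directly rather than computed from a backbone.

```python
import math


def softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_loss(batch, weights):
    """Weighted sum of next-activity, remaining-time, and risk losses.

    Each batch entry holds head outputs and labels for one prefix.
    """
    ce = mae = bce = 0.0
    for ex in batch:
        probs = softmax(ex["act_logits"])
        ce += -math.log(probs[ex["act_label"]])          # cross-entropy
        mae += abs(ex["time_pred"] - ex["time_label"])   # absolute error
        p = 1.0 / (1.0 + math.exp(-ex["risk_logit"]))    # logistic function
        y = ex["risk_label"]
        bce += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    n = len(batch)
    losses = {"act": ce / n, "time": mae / n, "risk": bce / n}
    total = sum(weights[name] * value for name, value in losses.items())
    return total, losses


toy_batch = [{
    "act_logits": [0.0, 0.0], "act_label": 0,
    "time_pred": 3.0, "time_label": 1.0,
    "risk_logit": 0.0, "risk_label": 1,
}]
total, parts = multitask_loss(toy_batch, {"act": 1.0, "time": 0.5, "risk": 1.0})
```

Because the three losses live on different scales (nats versus time units), the weights also act as a rescaling, which is one reason their choice can matter in practice.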
4.4. Domain Alignment
Event logs from different organisations and domains may have disjoint activity labels, distinct control-flow structures, and heterogeneous resource behaviours. To encourage the backbone to learn representations that are robust under such domain shifts, ProcessGFM employs an adversarial domain alignment module.
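For every prefix $\sigma_i^{(k)}$ from log $L^{(m)}$, we associate a domain label $d_{i,k} = m$. A domain classifier $D_\omega$ maps the prefix representation $z_i^{(k)}$ to a probability distribution over domains:
$$p_\omega\big(d = m \mid z_i^{(k)}\big) = \frac{\exp\big(u_m^{\top} z_i^{(k)} + c_m\big)}{\sum_{m'=1}^{M} \exp\big(u_{m'}^{\top} z_i^{(k)} + c_{m'}\big)},$$
where $u_m$ and $c_m$ are domain-specific parameters and $\omega$ denotes the collection of these parameters. The domain classification loss $\mathcal{L}_{\mathrm{dom}}$ is the average cross-entropy between $p_\omega(\cdot \mid z_i^{(k)})$ and $d_{i,k}$.
Domain alignment is achieved by solving the minimax problem
$$\min_{\theta, \vartheta}\; \max_{\omega}\; \mathcal{L}_{\mathrm{task}}(\theta, \vartheta) - \gamma\, \mathcal{L}_{\mathrm{dom}}(\theta, \omega),$$
where $\vartheta$ collects the parameters of the task heads and $\gamma \ge 0$ controls the strength of alignment. In practice, this objective is implemented via a gradient reversal layer that multiplies the gradient of $\mathcal{L}_{\mathrm{dom}}$ with respect to $\theta$ by $-\gamma$ during backpropagation, thereby encouraging the backbone to produce representations that are informative for the predictive tasks while being as invariant as possible with respect to the domain labels.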
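The effect of a gradient reversal layer can be shown without any deep learning framework: the layer is the identity in the forward pass and flips and scales the gradient in the backward pass. The sketch below uses hand-written gradients; the class and function names and the alignment strength of 0.5 are hypothetical and chosen only for illustration.

```python
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -strength backward."""

    def __init__(self, strength):
        self.strength = strength

    def forward(self, z):
        return list(z)

    def backward(self, grad_z):
        return [-self.strength * g for g in grad_z]


def backbone_gradient(grad_task, grad_domain, layer):
    """Gradient reaching the prefix representation from both branches.

    grad_task: gradient of the task loss with respect to the representation.
    grad_domain: gradient of the domain loss with respect to the representation.
    """
    reversed_grad = layer.backward(grad_domain)
    return [gt + gd for gt, gd in zip(grad_task, reversed_grad)]


layer = GradientReversal(strength=0.5)
z = layer.forward([1.0, 2.0])  # unchanged representation
combined = backbone_gradient([1.0, 0.0], [2.0, -4.0], layer)
```

The domain classifier itself still receives the unreversed gradient and therefore keeps improving, while the backbone is pushed in the direction that makes the domain harder to predict.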
To analyse the stability of adversarial domain alignment, we varied the alignment strength
on BPI2017 and evaluated both in-domain performance and cross-log transfer to BPI2019; full numerical results are provided in
Table A10 in
Appendix B. Values in the range
–
were numerically stable and consistently improved cross-log robustness over
, with
providing the best trade-off between task performance and invariance. For larger values (e.g.,
) we observed oscillations in the domain loss and a slight degradation of downstream metrics, which is consistent with known issues in minimax training. In all experiments, we therefore fix
, use a smaller learning rate for the domain classifier, and apply gradient clipping to ensure stable training.
5. Datasets and Experimental Setup
We evaluate ProcessGFM on three real-life event logs from the BPI Challenges and on a synthetic hospital log used in prior work on predictive monitoring benchmarks [
36].
Table 1 summarises the main characteristics of these logs.
The BPI2012 log records a loan application process at a Dutch financial institution and has been widely used in predictive monitoring experiments [
36,
37]. The BPI2017 log describes loan offers in an updated information system of the same institution [
35]. The BPI2019 log captures the purchase order handling process at a multinational coatings and paints company and is notable for its larger number of cases and events. The hospital log simulates patient flows in a medium-sized hospital and is taken from an existing benchmark suite [
36].
Across all logs, the resulting heterogeneous event graphs are extremely sparse: the overall directed edge density
$|\mathcal{E}| / \big(|V|(|V| - 1)\big)$ is on the order of
. Each event node has on average between 5.0 and 5.7 incident edges, reflecting incoming and outgoing control-flow links, case membership, resource assignment, and a small number of organisational edges. These statistics confirm that the graphs remain tractable in size despite the large number of events and that sparsity can be exploited by graph neural network architectures.
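The two sparsity statistics mentioned here can be computed directly from an edge list. The sketch below assumes that the directed density is the number of edges divided by the number of ordered node pairs, which is a common convention but an assumption about the exact normalisation used; the toy graph and the helper names are hypothetical.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of ordered node pairs that are connected by an edge."""
    return num_edges / (num_nodes * (num_nodes - 1))


def mean_incident_edges(edges, is_event):
    """Average number of incident edges (incoming or outgoing) per event node."""
    counts = {}
    for src, dst in edges:
        for node in (src, dst):
            if is_event(node):
                counts[node] = counts.get(node, 0) + 1
    return sum(counts.values()) / len(counts)


# Toy graph: three events, one case, two resources.
toy_edges = [
    ("e1", "e2"), ("e2", "e3"),              # control flow
    ("e1", "c"), ("e2", "c"), ("e3", "c"),   # case membership
    ("e1", "r1"), ("e2", "r2"), ("e3", "r1"),  # resource assignment
    ("r1", "r2"),                            # organisational
]
density = directed_density(num_nodes=6, num_edges=len(toy_edges))
avg_degree = mean_incident_edges(toy_edges, is_event=lambda n: n.startswith("e"))
```

When the number of incident edges per node stays bounded, the density shrinks roughly in proportion to one over the number of nodes, which is why large event graphs are sparse even though every event carries several edges.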
Table 2 summarises the corresponding node and edge counts for these heterogeneous event graphs.
For each log, we follow standard practice and split cases into training, validation, and test sets in proportions of 70%, 10%, and 20%. Prefixes are generated by cutting each trace at every event except the last and using the next event as the target for next-activity prediction. Remaining time is defined as the difference in hours between the timestamp of the last event of the case and that of the last observed event in the prefix. Risk labels are derived from trace-level outcomes: cases that exceed the 80th percentile of completion time or are marked as undesirable according to domain attributes are labelled as high risk.
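The data preparation steps above can be sketched as follows. This is an illustration under assumptions rather than the exact preprocessing used in the experiments: the split is a random shuffle of case identifiers (a temporal split would be equally compatible with the description), the 80th percentile uses the nearest-rank definition, and all function names and toy values are hypothetical.

```python
import random


def split_cases(case_ids, rng):
    """Shuffle cases and split them 70/10/20 into train, validation, and test."""
    ids = list(case_ids)
    rng.shuffle(ids)
    n_train = 7 * len(ids) // 10
    n_val = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def make_prefixes(trace):
    """Cut a trace at every event except the last.

    trace: list of (activity, timestamp_in_hours) ordered by time.
    Returns (prefix, next_activity, remaining_hours) triples.
    """
    end_time = trace[-1][1]
    samples = []
    for k in range(1, len(trace)):
        prefix = trace[:k]
        samples.append((prefix, trace[k][0], end_time - prefix[-1][1]))
    return samples


def risk_labels(durations, flagged, percentile=80):
    """Label a case high risk if it is slow or flagged by a domain attribute."""
    ranked = sorted(durations.values())
    rank = (percentile * len(ranked) + 99) // 100  # nearest-rank percentile
    threshold = ranked[rank - 1]
    return {c: int(d > threshold or c in flagged) for c, d in durations.items()}


train, val, test = split_cases([f"c{i}" for i in range(1, 11)], random.Random(0))
samples = make_prefixes([("A", 0.0), ("B", 2.0), ("C", 5.0)])
labels = risk_labels({f"c{i}": float(i) for i in range(1, 11)}, flagged={"c1"})
```

Splitting at the case level before generating prefixes keeps all prefixes of one case in the same partition, which avoids leakage between training and test data.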
We pretrain ProcessGFM on the union of all four logs using full traces and the three self-supervised objectives for all in-domain experiments. For the leave-one-log-out cross-log experiments, we additionally train separate backbones where the eventual test log is excluded from the pretraining corpus. Pretraining runs for 50 epochs with early stopping based on validation loss on a held-out subset of prefixes. For adaptation, we fine-tune separate multi-task heads and domain classifiers for each log while keeping the backbone shared. Training uses the Adam optimiser with a learning rate of for the heads and for the backbone, mini-batches of 64 prefixes, and a maximum prefix length of 100 events.
As baselines, we consider a strong LSTM model, a Transformer with positional trace encoding, a gated graph sequence neural network (GGSNN) defined on directly follows graphs, and a heterogeneous graph neural network for predictive monitoring inspired by recent work on explainable PPM [
30,
38,
39,
40,
41]. All baselines are tuned to reach competitive performance on validation data. To make comparison fair, all models have between 3 and 6 million trainable parameters.
Hyperparameters for all models are selected via validation-based random search. For each baseline and for ProcessGFM, we sample between 24 and 32 configurations from predefined ranges (hidden dimension , number of layers , dropout , and learning rate ) on BPI2012 and select the configuration with the best average validation performance across next-activity and remaining-time tasks. The same configuration is then reused on the other logs without further tuning. We also observe that moderate variations around the selected configuration (e.g., layers or hidden units) change all metrics by less than 1.5 percentage points, which suggests that our conclusions are robust to reasonable hyperparameter choices.
Table 3 reports model sizes and training time. ProcessGFM has a larger backbone and requires a one-time pretraining cost, but fine-tuning time per log remains comparable to that of single-log graph baselines.
6. Results and Analysis
This section presents quantitative results for next-activity prediction, remaining-time prediction, and risk prediction, followed by cross-log transfer experiments, ablation studies, and qualitative interpretability analysis.
6.1. Next-Activity Prediction
Table 4 summarises next-activity prediction performance. We report accuracy and macro-F1 on the test sets of each log.
ProcessGFM achieves the best performance on all three logs, with substantial improvements over the strongest baseline. On BPI2012, it improves accuracy by 2.7 percentage points and macro-F1 by 3.9 points compared with the HeteroGNN PPM model. On BPI2017, the gains are 3.8 and 4.3 points, and on BPI2019 they are 4.5 and 5.7 points. Gains in macro-F1 are particularly important because they indicate better performance on less frequent activities, which are often the most critical from an operational perspective.
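The difference between accuracy and macro-F1 is easiest to see on an imbalanced toy example. The sketch below computes both metrics in plain Python; it averages F1 over the classes that occur in the true labels, which is an assumption about the averaging convention, and the labels are invented for illustration.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A frequent activity "A" and a rare activity "B"; the model always predicts "A".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

A model that ignores the rare activity still reaches 80% accuracy here, but its macro-F1 falls to about 0.44, which is why gains in macro-F1 signal better handling of infrequent activities.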
Figure 4 illustrates how accuracy evolves with prefix length on BPI2017. For each model, prefixes are grouped by length into intervals of five events, and average accuracy is reported. At short prefixes, sequence-based models perform comparatively well because the control-flow structure is simple. As prefixes grow longer and the process becomes more complex, ProcessGFM maintains a higher accuracy than all baselines, with a gap of up to 8.2 percentage points at prefixes between 30 and 35 events. This behaviour confirms that the hierarchical graph backbone preserves useful information in long traces.
6.2. Remaining-Time Prediction
Table 5 reports remaining-time prediction results. We use mean absolute error (MAE) and root mean squared error (RMSE), measured in days, together with the coefficient of determination $R^2$.
Table 6 reports remaining-time prediction on BPI2019 and Hospital.
Table 5 and Table 6 show that across all logs, ProcessGFM yields the lowest MAE and RMSE and the highest $R^2$. On BPI2012, it reduces MAE by 0.28 days compared with HeteroGNN PPM, corresponding to a relative reduction of 18.2%. On BPI2017, the reduction is 15.2%; on BPI2019, it is 11.7%; and on the hospital log, it is 14.6%. These improvements are consistent with the intuition that remaining-time prediction benefits from jointly modelling control-flow, resources, and temporal attributes in a graph.
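For reference, the three regression metrics and the relative reduction can be computed as in the following sketch. The toy predictions are invented; the baseline value of 1.54 days is not reported in the text and is only what an absolute reduction of 0.28 days at a relative reduction of 18.2% would imply (0.28 / 0.182 ≈ 1.54).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and the coefficient of determination."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return mae, rmse, 1.0 - ss_res / ss_tot


def relative_reduction(baseline, improved):
    return (baseline - improved) / baseline


mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
implied = relative_reduction(1.54, 1.54 - 0.28)  # implied baseline, see lead-in
```

The toy example also shows why MAE and RMSE are reported together: a single large error of two days raises the RMSE to 1.0 while the MAE stays at 0.5.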
Figure 5 plots MAE as a function of prefix length for BPI2019. At very short prefixes, uncertainty is inherently high and all models show similar MAE. As more events are observed, ProcessGFM reduces MAE more quickly than baselines, converging to roughly two days of average error near completion, while sequence models remain above 2.4 days.
6.3. Risk Prediction
We next consider case-level risk prediction.
Table 7 shows area under the ROC curve (AUC), F1-score, and precision at 10% recall (P@10%) for identifying high-risk cases.
Table 7 and
Table 8 show that ProcessGFM delivers the highest AUC and F1 across all logs. The most pronounced improvement is in P@10%, which measures the precision of the top fraction of cases flagged as high risk. On BPI2012, ProcessGFM raises P@10% from 82.7% to 89.4%, and on BPI2019 from 79.8% to 87.9%. These gains are particularly relevant in operational settings where only a small proportion of cases can be inspected manually.
Figure 6 plots ROC curves for BPI2017, showing a consistent improvement in the true positive rate across the full range of false positive rates. The area under the ProcessGFM curve is 0.907 compared with 0.871 for the strongest baseline.
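Both ranking metrics can be computed from predicted risk scores without external libraries. The sketch below obtains AUC as the probability that a random high-risk case is scored above a random low-risk case, and reads P@10% as the precision among the top-scored fraction of cases, following the description in the text; this reading, the function names, and the toy scores are assumptions.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_top(y_true, scores, percent=10):
    """Precision among the top `percent` percent of cases by predicted risk."""
    k = max(1, len(scores) * percent // 100)
    ranked = sorted(zip(scores, y_true), reverse=True)[:k]
    return sum(y for _, y in ranked) / k


y_true = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
auc = roc_auc(y_true, scores)
p_top20 = precision_at_top(y_true, scores, percent=20)
```

In this toy ranking, the two highest-scored cases are both high risk, so the top-20% precision is perfect even though one high-risk case is ranked below two low-risk ones and the AUC stays below one.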
6.4. Cross-Log Transfer
To assess cross-log transfer, we simulate a setting where a model is fine-tuned on one log and evaluated without further adaptation on another.
Table 9 reports next-activity accuracy for four pairs of logs. For each pair, the row indicates the training log and the column the test log.
In this leave-one-log-out setup, the ProcessGFM backbone is pretrained only on the three logs that are different from the eventual test log. ProcessGFM retains between 92% and 96% of its in-domain next-activity accuracy when evaluated zero-shot on a different log (e.g., 79.1% vs. 83.7% when transferring from BPI2017 to BPI2019), whereas the HeteroGNN baseline often falls below 85% of its in-domain performance. These results indicate that the self-supervised pretraining and adversarial domain alignment help the backbone capture patterns that are stable across processes. We further report cross-log remaining-time and risk prediction when training on BPI2017 and testing on BPI2019 in
Table A5, where ProcessGFM exhibits a smaller degradation than the HeteroGNN baseline.
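The retention figures quoted above are ratios of cross-log to in-domain accuracy. As a worked example using the two accuracies stated in the text, the following lines recompute the ratio for the BPI2017 to BPI2019 transfer; the function name `retention` is hypothetical.

```python
def retention(cross_log_score, in_domain_score):
    """Share of in-domain performance that survives zero-shot transfer."""
    return cross_log_score / in_domain_score


# 79.1% cross-log accuracy versus 83.7% in-domain accuracy, as quoted in the text.
ratio = retention(79.1, 83.7)
percent_retained = round(100 * ratio, 1)
```

The ratio is roughly 0.945, which lies within the reported range of 92% to 96%.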
Figure 7 provides an additional perspective by visualising case-level embeddings from ProcessGFM using two-dimensional t-SNE. Embeddings from BPI2012, BPI2017, and BPI2019 are plotted together and coloured by log. The clusters for different logs partially overlap, and within each log, high-risk and low-risk cases form discernible subclusters, suggesting that the backbone organises cases in a way that reflects both domain and outcome while preserving common structure.
6.5. Effect of Pretraining Corpus Size
To investigate how the size of the pretraining corpus influences downstream performance, we compare four configurations. In , the backbone is pretrained only on BPI2017; in , it is pretrained on BPI2012 and BPI2017; in , it is pretrained on BPI2012, BPI2017, and BPI2019; and in , it is pretrained on all four logs (our default setting). In all cases, the same multi-task head and fine-tuning procedure are used on BPI2019 and the hospital log.
Enlarging the pretraining corpus yields consistent but moderate gains in downstream accuracy on both BPI2019 and Hospital. Increasing the number of logs from one to three leads to a clear improvement, while adding the fourth log brings additional but smaller gains, suggesting a saturating scaling curve. This behaviour is in line with the intuition that pretraining on a broader set of processes provides more robust graph representations, while also illustrating that our study operates at a moderate scale compared with universal graph foundation models trained on orders of magnitude more graphs. Complete scores, including AUC values, are reported in
Table A7 in the appendix.
6.6. Ablation Study
To understand the contribution of each component of ProcessGFM, we conduct an ablation study on BPI2017.
Table 10 reports next-activity accuracy, remaining-time MAE, and risk AUC for variants where we remove one component at a time.
Removing any self-supervised objective degrades performance. Masked activity reconstruction appears particularly important for next-activity accuracy, while temporal order consistency and pseudo outcome prediction mainly benefit remaining-time and risk prediction. Eliminating the domain alignment module reduces cross-log transfer performance even more strongly than in-domain metrics. Using separate single-task heads instead of a multi-task head also harms all three tasks, confirming that shared representations help. The largest drop occurs when removing case and resource nodes and using an event-only graph, which emphasises the value of the hierarchical representation.
In addition to these component-wise ablations, we analyse how the strength of the self-supervised objectives affects downstream performance by varying the pretraining loss weights. Full numerical values are listed in
Table A8 and
Table A9 in
Appendix B. Using only masked activity reconstruction
yields the weakest results, while adding either temporal order or pseudo-outcome prediction clearly improves all three metrics. The configuration
provides a strong overall trade-off and coincides with the best or near-best performance across tasks, but neighbouring settings such as
and
differ by at most about one percentage point in accuracy and by less than 0.05 days in MAE. This pattern suggests that all three self-supervised tasks are beneficial and that ProcessGFM is robust to moderate variations in their relative weighting.
Our sensitivity analysis indicates that ProcessGFM remains robust to moderate variations of the pretraining and alignment weights, suggesting that a simple manual setting is feasible for the current multi-log benchmark. However, manual selection may still be suboptimal when transferring to unseen processes with different control-flow complexity, resource heterogeneity, or outcome characteristics. A promising direction is to develop adaptive hyperparameter tuning mechanisms that automatically balance heterogeneous self-supervised objectives and domain alignment during training, for example via gradient- or uncertainty-aware loss balancing and stage-wise curricula. Such adaptive loss balancing has shown benefits in related structured learning settings and may be particularly valuable for event-graph pretraining with diverse logs [
42,
43].
Figure 8 visualises the ablation results as a bar chart with three panels for the three metrics. The full model consistently achieves the best scores, while the event-only variant is notably weaker.
Figure 9 complements
Figure 8 by visualising how the self-supervised loss weights and domain alignment strength affect next-activity accuracy, remaining-time MAE, and risk AUC on BPI2017 and BPI2019.
6.7. Backbone Size and Training Cost
We further examine how the size of the ProcessGFM backbone affects performance and computational cost. Figure 10a plots next-activity accuracy on BPI2017 and BPI2019 as a function of the number of parameters for a family of backbones ranging from 3.9 M to 11.8 M parameters. As the model grows from Small to Large, accuracy on both logs increases monotonically but with diminishing marginal gains, while pretraining time grows almost linearly with model size, indicating a clear trade-off between performance and efficiency. Complete numerical results, including MAE and AUC values, are reported in Table A6 in Appendix B.
6.8. Inference Cost and Deployability
We also measure the inference latency of all models on BPI2019. Figure 10c relates GPU and CPU latency to the parameter budget for LSTM, Transformer, HeteroGNN and ProcessGFM. As expected, the LSTM baseline is the fastest and has the smallest parameter budget, while ProcessGFM occupies the upper-right corner of the plot: it attains the highest accuracy at a still moderate GPU latency of around 3.8 ms per prefix. On CPU, latency increases for all models but remains within a range that is compatible with near real-time monitoring. Detailed GPU/CPU latencies and throughput are reported in Table A4 in Appendix B. In all cases, heterogeneous graphs are constructed offline from the event logs; for streaming scenarios, the graph can be updated incrementally as events arrive, so graph construction is not on the critical path of online prediction.
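The incremental update mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction used in our implementation: the class name, the edge-type names, and the choice to link each event only to its case, its resource, and the previous event of the same case are all hypothetical.

```python
from collections import defaultdict

class IncrementalEventGraph:
    """Toy heterogeneous graph with event, case, and resource nodes."""

    def __init__(self):
        self.edges = defaultdict(set)   # edge type -> set of (source, target)
        self.activity = {}              # event id -> activity label
        self.last_event = {}            # case id -> most recent event id
        self.num_events = 0

    def add_event(self, case_id, activity, resource):
        """Insert one event and its edges without rebuilding the graph."""
        event_id = self.num_events
        self.num_events += 1
        self.activity[event_id] = activity
        self.edges["event_in_case"].add((event_id, case_id))
        self.edges["event_by_resource"].add((event_id, resource))
        # Link to the previous event of the same case, if there is one.
        previous = self.last_event.get(case_id)
        if previous is not None:
            self.edges["directly_follows"].add((previous, event_id))
        self.last_event[case_id] = event_id
        return event_id

# Hypothetical stream of three events from two interleaved cases.
graph = IncrementalEventGraph()
graph.add_event("case-1", "Create Order", "res-A")
graph.add_event("case-2", "Create Order", "res-B")
graph.add_event("case-1", "Approve", "res-A")
```

Each arriving event adds one node and at most three edges, so the update cost per event does not depend on the size of the log, which is consistent with keeping graph construction off the critical path.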
6.9. Interpretability and Case Study
To illustrate how ProcessGFM links graph patterns to predictions, we examine a late case from BPI2019 with a completion time in the 95th percentile. Figure 11 shows the subgraph induced by this case in a directly follows graph, with node colours indicating activity types and edge thickness proportional to attention weights assigned by ProcessGFM during remaining-time prediction. The model focuses on a sequence of repeated change and approval activities involving the same resource cluster, which corresponds to a rework loop identified in process discovery [23].
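A visualisation of this kind can be approximated with the short sketch below. It assumes one attention weight per transition of the case and averages the weights of repeated transitions to obtain an edge thickness; both choices, as well as the function names, the example trace, and the example weights, are illustrative assumptions rather than the exact procedure behind Figure 11.

```python
from collections import Counter, defaultdict

def directly_follows_graph(trace):
    """Count how often each activity directly follows another in a trace."""
    return Counter(zip(trace, trace[1:]))

def edge_thickness(trace, attention):
    """Average the per-transition attention over each directly follows edge.

    attention[i] is the weight on the transition trace[i] -> trace[i + 1].
    """
    counts = directly_follows_graph(trace)
    totals = defaultdict(float)
    for edge, weight in zip(zip(trace, trace[1:]), attention):
        totals[edge] += weight
    return {edge: totals[edge] / counts[edge] for edge in counts}

def rework_activities(trace):
    """Activities that occur more than once, a simple indicator of rework."""
    return {activity for activity, n in Counter(trace).items() if n > 1}

# Hypothetical case with a repeated change/approval loop.
trace = ["Create", "Change", "Approve", "Change", "Approve", "Pay"]
attention = [0.1, 0.3, 0.2, 0.3, 0.1]

dfg = directly_follows_graph(trace)
thickness = edge_thickness(trace, attention)
loop = rework_activities(trace)
```

In this toy case the edge from "Change" to "Approve" is traversed twice and receives the largest average weight, which mirrors how a rework loop would stand out in the plotted subgraph.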
Mapping high-attention edges back to Petri net fragments reveals that the model concentrates on transitions known to cause bottlenecks. Figure 12 displays a Petri net fragment with highlighted transitions and places. This correspondence suggests that the attention mechanism learns to identify structural causes of delay and supports explanation to process analysts.
7. Discussion
The experiments demonstrate that ProcessGFM achieves consistent improvements over strong baselines across three predictive monitoring tasks and multiple real-life event logs. The gains are larger for complex logs with many activities, long traces, and heterogeneous resources, which aligns with the intuition that graph structure and cross-log pretraining deliver greater benefits when event behaviour is highly variable. The results also suggest that large graph pretraining models can meaningfully shift process mining from log-specific architectures toward more unified graph-based representations capable of supporting broad generalisation.
From a methodological perspective, several observations emerge. First, domain-specific self-supervised pretraining tasks that exploit both control-flow dependencies and temporal regularities clearly strengthen downstream performance. The three objectives employed in ProcessGFM contribute complementary information: masked activity reconstruction captures local behavioural patterns, temporal order consistency enforces sensitivity to execution constraints, and pseudo-labelled outcomes encourage global case-level discrimination. Their combined effect indicates that the structure of event logs is sufficiently rich to support large-scale representation learning beyond traditional token-level objectives, and the loss-weight sensitivity analysis in Figure 9a,b,d (with full numerical values in Table A8 and Table A9 in Appendix B) shows that these benefits are stable across a reasonable range of pretraining weight configurations. Second, modelling events, cases, and resources within a single hierarchical graph proves advantageous. This design allows the backbone to integrate short-range event transitions with longer-range case evolution and contextualise them through resource interactions, producing embeddings that are not easily replicated by models confined to a single representational layer. The ablation study shows that removing any of these components degrades performance, with the largest impact observed when the model is restricted to event-only graphs.
Third, domain alignment strengthens cross-log robustness even when differences in activity vocabularies or organisational structures are substantial. By discouraging the backbone from encoding log-specific artefacts, the adversarial module improves stability across logs and reduces the need for per-log fine-tuning. This is particularly relevant in practical deployments where organisations may wish to train a model once and reuse it across multiple processes.
Despite these strengths, several limitations remain. The experimental logs, while diverse, are all derived from structured administrative processes with relatively clear control-flow patterns. It is plausible that more irregular settings such as customer service interactions, object-centric manufacturing logs, or multi-entity event streams may pose additional challenges, especially where entities are loosely coupled or events occur asynchronously.
Furthermore, while the pretraining corpus spans multiple years of public benchmarks, its scale remains modest compared with emerging graph foundation models trained on millions of graphs. Incorporating larger corpora, including semi-synthetic or augmented data, may enable further gains and better capture rare behavioural patterns. The pretraining scale ablation in Table A7 and the hyperparameter study in Figure 9 indicate that enlarging the corpus from one to four logs yields consistent but gradual improvements and that performance is stable across a reasonable range of self-supervised and alignment weights, which supports viewing ProcessGFM as a moderate-scale, domain-specific prototype rather than a universal model. Another limitation concerns interpretability. Attention-based visualisation provides a useful indication of which subgraphs influence model predictions, yet these explanations remain heuristic and depend on internal weighting mechanisms. Integrating the learned graph representations with formal conformance checking, structural simplification, or region-based process discovery could lead to more rigorous, semantically grounded interpretability. Finally, although ProcessGFM performs well on zero-shot transfer across similar domains, entirely new processes with radically different vocabularies or execution cultures may still require adaptation strategies tailored to open-world settings.
An additional asymmetry arises when considering the synthetic Hospital log. ProcessGFM attains strong in-domain risk prediction performance on both BPI2019 (AUC 0.912) and Hospital (0.921), yet transferring from the real BPI2019 log to Hospital yields a slightly lower AUC (0.894) than transferring in the opposite direction (0.902; Table A3). We attribute this behaviour to the “clean” and simplified control flow of the synthetic benchmark, which does not fully reflect the noise, exceptional behaviour, and irregular resource usage patterns present in the real logs. Representations learned from real logs must account for such irregularities and therefore do not align perfectly with the more regular synthetic behaviour, whereas representations learned on the synthetic log focus on generic patterns that still carry over to real logs. The domain alignment module mitigates, but does not completely remove, this distribution shift, which we consider an interesting direction for future work on robustness across synthetic and real-world benchmarks.
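The size of this asymmetry can be made explicit with a short calculation on the AUC values quoted above. The sketch assumes that each transfer AUC is measured on the target log and compares it with the in-domain AUC of that same target; the function name and this choice of reference point are illustrative assumptions.

```python
# AUC values as quoted in the text (Table A3).
in_domain = {"BPI2019": 0.912, "Hospital": 0.921}
transfer = {
    ("BPI2019", "Hospital"): 0.894,   # real -> synthetic
    ("Hospital", "BPI2019"): 0.902,   # synthetic -> real
}

def transfer_gap(source, target):
    """AUC lost on the target log relative to training on that log directly."""
    return round(in_domain[target] - transfer[(source, target)], 3)

real_to_synthetic = transfer_gap("BPI2019", "Hospital")
synthetic_to_real = transfer_gap("Hospital", "BPI2019")
```

Under this reading, the real-to-synthetic direction loses 0.027 AUC relative to the in-domain Hospital model, whereas the synthetic-to-real direction loses 0.010 relative to the in-domain BPI2019 model, so the gap is larger when the target is the synthetic log.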
Overall, the findings suggest that ProcessGFM represents a meaningful step toward more general, transferable, and structurally aware predictive monitoring models. The results indicate that combining hierarchical graph reasoning, targeted self-supervision, and domain alignment offers a viable pathway for future research aiming to bridge the gap between traditional process modelling and modern graph representation learning. In addition, adaptive loss balancing and automated hyperparameter scheduling may further reduce manual tuning and improve robustness across processes with heterogeneous control-flow structure and outcome distributions.
8. Conclusions
This paper has introduced ProcessGFM, a domain-specific graph pretraining prototype for predictive process monitoring on event graphs. The model combines a hierarchical graph backbone that integrates event-level, case-level, and resource-level information with self-supervised pretraining, multi-task adaptation, and adversarial domain alignment. Evaluations on three BPI logs and a hospital log show that ProcessGFM improves next-activity prediction, remaining-time estimation, and risk classification compared with strong sequence-based and graph-based baselines. It also demonstrates robust cross-log transfer and provides interpretable insights by highlighting process fragments associated with delay and risk.
Future work includes extending ProcessGFM to object-centric event logs and multi-process settings, integrating text attributes through joint training with language models, and exploring more scalable pretraining regimes inspired by universal graph foundation models. Another promising direction is to couple ProcessGFM with automated process improvement techniques, where the model not only predicts outcomes but also suggests interventions.