Article

A Process Tree-Based Incomplete Event Log Repair Approach

1 School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, China
2 School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266520, China
3 Jinan Inspur Technology Co., Ltd., Jinan 250101, China
4 NOVA Information Management School, Nova University of Lisbon, 1099-085 Lisbon, Portugal
* Author to whom correspondence should be addressed.
Information 2025, 16(5), 390; https://doi.org/10.3390/info16050390
Submission received: 2 April 2025 / Revised: 1 May 2025 / Accepted: 6 May 2025 / Published: 8 May 2025

Abstract

The low quality of business process event logs—particularly the widespread occurrence of incomplete traces—poses significant challenges to the reliability, accuracy, and efficiency of process mining analysis. In real-world scenarios, these data imperfections severely undermine the practical value of process mining techniques. The primary research problem addressed in this study is the inefficiency and limited effectiveness of existing Petri-net-based incomplete trace repair approaches, which often struggle to accurately recover missing events in the presence of complex and nested loop structures. To tackle these limitations, we aim to develop a faster and more accurate approach for repairing incomplete event logs. Specifically, we propose a novel repair approach based on process trees as an alternative to traditional Petri nets, thus alleviating issues such as state space explosion. Our approach incorporates process tree model decomposition and innovative branch indexing techniques, enabling rapid localization of candidate branches for repair and a significant reduction in the solution space. Furthermore, by leveraging activity information within the traces, our approach achieves efficient and precise repair of loop nodes through a single traversal of the process tree. To comprehensively evaluate our approach, we conduct experiments on four real-life and five synthetic event logs, comparing performance against state-of-the-art techniques. The experimental results demonstrate that our approach consistently delivers repair accuracies exceeding 70%, with time efficiency improved by up to three orders of magnitude. These findings validate the superior accuracy, efficiency, and scalability of the proposed approach, highlighting its strong potential for practical applications in business process mining.

1. Introduction

Business process execution inevitably generates large volumes of event logs, which capture actual execution traces and serve as a fundamental data source for process mining and analysis [1,2]. While substantial progress has been achieved in process mining, such as complex event stream processing [3], frequent pattern mining [4], and source analysis [5], there remains a notable gap in research addressing the quality issues inherent in event logs.
A particularly critical quality issue is the presence of incomplete traces within event logs, which significantly undermines the accuracy and reliability of process mining results [6]. Incomplete traces often arise from various factors, primarily including human errors, system malfunctions and deviations caused by data collection and integration in heterogeneous execution environments. For instance, in the production process of a manufacturing enterprise, a complete process instance should record the entire process from “raw material procurement”, “production planning”, “manufacturing”, “quality inspection” to “finished product storage”. Nonetheless, due to system failures or human errors, certain activities may not be executed or recorded. For example, a trace generated by a batch of products is recorded as ⟨raw material procurement, production planning, quality inspection, finished product storage⟩, with the activity “manufacturing” omitted.
Such omissions not only hinder accurate identification of production bottlenecks, but also lead to erroneous assessments of activity efficiency. These inaccuracies can adversely affect enterprise resource optimization, overall operational efficiency and delivery timelines [7,8]. Furthermore, incomplete traces complicate process compliance verification by disrupting temporal and sequential information, thereby making it challenging to detect non-compliant decision paths. Therefore, repairing incomplete traces and enhancing event log quality are crucial for achieving more reliable business process analysis [9].
The main research problem addressed in this study is that existing Petri-net-based incomplete trace repair approaches exhibit low efficiency and often fail to accurately recover missing events, especially when dealing with complex and nested loop structures. These approaches typically rely on multiple traversals or simplistic repetition-based logic to handle loops, which proves inadequate for capturing intricate control-flow constructs involving concurrency or choice. For example, Song et al. [10] utilized process decomposition and trace segmentation techniques to reduce the search space. However, this approach did not fundamentally improve the alignment techniques, which leads to less-than-optimal time efficiency. Similarly, Wang et al. [6] proposed a gap search approach that requires frequent searches for gaps between places and transitions, which is also inefficient. To improve efficiency, Song et al. [11] further employed process decomposition to break down choice structures. Although these approaches consider loop structure processing, they typically depend on multiple traversals for repairs or determine the number of loop executions based on event repetition counts [12]. They fail to adequately address complex loop structures, particularly those involving concurrency or choice constructs, leading to prolonged repair times or inaccurate results.
To address these challenges, we aim to develop a faster and more accurate approach for repairing incomplete event logs, focusing on improving both efficiency and correctness in scenarios involving complex loop structures. Specifically, this study proposes a process tree-based incomplete trace repair (ITR) approach, which replaces traditional Petri nets with process trees to alleviate problems such as state space explosion. By introducing process tree decomposition algorithms and branch indexing techniques, our approach can swiftly locate candidate branches for repair and significantly reduce the solution space. Moreover, by utilizing activity information embedded in traces to determine loop execution counts, our approach achieves efficient and accurate repair of loop nodes through a single traversal of the process tree.
The main contributions of this work are summarized as follows: (1) efficiency improvement: we leverage the process tree structure to simplify process representation and computation, thus overcoming inefficiencies in Petri-net-based repair approaches; (2) solution space optimization: we introduce process model decomposition and branch indexing strategies to avoid unnecessary traversal of irrelevant branches, further enhancing repair efficiency; (3) accurate handling of complex loops: we design a novel repair mechanism capable of precisely repairing nested and complex loop structures in a single pass, ensuring both high accuracy and scalability.
The rest of this paper is organized as follows: Section 2 reviews existing work on repairing incomplete traces; Section 3 provides a detailed description of the proposed approach in this paper for repairing incomplete traces; Section 4 evaluates the accuracy and efficiency of the proposed approach; Section 6 concludes this paper.

2. Related Work

Existing approaches to incomplete trace repair fall into four main categories: process-model-based, interpolation-based, neural-network-based, and trace-clustering-based approaches.

2.1. Process-Model-Based Approach

Process-model-based repair approaches leverage known process models, such as Petri nets, as a reference framework to fill in missing activities based on the logical constraints of the model. The primary advantage of these approaches is their ability to utilize the prior knowledge inherent in the process model, ensuring that repair results are consistent with the underlying business logic. Wang et al. [6] proposed a gap-search-based repair strategy, but this approach encounters efficiency challenges due to the frequent need to query gaps between the current place and subsequent transitions. To mitigate this issue, Song et al. [11] introduced process decomposition techniques to enhance efficiency by breaking down choice structures, though this approach is mainly effective for simpler loop structures. In subsequent research, Song et al. [10] refined the traditional alignment approach through process decomposition and trace segmentation techniques, which effectively reduced the search space and expedited the repair process. Nonetheless, this approach still traverses some irrelevant branches and does not fundamentally overcome the limitations of the original alignment approach, resulting in relatively low overall efficiency. Additionally, Song et al. [12] combined log clustering with submodel mining techniques to facilitate trace repair. However, when dealing with loop structures, the approach relies only on comparing activity repetition frequencies within traces, leading to inaccurate repair results for complex loop structures.

2.2. Interpolation-Based Approach

Interpolation approaches rely on statistical inference techniques to complete missing information by examining the relationships between attributes or activities in event logs. Rogge-Solti et al. [13] introduced an approach that utilizes path probabilities and Bayesian networks to repair missing activities and timestamps in logs. By combining probabilistic models and statistical inference, this approach can effectively fill in missing activities and their corresponding timestamps to a significant extent. Sim et al. [14,15] leveraged cardinality relationships among event attributes for interpolation-based repair. However, the success of these approaches heavily depends on the availability of sufficient cardinality relationships and the richness of attribute data in the logs. If the logs lack such relationships or if the attribute data are sparse, the performance of these approaches will be limited.

2.3. Neural-Network-Based Approach

Neural-network-based approaches leverage deep learning models to identify complex behavioral patterns within event logs and reconstruct missing activities. Nguyen et al. [16,17] utilized autoencoders to restore missing event attributes, as autoencoders possess the capability to learn intricate, nonlinear relationships among attribute values without any prior knowledge. Lu et al. [18] employed long short-term memory (LSTM) networks to predict missing activity based on contextual sequences in the logs. Fang et al. [19] introduced an end-to-end multi-task repair model based on BERT, alongside the development of multi-perspective explainability algorithms. Wu et al. [20] investigated the application of masked Transformer models for repair tasks, utilizing contextual information from traces to predict masked activities and discern underlying behavioral patterns in event logs. Nonetheless, these approaches generally require prior knowledge of the precise locations of missing events. In scenarios where such locations are not pre-determined, manual annotation becomes essential, incurring substantial labor costs.

2.4. Trace-Clustering-Based Approach

Clustering and statistical approaches address the repair of missing activities by analyzing trace similarities and activity frequency distributions, exploiting the characteristics inherent to the clusters. Liu et al. [21] proposed an approach for constructing an activity relationship matrix based on direct succession relations and performed clustering on the logs. By comparing the similarity between incomplete traces and clustering results, appropriate activities were selected for repair. Fani et al. [22] repaired anomalous traces based on the frequency distribution of activities within the context. Xu et al. [23] introduced an activity label repair approach based on trace clustering, while Fang et al. [24] extended this approach by incorporating contextual probabilities to improve repair accuracy further. However, the effectiveness of these approaches diminishes significantly in event logs with substantial noise or extensive missing data.

3. Incomplete Event Log Trace Repair Approach

This section describes the proposed method for repairing incomplete event log traces and explains its implementation in detail.

3.1. An Approach Overview

The overview of the incomplete trace repair approach for event logs is illustrated in Figure 1. Event log A represents the complete event log, whereas event log B represents the log containing incomplete traces derived from the same business process. This approach consists of the following four steps:
Step 1: model discovery. Taking event log A as input, the process model is mined by the discovery algorithm (e.g., the Alpha algorithm [25], Heuristic Miner algorithm [26]), and then converted into the corresponding process tree structure.
Step 2: process tree decomposition. Taking the process tree as input, it is decomposed and merged into multiple subtrees based on node type, simplifying the subsequent repair process and enhancing repair efficiency.
Step 3: incomplete trace detection. Event log B is used as input, and incomplete trace detection approaches [27,28,29] are employed to identify and select incomplete traces.
Note that the focus of this paper is on incomplete trace repair, and a detailed discussion of incomplete trace detection approaches is beyond its scope.
Step 4: incomplete trace repair. Taking the decomposed subtrees and incomplete trace as input, a branch index is established to enhance search efficiency. Repair strategies are designed based on node types, and the minimal repair sequence that adheres to the process model constraints is identified within the process tree, resulting in the repaired complete trace.

3.2. Model Discovery

Accurate repair of incomplete traces often benefits from the availability of a corresponding process model, as such prior knowledge substantially enhances both the reliability and effectiveness of the repair procedure. In real-world business environments, process models are frequently accessible and can be leveraged to guide trace repair. However, in this study, all event logs utilized are publicly available logs that lack associated process models. To address this limitation, we employ the Inductive Miner (IM) algorithm [30] to automatically discover process models directly from the event logs.
The IM algorithm adopts a divide-and-conquer strategy [31], enabling efficient processing of large-scale logs while generating process models that are readily interpretable. Compared to alternative discovery techniques, the IM algorithm exhibits superior performance in capturing concurrent activities, and its output—a process tree—exhibits strong compatibility with subsequent process tree-based repair approaches.
Given the absence of ground-truth process models and uncertainty regarding the completeness of the logs, we reasonably assume that the event logs are sufficiently complete for the purposes of process discovery, that is, they do not exhibit systematic omissions or pervasive recording errors at the discovery step. The discovered process model subsequently serves as a reference standard throughout the trace repair procedure.
For example, consider an event log L with 10 traces, where L = [⟨a, b, c, g⟩^1, ⟨a, c, d, f, g⟩^4, ⟨a, b, d, e, g⟩^2, ⟨a, c, d, e, g⟩^2, ⟨a, b, d, e, f, g⟩^1], and the superscript numbers represent the frequency of each trace in L. The process tree mined from log L using the IM algorithm is shown in Figure 2.

3.3. Process Tree Decomposition

In process mining and trace repair, the complex structures of process trees, particularly those with choice nodes, frequently result in substantial computational challenges. To enhance the efficiency of subsequent analysis and repair, process tree decomposition is introduced as an effective strategy. Decomposing choice nodes reduces the complexity of the process tree, improving the efficiency of the repair process.
The fundamental concept of the process tree decomposition algorithm, as outlined in [32], involves traversing the process tree to decompose and merge it based on its structural and semantic characteristics, thereby facilitating simplified process analysis and repair. The algorithm proceeds through several steps: Initially, choice nodes in the process tree are identified, and the process tree is decomposed into subtrees according to the decomposition operation, with each subtree representing a unique execution path. Subsequently, these subtrees are merged based on the semantics of sequential and concurrent nodes. Subtrees associated with sequential nodes are connected in their execution order, whereas those associated with concurrent nodes maintain their parallel execution. In cases where the root node is a loop node, decomposition is avoided to preserve the integrity of the subprocess tree. By decomposing complex process trees, the solution space for repair is reduced, enabling independent analysis and repair of each subtree, thereby increasing the efficiency of repairing incomplete traces.
Figure 3 shows the decomposition process of process tree (a). The root node of process tree (a) is a choice node. Based on the decomposition operation, process tree (a) is decomposed into two subtrees, which are then recursively decomposed. The left subtree consists of sequential nodes and activities a and b, which are merged to form process subtree (b). Meanwhile, the right subtree contains a loop node, which is retained as process subtree (c) without further decomposition, thereby preserving the integrity of the loop structure.
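The decomposition idea can be illustrated with a minimal Python sketch. It is not the algorithm of [32]: the `Node` class, the operator names, and the contents of the loop in the example tree are assumptions made for illustration, and the sketch decomposes every choice node outside a loop rather than applying any path-length threshold.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical process tree node: leaves carry a label, operators carry children."""
    operator: Optional[str] = None  # "SEQ", "AND", "XOR", "LOOP", or None for a leaf
    label: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf(label: str) -> Node:
    return Node(label=label)


def decompose(node: Node) -> List[Node]:
    """Return choice-free variants of `node`; loop subtrees are kept intact."""
    if node.operator is None or node.operator == "LOOP":
        return [node]
    variants = [decompose(child) for child in node.children]
    if node.operator == "XOR":
        # Decomposition: every branch of a choice becomes its own subtree.
        return [v for branch in variants for v in branch]
    # Merging: SEQ and AND keep their operator and combine one variant per child.
    return [Node(node.operator, children=list(combo)) for combo in product(*variants)]


def show(node: Node) -> str:
    """Render a tree as text, e.g. SEQ(a, b)."""
    if node.operator is None:
        return node.label
    return f"{node.operator}({', '.join(show(c) for c in node.children)})"


# Assumed shape of process tree (a): a choice between a sequence and a loop.
tree_a = Node("XOR", children=[
    Node("SEQ", children=[leaf("a"), leaf("b")]),
    Node("LOOP", children=[leaf("c"), leaf("d")]),
])
subtrees = [show(t) for t in decompose(tree_a)]
print(subtrees)
```

Under these assumptions the call yields two subtrees, one for the sequence and one for the untouched loop, mirroring subtrees (b) and (c) described above.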

3.4. Incomplete Trace Repair

In this section, the branch index construction method based on process trees and the repair methods for each operation node are presented.

3.4.1. Branch Index Construction

In Section 3.3, we discuss the decomposition of choice nodes. To control the number and complexity of subprocess trees, only choice nodes with a path length of four or greater are generally retained, which results in some deep-level choice nodes remaining within the process tree. During subsequent trace repair, these nodes may require traversing irrelevant branches, thereby reducing repair efficiency. Since the process tree cannot contain duplicate activities with the same name, an effective optimization strategy for the repair process is to prune the choice branches that do not include any activity of the input sequence. This approach not only reduces the solution space but also enhances repair efficiency. To implement this, a branch index is constructed on the process tree. This index facilitates the rapid identification of branches that align with the input sequence, effectively eliminating irrelevant branches and significantly boosting computational efficiency.
Given the process tree in Figure 4 and the trace to be repaired σ1 = ⟨a, b, f⟩, the repair condition requires that the intersection of the node’s index sequence with the trace is non-empty. The repair process proceeds as follows: Starting from the root node v0, the index sequence is i = ⟨a, b, c, d, e, f⟩, and i ∩ σ1 = σ1, which satisfies the repair condition, allowing the traversal to continue to its child nodes. For v1.1, the intersection i1 ∩ σ1 = ⟨a, b⟩ yields a partial repair result of ⟨a, b⟩. For v1.2, the intersection i2 ∩ σ1 = ⟨f⟩ satisfies the repair condition and prompts further traversal. For v2.3, the intersection i21 ∩ σ1 = ∅ indicates that branch (a) contains no activities from σ1, so the branch is pruned, which improves efficiency. For v2.4, the intersection i22 ∩ σ1 = ⟨f⟩ yields a partial repair result of ⟨e, f⟩. Finally, all partial repair results that satisfy the condition are combined in topological order to produce the final repair result σ1′ = ⟨a, b, e, f⟩.
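The walk-through above can be reproduced with a short Python sketch. The tree shape is an assumption inferred from the text (Figure 4 is not available here), the `Node` class and function names are hypothetical, and only sequential and choice nodes are handled. The fallback to the shortest branch when no branch matches is also an assumption, motivated by the minimal change principle.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical process tree node with a branch index."""
    operator: Optional[str] = None  # "SEQ", "XOR", or None for a leaf
    label: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    index: Set[str] = field(default_factory=set)


def leaf(label: str) -> Node:
    return Node(label=label)


def build_index(node: Node) -> Set[str]:
    """Store on each node the set of activity labels found in its subtree."""
    if node.operator is None:
        node.index = {node.label}
    else:
        node.index = set()
        for child in node.children:
            node.index |= build_index(child)
    return node.index


def repair(node: Node, trace: List[str]) -> List[str]:
    """Repair a trace against SEQ/XOR nodes, pruning branches via the index."""
    observed = set(trace)
    if node.operator is None:
        return [node.label]
    if node.operator == "SEQ":
        # All children of a sequence are emitted in their defined order.
        return [a for child in node.children for a in repair(child, trace)]
    if node.operator == "XOR":
        for child in node.children:
            if child.index & observed:  # repair condition: non-empty intersection
                return repair(child, trace)
        # Assumption: if no branch matches, take the branch with fewest activities.
        return repair(min(node.children, key=lambda c: len(c.index)), trace)
    raise ValueError(f"unsupported operator: {node.operator}")


# Assumed shape of the tree in Figure 4.
tree = Node("SEQ", children=[
    Node("SEQ", children=[leaf("a"), leaf("b")]),            # v1.1
    Node("XOR", children=[                                    # v1.2
        Node("SEQ", children=[leaf("c"), leaf("d")]),        # v2.3, pruned
        Node("SEQ", children=[leaf("e"), leaf("f")]),        # v2.4, selected
    ]),
])
build_index(tree)
repaired = repair(tree, ["a", "b", "f"])
print(repaired)
```

Because the index is computed once per tree, each choice node is resolved with set intersections instead of a traversal of every branch.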

3.4.2. Operation Node Repair

The process tree encompasses four types of operation nodes: sequential, concurrent, choice and loop nodes. When repairing incomplete traces, two crucial guiding principles are employed to ensure the rationality and efficiency of the repair results: the principle of minimal changes and the principle of equivalent repair. Based on these principles, specific repair approaches are designed for each type of operation node to optimize the overall repair process.
Strategy 1 (minimal change principle): The minimal change principle, extensively applied in data repair [33,34], focuses on minimizing alterations during the repair process. It assumes that error occurrences in real-world scenarios exhibit minimal disruption to existing data. In repairing incomplete event logs, this principle suggests that the number of missing events should be minimized, with the likelihood of losing multiple events being much lower than that of losing a single event.
Strategy 2 (equivalence repair principle): Concurrent activities in event logs can result in multiple repair solutions that conform to the minimal change principle. These differences are regarded as insignificant and are treated as equivalent repairs.
Building upon the two principles, we introduce a comprehensive repair method designed for different operational nodes within a process tree, encompassing sequential, concurrent, choice, and loop nodes. These methods are designed to ensure logical consistency and optimize computational efficiency throughout the repair process. Detailed explanations of the repair techniques specific to each type of operation node are provided.
1. Sequential Node
The child nodes of a sequential node must be executed in the order specified by the process tree. Therefore, when repairing a sequential node, it is essential to traverse the child nodes according to their defined order and include all of them in the repair result, thereby producing the repaired sequence. Even if some child nodes correspond to activities missing from the trace being repaired, they must still be included to maintain the semantics of sequential execution.
For example, consider the subtree depicted in Figure 5a and the trace to be repaired, denoted as σ2 = ⟨a⟩. Given that the sequential node enforces the execution order of its child nodes, the trace should be evaluated in the order a → b. Since σ2 contains activity a, it is retained in the repaired trace. Although activity b is missing from σ2, it must still be added to preserve the intended execution path. Consequently, the repaired trace σ2′ is ⟨a, b⟩.
2. Concurrent Node
The execution order of the child nodes within a concurrent node is arbitrary. Consequently, repairing a concurrent node can yield multiple equivalent repair results. The equivalence repair principle is applied, wherein any repair that includes all activities corresponding to the child nodes is considered valid because the sequence in which child nodes are executed does not influence the overall repair results.
In the repair process, the concurrent nodes are initially identified, followed by the individual handling of each child node. For each child node’s activities, any sequence that satisfies the constraints of the process model and encompasses all relevant activities is considered a valid repair result.
Given the subtree shown in Figure 5b and the trace σ3 = ⟨a, c⟩ to be repaired, the execution order of child nodes within a concurrent node is arbitrary. Consequently, traces containing activities a, b and c (such as ⟨a, b, c⟩, ⟨a, c, b⟩, ⟨b, a, c⟩) are all considered equivalent repair results. To select a single result, the process takes into account the position of activities within the child nodes. In this case, trace σ3 is missing activity b. According to Figure 5b, b is situated between a and c, so the repair result σ3′ is ⟨a, b, c⟩.
3. Choice Node
For a choice node, the repair process begins by calculating the intersection between the trace to be repaired and the branch index of the node. This step identifies the child nodes that need to be traversed and determines the repair sequence. A corresponding subtree is selected for repair only if the intersection is non-empty. If the intersection of a branch is empty, this indicates that the branch does not contain any activities from the trace to be repaired. In this case, the branch can be pruned to enhance repair efficiency.
Taking the subtree shown in Figure 5c and the trace σ4 = ⟨e⟩ to be repaired as an example, since the intersections of the index sequences of nodes v2.1 and v2.2 with σ4 are empty, subtrees (1) and (2) are pruned. The intersection of the index sequence of node v2.3 with the trace, i3 ∩ σ4 = ⟨e⟩, is non-empty, so subtree (3) is selected for repair. The final repair result σ4′ is ⟨e, f⟩.
4. Loop Node
The repair of loop structures utilizes a breadth-first search algorithm to traverse child nodes and detect potential nesting relationships. For nested loops, the number of iterations for the inner loop is determined first, and repairs are performed based on a predefined repair strategy. The same strategy is then recursively applied to the outer loop. For non-nested loops, the same steps are directly executed to ensure that the repaired trace conforms to the expected flow model.
In accordance with the definition of the process tree, the execution counts of the loop node’s branches satisfy the relationship r(N1) = r(N2) + 1, where N1 represents the left branch of the loop node, N2 represents the right branch, and r(·) calculates the execution count of a branch. Existing research determined the execution count of loops based on the repetition frequency of activities [7], disregarding structural relationships between these activities, which resulted in poor performance during practical experiments. An enhanced method is introduced for calculating the execution counts of loop node branches, which considers both the frequency of activity occurrences and the characteristics of the related operation nodes, as shown in Algorithm 1. By separately determining the execution counts of the left subtree N1 and the right subtree N2, the final execution count of the loop is taken as m = max{r(N1), r(N2) + 1}.
For loop nodes, it is necessary to traverse from bottom to top to accurately calculate the execution count of each operation node and ultimately determine the number of iterations for the loop node. The algorithm addresses the following three cases based on the type of operation node:
(1) When the current operation node is a choice node, all of its child nodes are traversed, and it is checked whether the activities contained in each child node appear in the trace. Each time a choice node is executed, only one child node is executed. Therefore, the execution count is recursively calculated (lines 3–14).
(2) When the current operation node is a concurrent node or a sequential node, the occurrence count of each child node is recorded. The maximum occurrence count among these is then used as the execution count of the node (lines 15–27).
(3) When the current operation node is a loop node, it represents a nested loop scenario. The execution count of the outer loop is determined by the intersection of the branch indices of the inner loop node with the trace that needs repair. If this intersection is non-empty, the execution count of the inner loop node is set to 1 (lines 28–32).
Algorithm 1: Loop iteration determination
Input: loop node Loop, trace t
Output: iteration count loop_count of the current loop node
1. initialize loop_count_para_seq_list = [], loop_count_xor = 0, loop_count_loop = 0, loop_count = 0
2. for child in Loop.children do
3.  if child.operator == Operator.XOR then
4.   for xor_child in child.children do
5.    if xor_child.label != None then
6.     xor_join = intersection_of_lists(t, xor_child.label)
7.     loop_count_xor += max_occurrences(xor_join, t)
8.    end if
9.    if xor_child.operator != None then
10.     loop_count_xor += JudgeLoopCount(xor_child, t)
11.    end if
12.   end for
13.   loop_count = loop_count_xor
14.  end if
15.  if child.operator == Operator.PARALLEL or child.operator == Operator.SEQUENCE then
16.   for child_para_seq in child.children do
17.    if child_para_seq.label != None then
18.     loop_count_para_seq = max_occurrences(child_para_seq.label, t)
19.     loop_count_para_seq_list.add(loop_count_para_seq)
20.    end if
21.    if child_para_seq.operator != None then
22.     loop_count_para_seq = JudgeLoopCount(child_para_seq, t)
23.     loop_count_para_seq_list.add(loop_count_para_seq)
24.    end if
25.   end for
26.   loop_count = max(loop_count_para_seq_list)
27.  end if
28.  if child.operator == Operator.LOOP then
29.   if intersection_of_lists(child.label, t) then
30.    loop_count_loop = 1
31.   end if
32.  end if
33. end for
34. return loop_count
Given the loop structure illustrated in Figure 6 and the trace σ5 = ⟨a, e, f, e, d, e, e, a⟩, we first examine the execution count of the left subtree v2.1 within the loop node. As v2.1 is identified as a choice node, we calculate r(v3.1) = 2 and r(v3.2) = 1. Therefore, the execution count for v2.1 is calculated as r(v2.1) = r(v3.1) + r(v3.2) = 3. Subsequently, the execution count of the right subtree v2.2 is analyzed. This count is established by intersecting its index sequence i2 with σ5. As the intersection i2 ∩ σ5 = ⟨e, f⟩ is non-empty, it follows that r(v2.2) = 1. Ultimately, the execution count of the loop node v1.1 is defined by max{r(v2.1), r(v2.2) + 1} = 3. In the repair process, the first iteration of the loop yields the sequence ⟨a, b, e, f, e⟩, while the second iteration produces ⟨c, d, e, f, e⟩. In the third and final iteration, only the left subtree is executed, producing the sequence ⟨a, b⟩. Consequently, the final repaired trace σ5′ is ⟨a, b, e, f, e, c, d, e, f, e, a, b⟩.
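The three cases for determining loop iterations can be sketched in Python as follows. This is an illustrative simplification, not the authors' implementation of Algorithm 1: the `Node` class and function names are hypothetical, and the tree shape is an assumption inferred from the worked example (a choice between two sequences as the left branch and a nested loop over e and f as the right branch).

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical process tree node: leaves carry a label, operators carry children."""
    operator: Optional[str] = None  # "SEQ", "AND", "XOR", "LOOP", or None for a leaf
    label: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf(label: str) -> Node:
    return Node(label=label)


def labels(node: Node) -> Set[str]:
    """Branch index: all activity labels in the subtree."""
    if node.operator is None:
        return {node.label}
    return set().union(*(labels(c) for c in node.children))


def exec_count(node: Node, freq: Counter) -> int:
    """Estimate r(node), the execution count of a branch, from activity frequencies."""
    if node.operator is None:
        return freq[node.label]
    if node.operator == "XOR":
        # Case 1: each execution picks exactly one child, so the counts add up.
        return sum(exec_count(c, freq) for c in node.children)
    if node.operator in ("SEQ", "AND"):
        # Case 2: every child runs on each execution, so take the maximum.
        return max(exec_count(c, freq) for c in node.children)
    if node.operator == "LOOP":
        # Case 3: a nested loop counts once if any of its activities was observed.
        return 1 if any(freq[a] > 0 for a in labels(node)) else 0
    raise ValueError(f"unsupported operator: {node.operator}")


def loop_iterations(loop: Node, trace: List[str]) -> int:
    """m = max{r(N1), r(N2) + 1} for a loop node with left branch N1 and right branch N2."""
    freq = Counter(trace)
    left, right = loop.children
    return max(exec_count(left, freq), exec_count(right, freq) + 1)


# Assumed shape of the loop in Figure 6.
v1_1 = Node("LOOP", children=[
    Node("XOR", children=[                                  # v2.1
        Node("SEQ", children=[leaf("a"), leaf("b")]),      # v3.1
        Node("SEQ", children=[leaf("c"), leaf("d")]),      # v3.2
    ]),
    Node("LOOP", children=[leaf("e"), leaf("f")]),         # v2.2
])
sigma5 = ["a", "e", "f", "e", "d", "e", "e", "a"]
m = loop_iterations(v1_1, sigma5)
print(m)
```

With the assumed tree, the sketch reproduces r(v2.1) = 3, r(v2.2) = 1, and m = 3 from the worked example in a single recursive pass over the loop's subtree.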

4. Experimental Evaluation

This section describes the event logs employed in the experiments and the method for constructing their incomplete traces. Furthermore, a systematic analysis of the repair accuracy and time efficiency of the ITR method is presented. The experiments are executed in a Python 3.7 environment on a 12th Gen Intel(R) Core(TM) i7-12650H 2.30 GHz processor, equipped with 16 GB RAM and an NVIDIA GeForce RTX 3070 Laptop GPU.
To ensure the persuasiveness of the experimental results and validate the robustness of the proposed method, nine event logs are used for the experiments. These event logs include four real logs (BPI_12, BPI_13, BPI_13_CP, and Helpdesk) and five synthetic logs (Syn1, Syn2, Syn3, Small_log, and Large_log) (https://github.com/wangqiushi175/Incomplete-Trace-Repair/tree/main/Datasets, accessed on 2 May 2025). Table 1 provides a comprehensive description of the key characteristics of these logs. In contrast to real logs, the synthetic logs are intentionally designed to be more structured, thereby offering more targeted support for the performance evaluation of the method.
Based on the classification method for missing data proposed by Schafer et al. [35], it is assumed that the missing event patterns in the event log are Missing Completely At Random (MCAR). This assumption implies that the probability of an event being missing is independent of both the observed data and any unobserved variables. To explore the impact of the trace missing rate T and activity missing rate P on the performance of incomplete trace repair methods, we provide the following definitions:
Trace missing rate (T): the proportion of incomplete traces (traces with at least one missing event) relative to the total number of traces in the event log.
Activity missing rate (P): the proportion of missing events in each incomplete trace relative to the total number of events originally present in that trace.
When assessing the accuracy of incomplete trace repair, the trace missing rate T does not directly influence the repair results since the evaluation focuses on the quality of the repair for each individual trace. Consequently, the experiment fixes T at 20% while varying the activity missing rate P at four distinct levels: 20%, 30%, 40%, and 50%. These four values of P represent different degrees of missingness, facilitating a more comprehensive evaluation of the method’s performance under varying conditions of activity missing rates.
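One way to construct incomplete traces under the MCAR assumption with given values of T and P is sketched below. This is an illustrative sketch, not the authors' procedure: the function name `make_incomplete_log`, the use of rounding to turn rates into counts, and the fixed random seed are all assumptions.

```python
import random


def make_incomplete_log(log, trace_rate, activity_rate, seed=0):
    """Return a copy of `log` with events removed completely at random.

    A fraction `trace_rate` (T) of the traces is selected uniformly at random,
    and each selected trace loses a fraction `activity_rate` (P) of its events,
    also chosen uniformly at random. Counts are rounded (an assumption).
    """
    rng = random.Random(seed)
    n_incomplete = round(trace_rate * len(log))
    incomplete_ids = set(rng.sample(range(len(log)), n_incomplete))

    result = []
    for index, trace in enumerate(log):
        if index not in incomplete_ids:
            result.append(list(trace))  # untouched trace
            continue
        n_drop = round(activity_rate * len(trace))
        dropped = set(rng.sample(range(len(trace)), n_drop))
        # Keep the remaining events in their original order.
        result.append([e for pos, e in enumerate(trace) if pos not in dropped])
    return result


# Toy log: 10 identical traces of 10 events each (hypothetical data).
toy_log = [list("abcdefghij") for _ in range(10)]
incomplete_log = make_incomplete_log(toy_log, trace_rate=0.2, activity_rate=0.3)
```

Because the positions to drop are sampled independently of the activity labels, the probability that an event goes missing does not depend on the data, which is the property MCAR requires.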
To evaluate the effectiveness of the ITR approach, we analyze each event log individually. Let $\sigma_i$ denote an original trace in the log, $\lambda_i$ the incomplete trace derived from $\sigma_i$, and $\lambda'_i$ the repaired trace obtained by applying the repair method to $\lambda_i$ [8]. Because $\sigma_i$ is the complete trace with the smallest edit distance to $\lambda_i$, it serves as the benchmark for evaluation. If $\lambda'_i$ is equal to $\sigma_i$, the repair is deemed correct ($a_i = 1$); otherwise, it is considered incorrect ($a_i = 0$). Accuracy measures the effectiveness of the incomplete trace repair method and is calculated as follows:
$$\mathrm{accuracy} = \frac{\sum_{i=1}^{n} a_i}{n}$$
where $n$ is the number of repaired incomplete traces.
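The accuracy metric amounts to an exact-match rate between original and repaired traces. A minimal Python sketch follows; the function name `repair_accuracy` and the toy traces are hypothetical and serve only to illustrate the formula.

```python
def repair_accuracy(original_traces, repaired_traces):
    """Fraction of repaired traces that are identical to their original trace."""
    if len(original_traces) != len(repaired_traces):
        raise ValueError("both lists must contain the same number of traces")
    if not original_traces:
        raise ValueError("at least one repaired trace is required")
    # a_i is 1 when the repaired trace equals the original trace, else 0.
    correct = sum(
        1
        for original, repaired in zip(original_traces, repaired_traces)
        if list(original) == list(repaired)
    )
    return correct / len(original_traces)


# Hypothetical example: two of the three repairs match the original exactly.
originals = [["a", "b", "c"], ["a", "c"], ["b"]]
repaired = [["a", "b", "c"], ["a", "b"], ["b"]]
score = repair_accuracy(originals, repaired)
```

Note that the metric is strict: a repair that recovers most but not all missing events of a trace still counts as incorrect.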
In addition to the accuracy metric, the runtime overhead of each repair method is recorded to facilitate a comprehensive comparison of their computational efficiency.

4.1. Benchmark Approaches

To validate the accuracy and efficiency of the ITR method and assess its effectiveness in enhancing log quality, a series of comparative experiments was conducted. Because implementations of most approaches in the existing literature are unavailable, direct comparison with them was not feasible. Consequently, we selected the following three existing incomplete trace repair methods for comparison:
1. Autoencoder (AE) [36]: AE is a type of neural network designed to compress data into a low-dimensional representation and then reconstruct it back to its original dimension, minimizing the reconstruction error in the process to extract the important features of the data.
2. Variational autoencoder (VAE) [37]: VAE is a generative model that learns the latent probabilistic distribution of the data by imposing distributional constraints within the encoder. It subsequently generates new data that closely resemble the input data through the decoder.
3. Long short-term memory autoencoder (LAE) [38]: This method leverages the strengths of long short-term memory (LSTM) networks to handle sequential data by embedding them within an autoencoder architecture. This integration is specifically designed to capture the temporal dependencies inherent in activity sequences in event logs. Consequently, it enhances the model’s ability to learn and accurately reconstruct these sequences, improving the quality of trace restoration in the context of event log analysis.

4.2. Accuracy Results of Incomplete Trace Repair

Figure 7 illustrates the repair accuracy of the ITR approach in comparison to three other approaches across different event logs. The blue solid line with circular markers represents the performance of the ITR approach. The results demonstrate that the ITR approach outperforms the others at low missing rates (20% and 30%), as shown in Figure 7a–c. This highlights its efficiency and reliability in scenarios where the proportion of missing activities is relatively small.
As illustrated in Figure 7g–i, the relatively simple underlying process structures of these event logs enable the Inductive Miner to automatically discover accurate process models, thereby significantly enhancing trace repair effectiveness. Even under conditions of high missing rates, the repair accuracy for both Small_log and Large_log remains at 1.0. This result indicates that, for logs characterized by strong regularity and minimal unobservable transitions, the proposed method can consistently maintain a stable performance. These findings fully demonstrate the robustness and advantages of the ITR approach in effectively handling such event logs.
Notably, in Figure 7d–f, the ITR approach slightly lags behind the AE and VAE approaches. Nevertheless, it maintains a repair accuracy exceeding 0.8 at lower missing rates, demonstrating its capacity to effectively identify and repair incomplete traces in logs.
As observed in Figure 7a–f, the accuracy of the ITR approach declines significantly at higher missing rates. This can be attributed to the presence of numerous unobservable transitions in process models mined using the Inductive Miner (IM) algorithm. Since the ITR approach adheres to the principle of minimal change, it tends to favor these unobservable transitions, thereby affecting repair accuracy. It is important to note that the primary focus of this paper is the development of an effective approach for repairing incomplete traces rather than optimizing the discovered process models.

4.3. Time Performance of Incomplete Trace Repair

The time performance of the approaches compared on different event logs is illustrated in Figure 8, where the first column represents the time efficiency of the ITR approach. To facilitate a better visualization of the time overhead values, a logarithmic transformation is applied to the time cost data due to the wide range of values observed.
As shown in Figure 8, the ITR approach demonstrates superior time efficiency across all event logs when compared to the other three approaches. Specifically, its time overhead is significantly lower, achieving improvements in execution speed of 2 to 3 orders of magnitude. This highlights the ITR approach’s capability to handle large-scale logs with remarkable computational efficiency and robustness, making it highly suitable for scenarios where time performance is critical.
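The phrase "orders of magnitude" and the logarithmic axis are two views of the same quantity: the base-10 logarithm of a runtime ratio. The short Python sketch below illustrates this. The two runtimes are placeholder values chosen for the arithmetic only; they are not measurements from Figure 8, and the function name is hypothetical.

```python
import math


def orders_of_magnitude(baseline_seconds, itr_seconds):
    """Speed-up expressed as log10 of the runtime ratio (baseline / ITR)."""
    if baseline_seconds <= 0 or itr_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return math.log10(baseline_seconds / itr_seconds)


# Placeholder runtimes (NOT measured values): a 1000-fold ratio.
placeholder_baseline = 2000.0
placeholder_itr = 2.0
gap = orders_of_magnitude(placeholder_baseline, placeholder_itr)
```

On a log-scaled plot, this value is simply the vertical distance between the two bars, which is why the transformation makes large runtime gaps easy to read.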
Additionally, the ITR approach exhibits highly stable performance. Its runtime remains nearly constant regardless of the missing rate, suggesting that the ITR approach completes the repair task at a consistent speed, independent of the number of missing events within the logs.

5. Discussion

This section provides a comprehensive discussion of the ITR approach in relation to existing incomplete trace repair approaches, analyzing the experimental results to highlight both its strengths and limitations.

5.1. Comparison with Existing Approaches

As reviewed in Section 2, the existing approaches for repairing incomplete traces can be classified into four major categories: process-model-based, interpolation-based, neural-network-based, and trace-clustering-based approaches. Each category possesses distinct advantages as well as inherent weaknesses.
Process-model-based approaches, which predominantly utilize Petri nets or related models, ensure logical consistency with business constraints. However, these approaches often encounter efficiency bottlenecks, especially when addressing complex or nested loop structures. Despite enhancements such as process decomposition and trace segmentation, these techniques are susceptible to traversing irrelevant branches and suffering from state space explosion, thereby limiting their scalability and practical applicability.
Interpolation-based approaches leverage probabilistic reasoning or dependencies among log attributes to estimate missing activities. While potentially effective in data-rich environments, their performance declines significantly when confronted with sparse, noisy, or weakly correlated log attributes, resulting in reduced repair quality.
Neural-network-based approaches exploit the representational power of deep learning to capture intricate behavioral dependencies within event logs. Although these approaches have demonstrated high predictive accuracy in certain scenarios, they typically require prior knowledge of missing event positions or substantial manual annotation during training. This requirement hinders their adaptability and widespread deployment in dynamic or real-world business settings.
Trace-clustering-based approaches operate by identifying similarities among traces to guide the repair process. Such approaches perform adequately on regular, well-structured event logs with limited anomalies; however, their effectiveness diminishes rapidly under conditions of high noise levels or extensive missingness.
The ITR approach effectively addresses several critical challenges inherent in the above approaches. By adopting process trees instead of Petri nets, ITR circumvents the complexity and inefficiency associated with traditional alignment techniques. The integration of process tree decomposition and branch indexing facilitates efficient localization of candidate repair paths, while single-traversal repair of loop nodes ensures both computational speed and repair accuracy, particularly in the context of complex and deeply nested loops.
Furthermore, compared to neural-network-based approaches, ITR eliminates the need for prior information regarding missing event positions, thereby enhancing interpretability and reducing domain-specific requirements.

5.2. Experimental Results and Limitations

The experimental results demonstrate that ITR consistently outperforms classical-model-based approaches in terms of both repair accuracy and computational efficiency, especially at low-to-moderate missing rates. On event logs characterized by strong regularity and a clear process structure, ITR maintains stable and high repair accuracy even as the missing rate increases, underscoring its robustness against moderate data loss.
When benchmarked against neural-network-based approaches such as AE and VAE, ITR occasionally exhibits slightly lower accuracy on specific event logs. Nevertheless, it achieves repair accuracies exceeding 0.8 at lower missing rates, indicating that, although deep learning approaches may excel in capturing subtle temporal dependencies under ideal data conditions, ITR offers a strong alternative that is less demanding in terms of data preparation and parameter tuning.
From a computational perspective, ITR demonstrates significant performance advantages, achieving runtime improvements of 2 to 3 orders of magnitude over the baseline approaches. Moreover, the execution time of ITR remains nearly invariant across different missing rates, which highlights its scalability and suitability for handling large-scale event log repair scenarios.
Despite these advantages, the current ITR approach primarily focuses on repairing missing activities, and its effectiveness in handling attribute-level incompleteness is still limited. In addition, further research is needed to optimize algorithmic performance when faced with extremely complex or massive event logs.
In summary, the ITR approach presents a robust and efficient solution for incomplete trace repair, outperforming many representative existing approaches, while also offering clear directions for future enhancement and application in diverse business environments.

6. Conclusions

The presence of incomplete traces significantly compromises the accuracy and reliability of process mining results, hindering the identification of potential bottlenecks during process analysis and adversely impacting overall business efficiency. To address this challenge, this paper proposes a novel process tree-based incomplete trace repair approach that systematically mitigates the negative impact of incomplete traces on the accuracy and reliability of process mining. Our approach leverages process tree decomposition, innovative branch indexing techniques, and comprehensive statistical analysis of activity information to efficiently identify repair candidates and precisely determine the execution counts of loop nodes, thus enabling a one-pass repair of complex and deeply nested loops.
In contrast to existing approaches, the ITR approach not only enhances the effectiveness of trace repair but also substantially reduces computational overhead. Extensive experimental evaluations demonstrate that our approach achieves over 70% repair accuracy at low missing rates and delivers near-perfect results on simple logs. Furthermore, when applied to large-scale event logs, the ITR approach consistently maintains high repair accuracy while achieving efficiency improvements of 2 to 3 orders of magnitude compared to traditional methods. These results highlight the strong robustness, scalability, and practical applicability of our approach, establishing it as a significant advancement in the field of process mining.
Our approach still exhibits certain limitations. Specifically, the ITR approach primarily addresses the repair of missing activities, while its capability in handling attribute-level incompleteness remains limited. Moreover, when dealing with extremely complex or large-scale event logs, there is still potential for further optimization in algorithmic performance and scalability. In future work, we plan to integrate statistical techniques with machine learning approaches, thereby enhancing the intelligence and adaptability of our approach and expanding its applicability to more challenging cases, such as attribute-level missing data. Additionally, we will investigate the combination of multiple log repair strategies to exploit their respective advantages, further improving the overall effectiveness, robustness, and flexibility of our solution in diverse real-world scenarios.

Author Contributions

Conceptualization, L.Z.; Data curation, H.Z.; Investigation, C.L.; Writing—original draft, Q.W.; Writing—review & editing, L.Z. and N.G.; Methodology, Q.W.; Resources, R.C.; Supervision, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used to evaluate the proposed method are publicly available; please refer to notes for links to access the datasets.

Conflicts of Interest

Author Haijun Zhang was employed by the company Jinan Inspur Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Liu, W.; Liu, C.; Wang, L.; Wen, L.; Zeng, Q. The token replay-based object-centric process conformance checking method. Comput. Integr. Manuf. Syst. 2025, 1–20.
  2. Na, G.; Liu, C.; Li, C.; Ouyang, C.; Ni, W.; Zeng, Q. Causal inference-based root cause analysis method for business process time anomalies. Comput. Integr. Manuf. Syst. 2024, 1–17.
  3. Ding, L.; Chen, S.; Rundensteiner, E.A.; Tatemura, J.; Hsiung, W.P.; Candan, K.S. Runtime semantic query optimization for event stream processing. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, Cancun, Mexico, 7–12 April 2008.
  4. Bezerra, F.; Wainer, J. Algorithms for anomaly detection of traces in logs of process aware information systems. Inf. Syst. 2013, 38, 33–44.
  5. Sun, P.; Liu, Z.; Davidson, S.B.; Chen, Y. Detecting and resolving unsound workflow views for correct provenance analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, Providence, RI, USA, 29 June–2 July 2009.
  6. Wang, J.; Song, S.; Zhu, X.; Lin, X. Efficient recovery of missing events. Proc. VLDB Endow. 2013, 6, 841–852.
  7. Li, H.; Liu, C.; Zhang, Z.; Shen, X.; Mo, Q.; Zeng, Q. Cross-organizational business process conformance checking and anomaly behavior diagnosis Approach. Comput. Integr. Manuf. Syst. 2024, 1–17.
  8. Li, T.; Liu, C.; Xu, X.; Zhang, S.; Wen, L.; Lin, L.; Zeng, Q. The quality evaluation framework for business process concept drift detection algorithms. Comput. Integr. Manuf. Syst. 2024, 30, 2722–2734.
  9. Song, S.; Zhang, A. IoT data quality. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Seattle, WA, USA, 6–11 August 2020.
  10. Song, W.; Xia, X.; Jacobsen, H.A.; Zhang, P.; Hu, H. Efficient alignment between event logs and process models. IEEE Trans. Serv. Comput. 2016, 10, 136–149.
  11. Song, W.; Xia, X.; Jacobsen, H.A.; Zhang, P.; Hu, H. Heuristic recovery of missing events in process logs. In Proceedings of the 2015 IEEE International Conference on Web Services, New York, NY, USA, 27 June–2 July 2015.
  12. Song, W.; Jacobsen, H.A.; Zhang, P. Self-healing event logs. IEEE Trans. Knowl. Data Eng. 2019, 33, 2750–2763.
  13. Rogge-Solti, A.; Mans, R.S.; van der Aalst, W.M.; Weske, M. Improving documentation by repairing event logs. In Proceedings of The Practice of Enterprise Modeling: 6th IFIP WG 8.1 Working Conference, Riga, Latvia, 6–7 November 2013.
  14. Sim, S.; Bae, H.; Choi, Y. Likelihood-based multiple imputation by event chain methodology for repair of imperfect event logs with missing data. In Proceedings of the 2019 International Conference on Process Mining, Aachen, Germany, 24–26 June 2019.
  15. Sim, S.; Bae, H.; Liu, L. Bagging recurrent event imputation for repair of imperfect event log with missing categorical events. IEEE Trans. Serv. Comput. 2021, 16, 108–121.
  16. Nguyen, H.T.C.; Lee, S.; Kim, J.; Ko, J.; Comuzzi, M. Autoencoders for improving quality of process event logs. Expert Syst. Appl. 2019, 131, 132–147.
  17. Nguyen, H.T.C.; Comuzzi, M. Event log reconstruction using autoencoders. In Proceedings of the Service-Oriented Computing–ICSOC 2018 Workshops, Hangzhou, China, 12–15 November 2018.
  18. Lu, Y.; Chen, Q.; Poon, S.K. A deep learning approach for repairing missing activity labels in event logs for process mining. Information 2022, 13, 234.
  19. Fang, H.; Li, B. A Multi-Perspective and Interpretable Log Repairing Method Based on Two-Level Attention and Weak Behavioral Profiles. 2024. Available online: https://ssrn.com/abstract=4798515 (accessed on 2 May 2025).
  20. Wu, P.; Fang, X.; Fang, H.; Gong, Z.; Kan, D. An Event Log Repair Method Based on Masked Transformer Model. Appl. Artif. Intell. 2024, 38, 2346059.
  21. Liu, J.; Xu, J.; Zhang, R.; Reiff-Marganiec, S. A repairing missing activities approach with succession relation for event logs. Knowl. Inf. Syst. 2021, 63, 477–495.
  22. Fani Sani, M.; van Zelst, S.J.; van der Aalst, W.M. Repairing outlier behaviour in event logs. In Proceedings of the Business Information Systems: 21st International Conference, Berlin, Germany, 18–20 July 2018.
  23. Xu, J.; Liu, J. A profile clustering based event logs repairing approach for process mining. IEEE Access 2019, 7, 17872–17881.
  24. Fang, H.; Su, W. Log Clustering-based Method for Repairing Missing Traces with Context Probability Information. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 1445–1452.
  25. Van der Aalst, W.; Weijters, T.; Maruster, L. Workflow mining: Discovering process models from event logs. IEEE Trans. Knowl. Data Eng. 2004, 16, 1128–1142.
  26. Weijters, A.J.M.M.; Ribeiro, J.T.S. Flexible heuristics miner (FHM). In Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining, Paris, France, 11–15 April 2011.
  27. Ghionna, L.; Greco, G.; Guzzo, A.; Pontieri, L. Outlier detection techniques for process mining applications. In Proceedings of the Foundations of Intelligent Systems: 17th International Symposium, Toronto, ON, Canada, 20–23 May 2008.
  28. Suriadi, S.; Andrews, R.; ter Hofstede, A.H.; Wynn, M.T. Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs. Inf. Syst. 2017, 64, 132–150.
  29. Bernard, G.; Andritsos, P. Truncated trace classifier. Removal of incomplete traces from event logs. In Proceedings of the Enterprise, Business-Process and Information Systems Modeling: 21st International Conference, Grenoble, France, 8–9 June 2020.
  30. Leemans, S.J.; Fahland, D.; Van Der Aalst, W.M. Discovering block-structured process models from event logs-a constructive approach. In Proceedings of the International Conference on Applications and Theory of Petri Nets and Concurrency, Milan, Italy, 24–28 June 2013.
  31. Yan, J.; Liu, C.; Su, X.; Wen, L.; Cao, J.; Zeng, Q. Inductive model mining method based on local process structure optimization. Comput. Integr. Manuf. Syst. 2025, 1–17.
  32. Shen, X.; Liu, C.; Li, H.; Zheng, K.; Cheng, L.; Zeng, Q. Distributed compliance checking method based on process model decomposition. Comput. Integr. Manuf. Syst. 2024, 30, 2884–2896.
  33. Bohannon, P.; Fan, W.; Flaster, M.; Rastogi, R. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 14–16 June 2005.
  34. Song, S.; Cheng, H.; Yu, J.X.; Chen, L. Repairing vertex labels under neighborhood constraints. Proc. VLDB Endow. 2014, 7, 987–998.
  35. Schafer, J.L.; Graham, J.W. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147.
  36. Ehrhardt, J.; Wilms, M. Autoencoders and variational autoencoders in medical image analysis. In Biomedical Image Synthesis and Simulation; Academic Press: Cambridge, MA, USA, 2022; pp. 129–162.
  37. Pinheiro Cinelli, L.; Araújo Marins, M.; Barros da Silva, E.A.; Lima Netto, S. Variational autoencoder. In Variational Methods for Machine Learning with Applications to Deep Networks; Springer International Publishing: Cham, Switzerland, 2021; pp. 111–149.
  38. Pu, Y.; Gan, Z.; Henao, R.; Yuan, X.; Li, C.; Stevens, A.; Carin, L. Variational autoencoder for deep learning of images, labels and captions. Adv. Neural Inf. Process. Syst. 2016, 29, 35.
Figure 1. Framework for incomplete trace repair approach.
Figure 2. An example of a process tree.
Figure 3. Decomposition of the process tree. (a) Original process tree; (b,c) are two subprocess trees obtained through the process tree decomposition method.
Figure 4. Process of branch indexing. (a,b) are two branches of v1.2, where (a) is pruned due to not meeting the repair conditions, while (b) is selected for repair.
Figure 5. Repair process for sequential, concurrent, and choice nodes.
Figure 6. Repair process for loop node.
Figure 7. Accuracy of incomplete trace repair approaches.
Figure 8. Runtime of incomplete trace repair approaches.
Table 1. Basic information of event logs.
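| Dataset | Traces | Events | Activities | Maximum trace length | Minimum trace length |
|---|---|---|---|---|---|
| BPIC_12 | 13,087 | 164,506 | 23 | 96 | 3 |
| BPIC_13 | 7554 | 65,533 | 13 | 123 | 1 |
| BPIC_13_CP | 1487 | 6660 | 7 | 35 | 1 |
| Helpdesk | 3804 | 13,730 | 9 | 14 | 1 |
| Syn1 | 10,505 | 294,140 | 16 | 28 | 28 |
| Syn2 | 200 | 5778 | 20 | 253 | 5 |
| Syn3 | 1000 | 7365 | 18 | 10 | 5 |
| Small_log | 2000 | 28,000 | 14 | 14 | 14 |
| Large_log | 15,000 | 120,000 | 10 | 8 | 8 |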
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, Q.; Zhang, L.; Cao, R.; Guo, N.; Zhang, H.; Liu, C. A Process Tree-Based Incomplete Event Log Repair Approach. Information 2025, 16, 390. https://doi.org/10.3390/info16050390

AMA Style

Wang Q, Zhang L, Cao R, Guo N, Zhang H, Liu C. A Process Tree-Based Incomplete Event Log Repair Approach. Information. 2025; 16(5):390. https://doi.org/10.3390/info16050390

Chicago/Turabian Style

Wang, Qiushi, Liye Zhang, Rui Cao, Na Guo, Haijun Zhang, and Cong Liu. 2025. "A Process Tree-Based Incomplete Event Log Repair Approach" Information 16, no. 5: 390. https://doi.org/10.3390/info16050390

APA Style

Wang, Q., Zhang, L., Cao, R., Guo, N., Zhang, H., & Liu, C. (2025). A Process Tree-Based Incomplete Event Log Repair Approach. Information, 16(5), 390. https://doi.org/10.3390/info16050390